
Using volumes to optimize cold starts on RunPod

Deploying custom models in the cloud often comes with the challenge of cold starts: the time it takes to provision a new instance and download the model. This is especially relevant for services with autoscaling, where new replicas need to come online quickly.

Let's explore how dstack optimizes this process using volumes, with an example of deploying a model on RunPod.

Suppose you want to deploy Llama 3.1 on RunPod as a service:

type: service
name: llama31-service-tgi

replicas: 1..2
scaling:
  metric: rps
  target: 30

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - text-generation-launcher
port: 80

spot_policy: auto

resources:
  gpu: 24GB

model: meta-llama/Meta-Llama-3.1-8B-Instruct

When you run dstack apply, it creates a public endpoint with one service replica. dstack will then automatically scale the service by adjusting the number of replicas based on traffic.
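Once the endpoint is up, you can send it an OpenAI-compatible request, since TGI exposes /v1/chat/completions. The hostname and token below are placeholders for your gateway domain and dstack token:

$ curl https://llama31-service-tgi.example.com/v1/chat/completions \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "messages": [{"role": "user", "content": "What is a cold start?"}],
      "max_tokens": 128
    }'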

When starting each replica, text-generation-launcher downloads the model to the /data folder. For Llama 3.1 8B, this usually takes under a minute, but larger models may take longer. Repeated downloads can significantly affect auto-scaling efficiency.

Great news: RunPod supports network volumes, which we can use for caching models across multiple replicas.

With dstack, you can create a RunPod volume using the following configuration:

type: volume
name: llama31-volume

backend: runpod
region: EU-SE-1

# Required size
size: 100GB

Go ahead and create it via dstack apply:

$ dstack apply -f examples/misc/volumes/runpod.dstack.yml

Once the volume is created, attach it to your service by updating the configuration file and mapping the volume name to the /data path.

type: service
name: llama31-service-tgi

replicas: 1..2
scaling:
  metric: rps
  target: 30

volumes:
  - name: llama31-volume
    path: /data

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - text-generation-launcher
port: 80

spot_policy: auto

resources:
  gpu: 24GB

model: meta-llama/Meta-Llama-3.1-8B-Instruct
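
After updating the configuration, re-run dstack apply to roll out the change (the file name below is just an example):

$ dstack apply -f llama31-service-tgi.dstack.yml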

In this case, dstack attaches the specified volume to each new replica. The model is downloaded only once, so each subsequent replica starts without re-downloading it; the larger the model, the more cold start time you save.

A notable feature of RunPod is that volumes can be attached to multiple containers simultaneously. This capability is particularly useful for autoscalable services or distributed tasks.

Using volumes not only optimizes inference cold start times but also enhances the efficiency of data and model checkpoint loading during training and fine-tuning. Whether you're running tasks or dev environments, leveraging volumes can significantly streamline your workflow and improve overall performance.
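
As a quick sketch, the same volume could back a fine-tuning task so that datasets and checkpoints persist across runs. The script, dependencies, and output path below are placeholders; note that the task must run in the same backend and region as the volume:

type: task
name: llama31-finetune

volumes:
  - name: llama31-volume
    path: /data

python: "3.11"
env:
  - HF_TOKEN
commands:
  - pip install trl peft
  - python train.py --output_dir /data/checkpoints

resources:
  gpu: 24GB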