Services

Services allow you to deploy models or web apps as secure and scalable endpoints.

Apply a configuration

First, define a service configuration as a YAML file in your project folder. The filename must end with .dstack.yml (e.g., .dstack.yml or dev.dstack.yml).

type: service
name: llama31

# If `image` is not specified, dstack uses its default image
python: 3.12
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - uv pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# (Optional) Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB

To run a service, pass the configuration to dstack apply:

$ HF_TOKEN=...
$ dstack apply -f .dstack.yml

 #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33

Submit the run llama31? [y/n]: y

Provisioning...
---> 100%

Service is published at: 
  http://localhost:3000/proxy/services/main/llama31/
Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
  http://localhost:3000/proxy/models/main/

dstack apply automatically provisions instances, uploads the contents of the repo (incl. your local uncommitted changes), and runs the service.

If a gateway is not configured, the service’s endpoint will be accessible at <dstack server URL>/proxy/services/<project name>/<run name>/.

$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming."
            }
        ]
    }'

If the service defines the model property, the model can be accessed with the global OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/, or via dstack UI.

If authorization is not disabled, the service endpoint requires the Authorization header with Bearer <dstack token>.
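For example, a chat completion request to the global model endpoint might look like this (a hedged sketch, assuming the default server URL http://localhost:3000 and the main project; since the endpoint is OpenAI-compatible, the chat/completions path is appended to the base URL shown above):

$ curl http://localhost:3000/proxy/models/main/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'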

Gateway

Running services for development purposes doesn’t require setting up a gateway.

However, you'll need a gateway in the following cases:

  • To use auto-scaling or rate limits
  • To enable HTTPS for the endpoint and map it to your domain
  • If your service requires WebSockets
  • If your service cannot work with a path prefix

Note that if you're using dstack Sky, a gateway is already pre-configured for you.

If a gateway is configured, the service endpoint will be accessible at https://<run name>.<gateway domain>/.

If the service defines the model property, the model will be available via the global OpenAI-compatible endpoint at https://gateway.<gateway domain>/.
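For reference, a minimal gateway configuration might look like the following (a sketch only; the name, backend, region, and domain are placeholders to replace with your own):

type: gateway
name: example-gateway

# The backend and region where the gateway instance is provisioned
backend: aws
region: eu-west-1
# The domain under which services are served
domain: example.com

After applying it with dstack apply, point a (typically wildcard) DNS record for the domain at the gateway's address so that service endpoints resolve.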

Configuration options

Replicas and scaling

By default, dstack runs a single replica of the service. You can configure the number of replicas as well as the auto-scaling rules.

type: service
name: llama31-service

python: 3.12

env:
  - HF_TOKEN
commands:
  - uv pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
port: 8000

resources:
  gpu: 24GB

replicas: 1..4
scaling:
  # Requests per second
  metric: rps
  # Target metric value
  target: 10

The replicas property can be a number or a range.

The metric property of scaling currently supports only the rps metric (requests per second). In this case, dstack automatically adjusts the number of replicas (scales up or down) based on the load.

Setting the minimum number of replicas to 0 allows the service to scale down to zero when there are no requests.
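For example, the replicas range in the configuration above could be changed as follows to allow scale-to-zero (a minimal variant; the other properties stay the same):

replicas: 0..4
scaling:
  # Requests per second
  metric: rps
  # Target metric value
  target: 10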

The scaling property requires creating a gateway.

Model

If the service is running a chat model with an OpenAI-compatible interface, set the model property to make the model accessible via dstack's global OpenAI-compatible endpoint, and also accessible via dstack's UI.

Authorization

By default, the service enables authorization, meaning the service endpoint requires a dstack user token. This can be disabled by setting auth to false.

type: service
name: http-server-service

# Disable authorization
auth: false

python: 3.12

commands:
  - python3 -m http.server
port: 8000

Path prefix

If your dstack project doesn't have a gateway, services are hosted with the /proxy/services/<project name>/<run name>/ path prefix in the URL. When running web apps, you may need to set some app-specific settings so that browser-side scripts and CSS work correctly with the path prefix.

type: service
name: dash
gateway: false

auth: false
# Do not strip the path prefix
strip_prefix: false

env:
  # Configure Dash to work with a path prefix
  # Replace `main` with your dstack project name
  - DASH_ROUTES_PATHNAME_PREFIX=/proxy/services/main/dash/

commands:
  - uv pip install dash
  # Assuming the Dash app is in your repo at app.py
  - python app.py

port: 8050

By default, dstack strips the prefix before forwarding requests to your service, so to the service it appears as if the prefix isn't there. This allows some apps to work out of the box. If your app doesn't expect the prefix to be stripped, set strip_prefix to false.

If your app cannot be configured to work with a path prefix, you can host it on a dedicated domain name by setting up a gateway.

Rate limits

If you have a gateway, you can configure rate limits for your service using the rate_limits property.

type: service
image: my-app:latest
port: 80

rate_limits:
# For /api/auth/* - 1 request per second, no bursts
- prefix: /api/auth/
  rps: 1
# For other URLs - 4 requests per second + bursts of up to 9 requests
- rps: 4
  burst: 9

The rps limit sets the max requests per second, tracked in milliseconds (e.g., rps: 4 means 1 request every 250 ms). Use burst to allow short spikes while keeping the average within rps.

Limits apply to the whole service (all replicas) and per client (by IP). Clients exceeding the limit get a 429 error.

Partitioning key

Instead of partitioning requests by client IP address, you can choose to partition by the value of a header.

type: service
image: my-app:latest
port: 80

rate_limits:
- rps: 4
  burst: 9
  # Apply to each user, as determined by the `Authorization` header
  key:
    type: header
    header: Authorization

Resources

When specifying memory size, you can set an explicit size (e.g. 24GB) or a range (e.g. 24GB.., 24GB..80GB, or ..80GB).

type: service
name: llama31-service

python: 3.12
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - uv pip install vllm
  - |
    vllm serve $MODEL_ID
      --max-model-len $MAX_MODEL_LEN
      --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000

resources:
  # 16 or more x86_64 cores
  cpu: 16..
  # 2 GPUs of 80GB
  gpu: 80GB:2

  # Minimum disk size
  disk: 200GB

The cpu property lets you set the architecture (x86 or arm) and core count — e.g., x86:16 (16 x86 cores), arm:8.. (at least 8 ARM cores). If not set, dstack infers it from the GPU or defaults to x86.

The gpu property lets you specify vendor, model, memory, and count — e.g., nvidia (one NVIDIA GPU), A100 (one A100), A10G,A100 (either), A100:80GB (one 80GB A100), A100:2 (two A100), 24GB..40GB:2 (two GPUs with 24–40GB), A100:40GB:2 (two 40GB A100s).

If vendor is omitted, dstack infers it from the model or defaults to nvidia.

Shared memory

If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure shm_size, e.g. set it to 16GB.
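For example (a minimal sketch; the sizes are illustrative):

resources:
  gpu: 24GB
  # Increase the shared memory size
  shm_size: 16GB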

If you’re unsure which offers (hardware configurations) are available from the configured backends, use the dstack offer command to list them.

Docker

Default image

If you don't specify image, dstack uses its base Docker image pre-configured with uv, python, pip, essential CUDA drivers, and NCCL tests (under /opt/nccl-tests/build).

Set the python property to pre-install a specific version of Python.

type: service
name: http-server-service    

python: 3.12

commands:
  - python3 -m http.server
port: 8000

NVCC

By default, the base Docker image doesn’t include nvcc, which is required for building custom CUDA kernels. If you need nvcc, set the nvcc property to true.

type: service
name: http-server-service    

python: 3.12
nvcc: true

commands:
  - python3 -m http.server
port: 8000

Custom image

If you want, you can specify your own Docker image via image.

type: service
name: http-server-service

image: python

commands:
  - python3 -m http.server
port: 8000

Docker in Docker

Set docker to true to enable the docker CLI in your service, e.g., to run Docker images or use Docker Compose.

type: service
name: chat-ui-task

auth: false

docker: true

working_dir: examples/misc/docker-compose
commands:
  - docker compose up
port: 9000

Cannot be used with python or image. Not supported on runpod, vastai, or kubernetes.

Privileged mode

To enable privileged mode, set privileged to true.

Not supported on runpod, vastai, or kubernetes.
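For example (a minimal sketch based on the http-server configuration above):

type: service
name: http-server-service

# Run the container in privileged mode
privileged: true

python: 3.12

commands:
  - python3 -m http.server
port: 8000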

Private registry

Use the registry_auth property to provide credentials for a private Docker registry.

type: service
name: serve-distill-deepseek

env:
  - NGC_API_KEY
  - NIM_MAX_MODEL_LEN=4096

image: nvcr.io/nim/deepseek-ai/deepseek-r1-distill-llama-8b
registry_auth:
  username: $oauthtoken
  password: ${{ env.NGC_API_KEY }}
port: 8000

model: deepseek-ai/deepseek-r1-distill-llama-8b

resources:
  gpu: H100:1

Environment variables

type: service
name: llama-2-7b-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

If you don't assign a value to an environment variable (see HF_TOKEN above), dstack will require the value to be passed via the CLI or set in the current process.

System environment variables

The following environment variables are available in any run by default:

Name               Description
DSTACK_RUN_NAME    The name of the run
DSTACK_REPO_ID     The ID of the repo
DSTACK_GPUS_NUM    The total number of GPUs in the run

Files

By default, dstack automatically mounts the repo directory where you ran dstack init to any run configuration.

However, in some cases, you may not want to mount the entire directory (e.g., if it’s too large), or you might want to mount files outside of it. In such cases, you can use the files property.

type: service
name: llama-2-7b-service

files:
  - .:examples  # Maps the directory where `.dstack.yml` is located to `/workflow/examples`
  - ~/.ssh/id_rsa:/root/.ssh/id_rsa  # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

Each entry maps a local directory or file to a path inside the container. Both local and container paths can be relative or absolute.

  • If the local path is relative, it’s resolved relative to the configuration file.
  • If the container path is relative, it’s resolved relative to /workflow.

The container path is optional. If not specified, it will be automatically calculated.

type: service
name: llama-2-7b-service

files:
  - ../examples  # Maps `examples` (the parent directory of `.dstack.yml`) to `/workflow/examples`
  - ~/.ssh/id_rsa  # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

Note: If you want to use files without mounting the entire repo directory, make sure to pass --no-repo when running dstack apply:

$ dstack apply -f examples/.dstack.yml --no-repo

.gitignore and .dstackignore

dstack automatically excludes files and folders listed in .gitignore and .dstackignore.

Uploads are limited to 2MB. To avoid exceeding this limit, make sure to exclude unnecessary files. You can increase the default server limit by setting the DSTACK_SERVER_CODE_UPLOAD_LIMIT environment variable.
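For example, a .dstackignore excluding large local artifacts might look like this (an illustrative sketch, assuming gitignore-style patterns; the paths are placeholders):

.venv/
data/
*.ckpt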

Experimental

The files feature is experimental. Feedback is highly appreciated.

Retry policy

By default, if dstack can't find capacity, or the service exits with an error, or the instance is interrupted, the run will fail.

If you'd like dstack to automatically retry, configure the retry property accordingly:

type: service
image: my-app:latest
port: 80

retry:
  on_events: [no-capacity, error, interruption]
  # Retry for up to 1 hour
  duration: 1h

If one replica of a multi-replica service fails with retry enabled, dstack will resubmit only the failed replica while keeping active replicas running.

Spot policy

By default, dstack uses on-demand instances. However, you can change that via the spot_policy property. It accepts spot, on-demand, and auto.
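For example (a minimal sketch):

type: service
image: my-app:latest
port: 80

# Use spot instances when available, falling back to on-demand
spot_policy: auto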

Utilization policy

Sometimes it’s useful to track whether a service is fully utilizing all GPUs. While you can check this with dstack metrics, dstack also lets you set a policy to auto-terminate the run if any GPU is underutilized.

Below is an example of a service that auto-terminates if any GPU stays below 10% utilization for 1 hour.

type: service
name: llama-2-7b-service

python: 3.12
env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

utilization_policy:
  min_gpu_utilization: 10
  time_window: 1h

Creation policy

By default, when you run dstack apply with a dev environment, task, or service, if no idle instances from the available fleets meet the requirements, dstack creates a new fleet using configured backends.

To ensure dstack apply doesn't create a new fleet but reuses an existing one, pass -R (or --reuse) to dstack apply.

$ dstack apply -R -f examples/.dstack.yml

Or, set creation_policy to reuse in the run configuration.
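For example (a minimal sketch):

type: service
image: my-app:latest
port: 80

# Only reuse existing instances; never create a new fleet
creation_policy: reuse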

Idle duration

If a fleet is created automatically, it stays idle for 5 minutes by default and can be reused within that time. If the fleet is not reused within this period, it is automatically terminated. To change the default idle duration, set idle_duration in the run configuration (e.g., 0s, 1m, or off for unlimited).
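For example, to keep an automatically created fleet idle for up to one hour (a minimal sketch):

type: service
image: my-app:latest
port: 80

idle_duration: 1h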

Fleets

For greater control over fleet provisioning, it is recommended to create fleets explicitly.

Reference

Services support many more configuration options, incl. backends, regions, and max_price, among others.

Manage runs

dstack provides several commands to manage runs:

  • dstack ps – Lists all running jobs and their statuses. Use --watch (or -w) to monitor the live status of runs.
  • dstack stop – Stops a run gracefully. Pass --abort or -x to stop it immediately without waiting for a graceful shutdown. By default, a run runs until you stop it or its lifetime exceeds the value of max_duration.
  • dstack attach – By default, dstack apply runs in attached mode, establishing an SSH tunnel to the run, forwarding ports, and displaying real-time logs. If you detach from a run, use this command to reattach.
  • dstack logs – Displays run logs. Pass --diagnose or -d to view diagnostic logs, which can help troubleshoot failed runs.
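For example (assuming a run named llama31, as in the configuration at the top of this page):

$ dstack ps --watch
$ dstack stop llama31
$ dstack logs --diagnose llama31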

What's next?

  1. Read about dev environments, tasks, and repos
  2. Learn how to manage fleets
  3. See how to set up gateways
  4. Check the TGI, vLLM, and NIM examples