Services¶
Services allow you to deploy models or web apps as secure and scalable endpoints.
Define a configuration¶
First, define a service configuration as a YAML file in your project folder.
The filename must end with `.dstack.yml` (e.g. `.dstack.yml` or `dev.dstack.yml` are both acceptable).
```yaml
type: service
name: llama31
# If `image` is not specified, dstack uses its default image
python: "3.11"
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# (Optional) Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct
# Uncomment to leverage spot instances
#spot_policy: auto
resources:
  gpu: 24GB
```
Replicas and scaling¶
By default, dstack runs a single replica of the service. You can configure the number of replicas as well as the auto-scaling rules.
```yaml
type: service
# The name is optional, if not specified, generated randomly
name: llama31-service
python: "3.10"
# Required environment variables
env:
  - HF_TOKEN
commands:
  - pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
# Expose the port of the service
port: 8000
resources:
  # Change to what is required
  gpu: 24GB
# Minimum and maximum number of replicas
replicas: 1..4
scaling:
  # Requests per second
  metric: rps
  # Target metric value
  target: 10
```
The `replicas` property can be a number or a range. The `metric` property of `scaling` only supports the `rps` metric (requests per second). In this case, dstack adjusts the number of replicas (scales up or down) automatically based on the load. Setting the minimum number of replicas to `0` allows the service to scale down to zero when there are no requests.
Gateways
The `scaling` property currently requires creating a gateway. This requirement is expected to be removed soon.
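For instance, a minimal sketch of the `replicas` and `scaling` fragment that lets the service scale down to zero when idle (values here are illustrative):

```yaml
# Allow scaling down to zero when there are no requests
replicas: 0..4
scaling:
  metric: rps
  target: 10
```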
Authorization¶
By default, the service enables authorization, meaning the service endpoint requires a dstack user token. This can be disabled by setting `auth` to `false`.
```yaml
type: service
# The name is optional, if not specified, generated randomly
name: http-server-service
# Disable authorization
auth: false
python: "3.10"
# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000
```
Model¶
If the service is running a chat model with an OpenAI-compatible interface, set the `model` property to make the model accessible via dstack's global OpenAI-compatible endpoint, as well as via dstack's UI.
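For example, the first configuration above registers its model with a single property (shown here as a fragment):

```yaml
# (Optional) Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct
```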
Resources¶
If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a range (e.g. `24GB..`, `24GB..80GB`, or `..80GB`).
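For instance, a minimal fragment requesting a GPU memory range rather than an exact size:

```yaml
resources:
  # Any GPU with between 24GB and 80GB of memory
  gpu: 24GB..80GB
```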
```yaml
type: service
# The name is optional, if not specified, generated randomly
name: llama31-service
python: "3.10"
# Commands of the service
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server
    --model mistralai/Mixtral-8X7B-Instruct-v0.1
    --host 0.0.0.0
    --tensor-parallel-size $DSTACK_GPUS_NUM
# Expose the port of the service
port: 8000
resources:
  # 2 GPUs of 80GB
  gpu: 80GB:2
  # Minimum disk size
  disk: 200GB
```
The `gpu` property allows specifying not only memory size but also GPU vendor, names, and their quantity. Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either A10G or A100), `A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs with between 24GB and 40GB), `A100:40GB:2` (two A100 GPUs of 40GB).
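As a sketch of that syntax, a `resources` block requesting two 80GB A100 GPUs:

```yaml
resources:
  # Two A100 GPUs with 80GB of memory each
  gpu: A100:80GB:2
```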
Google Cloud TPU
To use TPUs, specify the TPU architecture via the `gpu` property.
```yaml
type: service
name: llama31-service-optimum-tpu
image: dstackai/optimum-tpu:llama31
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_TOTAL_TOKENS=4096
  - MAX_BATCH_PREFILL_TOKENS=4095
commands:
  - text-generation-launcher --port 8000
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct
resources:
  gpu: v5litepod-4
```
Currently, only 8 TPU cores can be specified, supporting single TPU device workloads. Multi-TPU support is coming soon.
Shared memory
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure `shm_size`, e.g. set it to `16GB`.
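A minimal sketch, assuming `shm_size` is set under `resources`:

```yaml
resources:
  gpu: 24GB
  # Increase shared memory for parallel communicating processes
  shm_size: 16GB
```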
Python version¶
If you don't specify `image`, dstack uses its base Docker image pre-configured with `python`, `pip`, `conda` (Miniforge), and essential CUDA drivers. The `python` property determines which default Docker image is used.
```yaml
type: service
# The name is optional, if not specified, generated randomly
name: http-server-service
# If `image` is not specified, dstack uses its base image
python: "3.10"
# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000
```
nvcc
By default, the base Docker image doesn't include `nvcc`, which is required for building custom CUDA kernels. If you need `nvcc`, set the corresponding property to `true`.
```yaml
type: service
# The name is optional, if not specified, generated randomly
name: http-server-service
# If `image` is not specified, dstack uses its base image
python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention)
nvcc: true
# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000
```
Docker¶
If you want, you can specify your own Docker image via `image`.
```yaml
type: service
# The name is optional, if not specified, generated randomly
name: http-server-service
# Any custom Docker image
image: dstackai/base:py3.13-0.6-cuda-12.1
# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000
```
Private registry
Use the `registry_auth` property to provide credentials for a private Docker registry.
```yaml
type: service
# The name is optional, if not specified, generated randomly
name: http-server-service
# Any private Docker image
image: dstackai/base:py3.13-0.6-cuda-12.1
# Credentials of the private registry
registry_auth:
  username: peterschmidt85
  password: ghp_e49HcZ9oYwBzUbcSk2080gXZOU2hiT9AeSR5
# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000
```
Privileged mode
All backends except `runpod`, `vastai`, and `kubernetes` support running containers in privileged mode. This mode enables features like using Docker and Docker Compose inside dstack runs.
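A minimal sketch, assuming the `privileged` property toggles this mode (the service name and commands here are placeholders):

```yaml
type: service
# A hypothetical name
name: privileged-service
# Run the container in privileged mode
privileged: true
image: dstackai/base:py3.13-0.6-cuda-12.1
commands:
  - python3 -m http.server
port: 8000
```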
Environment variables¶
```yaml
type: service
# The name is optional, if not specified, generated randomly
name: llama-2-7b-service
python: "3.10"
# Environment variables
env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
# Commands of the service
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
# The port of the service
port: 8000
resources:
  # Required GPU vRAM
  gpu: 24GB
```
If you don't assign a value to an environment variable (see `HF_TOKEN` above), dstack will require the value to be passed via the CLI or set in the current process.
System environment variables
The following environment variables are available in any run by default:
| Name | Description |
|---|---|
| `DSTACK_RUN_NAME` | The name of the run |
| `DSTACK_REPO_ID` | The ID of the repo |
| `DSTACK_GPUS_NUM` | The total number of GPUs in the run |
Spot policy¶
By default, dstack uses on-demand instances. However, you can change that via the `spot_policy` property. It accepts `spot`, `on-demand`, and `auto`.
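For example, a minimal fragment that prefers spot capacity:

```yaml
# Prefer spot instances, falling back to on-demand (assumed behavior of `auto`)
spot_policy: auto
```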
Reference
Services support many more configuration options, incl. `backends`, `regions`, and `max_price`, among others.
(Optional) Set up a gateway¶
Running services doesn't require gateways unless you need to enable auto-scaling or want the endpoint to use HTTPS and map it to your domain.
Websockets and base path
A gateway may also be required if the service needs WebSockets or cannot work with a base path.
If you're using dstack Sky, a gateway is already pre-configured for you.
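If you do need one, a minimal sketch of a gateway configuration (the name, backend, region, and domain below are example values, not prescriptions):

```yaml
type: gateway
# A hypothetical gateway name
name: example-gateway
# The backend and region to provision the gateway in (example values)
backend: aws
region: eu-west-1
# The domain to map services to (example value)
domain: example.com
```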
Run a configuration¶
To run a service, pass the configuration to `dstack apply`:
```shell
$ HF_TOKEN=...
$ dstack apply -f service.dstack.yml

 #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33

Submit the run llama31? [y/n]: y

Provisioning...
---> 100%

Service is published at:
  http://localhost:3000/proxy/services/main/llama31/
Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
  http://localhost:3000/proxy/models/main/
```
`dstack apply` automatically provisions instances, uploads the contents of the repo (incl. your local uncommitted changes), and runs the service.
Retry policy¶
By default, if dstack can't find capacity, the service exits with an error, or the instance is interrupted, the run will fail. If you'd like dstack to automatically retry, configure the `retry` property accordingly:
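A minimal sketch of such a fragment, assuming `retry` accepts a list of events to retry on and a maximum duration:

```yaml
retry:
  # Events to retry on (assumed event names)
  on_events: [no-capacity, error, interruption]
  # Keep retrying for up to 1 hour
  duration: 1h
```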
Access the endpoint¶
If a gateway is not configured, the service's endpoint will be accessible at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell
$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming."
          }
        ]
      }'
```
If the service defines the `model` property, the model can be accessed with the global OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`, or via the dstack UI.
Gateway
If a gateway is configured, the service endpoint will be accessible at `https://<run name>.<gateway domain>/`.
If the service defines the `model` property, the model will be available via the global OpenAI-compatible endpoint at `https://gateway.<gateway domain>/`.
If authorization is not disabled, the service endpoint requires the `Authorization` header with `Bearer <dstack token>`.
What's next?