Services¶
Services allow you to deploy models or web apps as secure and scalable endpoints.
Apply a configuration¶
First, define a service configuration as a YAML file in your project folder.
The filename must end with .dstack.yml
(e.g. .dstack.yml
or dev.dstack.yml
are both acceptable).
type: service
name: llama31
# If `image` is not specified, dstack uses its default image
python: 3.12
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
- MAX_MODEL_LEN=4096
commands:
- uv pip install vllm
- vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# (Optional) Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct
# Uncomment to leverage spot instances
#spot_policy: auto
resources:
  gpu: 24GB
To run a service, pass the configuration to dstack apply:
$ HF_TOKEN=...
$ dstack apply -f .dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB:2 yes $0.22
2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB:2 yes $0.22
3 gcp us-west4 27xCPU, 150GB, A5000:24GB:3 yes $0.33
Submit the run llama31? [y/n]: y
Provisioning...
---> 100%
Service is published at:
http://localhost:3000/proxy/services/main/llama31/
Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
http://localhost:3000/proxy/models/main/
dstack apply
automatically provisions instances, uploads the contents of the repo (incl. your local uncommitted changes),
and runs the service.
If a gateway is not configured, the service's endpoint will be accessible at
<dstack server URL>/proxy/services/<project name>/<run name>/.
$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <dstack token>' \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'
If the service defines the model property, the model can be accessed via the global OpenAI-compatible endpoint
at <dstack server URL>/proxy/models/<project name>/, or via the dstack UI.
If authorization is not disabled, the service endpoint requires the Authorization header with Bearer <dstack token>.
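For example, the registered model can be queried through the global endpoint much like the service endpoint above. This is a minimal sketch: it assumes the main project and the model registered by the llama31 run, and that the global endpoint exposes the standard chat/completions path directly under the URL shown above:

$ curl http://localhost:3000/proxy/models/main/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'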
Gateway
Running services for development purposes doesn’t require setting up a gateway.
However, you'll need a gateway in the following cases:
- To use auto-scaling or rate limits
- To enable HTTPS for the endpoint and map it to your domain
- If your service requires WebSockets
- If your service cannot work with a path prefix
Note that if you're using dstack Sky, a gateway is already pre-configured for you.
If a gateway is configured, the service endpoint will be accessible at https://<run name>.<gateway domain>/.
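For example, with example.com standing in for your gateway domain, the llama31 service above could be queried like this; the /v1/chat/completions path comes from vLLM itself, not from dstack:

$ curl https://llama31.example.com/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'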
If the service defines the model property, the model will be available via the global OpenAI-compatible endpoint
at https://gateway.<gateway domain>/.
Configuration options¶
Replicas and scaling¶
By default, dstack
runs a single replica of the service.
You can configure the number of replicas as well as the auto-scaling rules.
type: service
name: llama31-service
python: 3.12
env:
- HF_TOKEN
commands:
- uv pip install vllm
- vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
port: 8000
resources:
  gpu: 24GB
replicas: 1..4
scaling:
  # Requests per second
  metric: rps
  # Target metric value
  target: 10
The replicas property can be a number or a range.
The metric property of scaling currently supports only the rps metric (requests per second);
when scaling is configured, dstack automatically adjusts the number of replicas (scaling up or down) based on the load.
Setting the minimum number of replicas to 0 allows the service to scale down to zero when there are no requests.
The
scaling
property requires creating a gateway.
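For example, here's a minimal sketch of a scale-to-zero setup; it assumes a gateway is already configured:

type: service
name: llama31-service

python: 3.12

env:
  - HF_TOKEN
commands:
  - uv pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
port: 8000

resources:
  gpu: 24GB

# Allow scaling down to zero replicas when there is no traffic
replicas: 0..4
scaling:
  metric: rps
  target: 10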
Model¶
If the service is running a chat model with an OpenAI-compatible interface,
set the model property to make the model accessible via dstack's global OpenAI-compatible endpoint,
as well as via the dstack UI.
Authorization¶
By default, the service enables authorization, meaning the service endpoint requires a dstack
user token.
This can be disabled by setting auth to false.
type: service
name: http-server-service
# Disable authorization
auth: false
python: 3.12
commands:
- python3 -m http.server
port: 8000
Path prefix¶
If your dstack
project doesn't have a gateway, services are hosted with the
/proxy/services/<project name>/<run name>/
path prefix in the URL.
When running web apps, you may need to set some app-specific settings
so that browser-side scripts and CSS work correctly with the path prefix.
type: service
name: dash
gateway: false
auth: false
# Do not strip the path prefix
strip_prefix: false
env:
# Configure Dash to work with a path prefix
# Replace `main` with your dstack project name
- DASH_ROUTES_PATHNAME_PREFIX=/proxy/services/main/dash/
commands:
- uv pip install dash
# Assuming the Dash app is in your repo at app.py
- python app.py
port: 8050
By default, dstack strips the prefix before forwarding requests to your service,
so to the service it appears as if the prefix isn't there. This allows some apps
to work out of the box. If your app doesn't expect the prefix to be stripped,
set strip_prefix to false.
If your app cannot be configured to work with a path prefix, you can host it on a dedicated domain name by setting up a gateway.
Rate limits¶
If you have a gateway, you can configure rate limits for your service
using the rate_limits
property.
type: service
image: my-app:latest
port: 80
rate_limits:
# For /api/auth/* - 1 request per second, no bursts
- prefix: /api/auth/
  rps: 1
# For other URLs - 4 requests per second + bursts of up to 9 requests
- rps: 4
  burst: 9
The rps limit sets the max requests per second, tracked in milliseconds (e.g., rps: 4 means 1 request every 250 ms).
Use burst to allow short spikes while keeping the average within rps.
Limits apply to the whole service (all replicas) and per client (by IP). Clients exceeding the limit get a 429 error.
Partitioning key
Instead of partitioning requests by client IP address, you can choose to partition by the value of a header.
type: service
image: my-app:latest
port: 80
rate_limits:
- rps: 4
  burst: 9
  # Apply to each user, as determined by the `Authorization` header
  key:
    type: header
    header: Authorization
Resources¶
If you specify memory size, you can either set an explicit size (e.g. 24GB) or a range (e.g. 24GB.., 24GB..80GB, or ..80GB).
type: service
name: llama31-service
python: 3.12
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
- MAX_MODEL_LEN=4096
commands:
- uv pip install vllm
- |
  vllm serve $MODEL_ID \
    --max-model-len $MAX_MODEL_LEN \
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
resources:
  # 16 or more x86_64 cores
  cpu: 16..
  # 2 GPUs of 80GB
  gpu: 80GB:2
  # Minimum disk size
  disk: 200GB
The cpu property lets you set the architecture (x86 or arm) and core count, e.g., x86:16 (16 x86 cores) or arm:8.. (at least 8 ARM cores).
If not set, dstack infers it from the GPU or defaults to x86.
The gpu property lets you specify vendor, model, memory, and count, e.g., nvidia (one NVIDIA GPU), A100 (one A100), A10G,A100 (either), A100:80GB (one 80GB A100), A100:2 (two A100s), 24GB..40GB:2 (two GPUs with 24–40GB), A100:40GB:2 (two 40GB A100s).
If the vendor is omitted, dstack infers it from the model or defaults to nvidia.
Shared memory
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure shm_size, e.g. set it to 16GB.
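Putting these options together, a resources block might look like the following sketch (the values are purely illustrative):

resources:
  # At least 16 x86 cores
  cpu: x86:16..
  # Two 40GB A100s
  gpu: A100:40GB:2
  # Shared memory for parallel communicating processes
  shm_size: 16GB
  # Minimum disk size
  disk: 200GB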
If you’re unsure which offers (hardware configurations) are available from the configured backends, use the
dstack offer
command to list them.
Docker¶
Default image¶
If you don't specify image, dstack uses its base Docker image pre-configured with
uv, python, pip, essential CUDA drivers, and NCCL tests (under /opt/nccl-tests/build).
Set the python
property to pre-install a specific version of Python.
type: service
name: http-server-service
python: 3.12
commands:
- python3 -m http.server
port: 8000
NVCC¶
By default, the base Docker image doesn’t include nvcc
, which is required for building custom CUDA kernels.
If you need nvcc
, set the nvcc
property to true.
type: service
name: http-server-service
python: 3.12
nvcc: true
commands:
- python3 -m http.server
port: 8000
Custom image¶
If you want, you can specify your own Docker image via image.
type: service
name: http-server-service
image: python
commands:
- python3 -m http.server
port: 8000
Docker in Docker¶
Set docker
to true
to enable the docker
CLI in your service, e.g., to run Docker images or use Docker Compose.
type: service
name: chat-ui-task
auth: false
docker: true
working_dir: examples/misc/docker-compose
commands:
- docker compose up
port: 9000
Cannot be used with python or image. Not supported on runpod, vastai, or kubernetes.
Privileged mode¶
To enable privileged mode, set privileged to true.
Not supported on runpod, vastai, or kubernetes.
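For example, a minimal sketch reusing the http-server-service configuration from above:

type: service
name: http-server-service

# Enable privileged mode
privileged: true

python: 3.12
commands:
  - python3 -m http.server
port: 8000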
Private registry¶
Use the registry_auth
property to provide credentials for a private Docker registry.
type: service
name: serve-distill-deepseek
env:
- NGC_API_KEY
- NIM_MAX_MODEL_LEN=4096
image: nvcr.io/nim/deepseek-ai/deepseek-r1-distill-llama-8b
registry_auth:
  username: $oauthtoken
  password: ${{ env.NGC_API_KEY }}
port: 8000
model: deepseek-ai/deepseek-r1-distill-llama-8b
resources:
  gpu: H100:1
Environment variables¶
type: service
name: llama-2-7b-service
python: 3.12
env:
- HF_TOKEN
- MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
- uv pip install vllm
- python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000
resources:
  gpu: 24GB
If you don't assign a value to an environment variable (see HF_TOKEN above), dstack
will require the value to be passed via the CLI or set in the current process.
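For example, HF_TOKEN can be passed inline when applying the configuration:

$ HF_TOKEN=<your token> dstack apply -f .dstack.yml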
System environment variables
The following environment variables are available in any run by default:
| Name | Description |
|---|---|
| DSTACK_RUN_NAME | The name of the run |
| DSTACK_REPO_ID | The ID of the repo |
| DSTACK_GPUS_NUM | The total number of GPUs in the run |
Files¶
By default, dstack
automatically mounts the repo directory where you ran dstack init
to any run configuration.
However, in some cases, you may not want to mount the entire directory (e.g., if it’s too large),
or you might want to mount files outside of it. In such cases, you can use the files
property.
type: service
name: llama-2-7b-service
files:
- .:examples # Maps the directory where `.dstack.yml` is located to `/workflow/examples`
- ~/.ssh/id_rsa:/root/.ssh/id_rsa # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`
python: 3.12
env:
- HF_TOKEN
- MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
- uv pip install vllm
- python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000
resources:
  gpu: 24GB
Each entry maps a local directory or file to a path inside the container. Both local and container paths can be relative or absolute.
- If the local path is relative, it’s resolved relative to the configuration file.
- If the container path is relative, it’s resolved relative to
/workflow
.
The container path is optional. If not specified, it will be automatically calculated.
type: service
name: llama-2-7b-service
files:
- ../examples # Maps `examples` (the parent directory of `.dstack.yml`) to `/workflow/examples`
- ~/.ssh/id_rsa # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`
python: 3.12
env:
- HF_TOKEN
- MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
- uv pip install vllm
- python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000
resources:
  gpu: 24GB
Note: If you want to use files without mounting the entire repo directory,
make sure to pass --no-repo when running dstack apply:
$ dstack apply -f examples/.dstack.yml --no-repo
.gitignore and .dstackignore
dstack automatically excludes files and folders listed in .gitignore and .dstackignore.
Uploads are limited to 2MB. To avoid exceeding this limit, make sure to exclude unnecessary files.
You can increase the default server limit by setting the DSTACK_SERVER_CODE_UPLOAD_LIMIT
environment variable.
Experimental
The files
feature is experimental. Feedback is highly appreciated.
Retry policy¶
By default, if dstack
can't find capacity, or the service exits with an error, or the instance is interrupted, the run will fail.
If you'd like dstack
to automatically retry, configure the
retry property accordingly:
type: service
image: my-app:latest
port: 80
retry:
  on_events: [no-capacity, error, interruption]
  # Retry for up to 1 hour
  duration: 1h
If one replica of a multi-replica service fails with retry enabled,
dstack
will resubmit only the failed replica while keeping active replicas running.
Spot policy¶
By default, dstack uses on-demand instances. However, you can change that
via the spot_policy property. It accepts spot, on-demand, and auto.
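For example, the following sketch uses the auto policy (also shown commented out in the first configuration on this page), which lets dstack use spot instances when they are available:

type: service
name: llama31-service

python: 3.12

env:
  - HF_TOKEN
commands:
  - uv pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
port: 8000

# Allow both spot and on-demand instances
spot_policy: auto

resources:
  gpu: 24GB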
Utilization policy¶
Sometimes it’s useful to track whether a service is fully utilizing all GPUs. While you can check this with
dstack metrics
, dstack
also lets you set a policy to auto-terminate the run if any GPU is underutilized.
Below is an example of a service that auto-terminates if any GPU stays below 10% utilization for 1 hour.
type: service
name: llama-2-7b-service
python: 3.12
env:
- HF_TOKEN
- MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
- uv pip install vllm
- python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000
resources:
  gpu: 24GB
utilization_policy:
  min_gpu_utilization: 10
  time_window: 1h
Creation policy¶
By default, when you run dstack apply
with a dev environment, task, or service,
if no idle
instances from the available fleets meet the requirements, dstack
creates a new fleet
using configured backends.
To ensure dstack apply doesn't create a new fleet but reuses an existing one,
pass -R (or --reuse) to dstack apply.
$ dstack apply -R -f examples/.dstack.yml
Or, set creation_policy
to reuse
in the run configuration.
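For example, a minimal sketch (my-app:latest is a placeholder image, as in the rate-limits example):

type: service
image: my-app:latest
port: 80

# Reuse an existing fleet instead of provisioning a new one
creation_policy: reuse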
Idle duration¶
If a fleet is created automatically, it stays idle
for 5 minutes by default and can be reused within that time.
If the fleet is not reused within this period, it is automatically terminated.
To change the default idle duration, set idle_duration in the run configuration
(e.g., 0s, 1m, or off for unlimited).
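For example, a minimal sketch that disables automatic termination of the auto-created fleet:

type: service
image: my-app:latest
port: 80

# Keep the fleet idle indefinitely (never auto-terminate it)
idle_duration: off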
Fleets
For greater control over fleet provisioning, it is recommended to create fleets explicitly.
Reference
Services support many more configuration options, including backends, regions, and max_price, among others.
Manage runs¶
dstack provides several commands to manage runs (see the examples below):

- dstack ps – Lists all running jobs and their statuses. Use --watch (or -w) to monitor the live status of runs.
- dstack stop – Stops a run gracefully. Pass --abort or -x to stop it immediately without waiting for a graceful shutdown. By default, a run continues until you stop it or its lifetime exceeds the value of max_duration.
- dstack attach – By default, dstack apply runs in attached mode, establishing an SSH tunnel to the run, forwarding ports, and displaying real-time logs. If you detach from a run, use this command to reattach.
- dstack logs – Displays run logs. Pass --diagnose or -d to view diagnostic logs, which can help troubleshoot failed runs.
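For example (llama31 is the run name used earlier on this page):

$ dstack ps --watch
$ dstack attach llama31
$ dstack logs --diagnose llama31
$ dstack stop llama31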
What's next?