service

The service configuration type allows running services.

Configuration files must be inside the project repo, and their names must end with .dstack.yml (e.g. .dstack.yml or serve.dstack.yml are both acceptable). Any configuration can be run via dstack apply.

Examples

Python version

If you don't specify image, dstack uses its base Docker image pre-configured with python, pip, conda (Miniforge), and essential CUDA drivers. The python property determines which default Docker image is used.

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service    

# If `image` is not specified, dstack uses its base image
python: "3.10"

# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000

nvcc

By default, the base Docker image doesn’t include nvcc, which is required for building custom CUDA kernels. If you need nvcc, set the corresponding property to true.

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service    

# If `image` is not specified, dstack uses its base image
python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention) 
nvcc: true

# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000

Docker

If you want, you can specify your own Docker image via image.

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service

# Any custom Docker image
image: dstackai/base:py3.13-0.6-cuda-12.1

# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000

Private Docker registry

Use the registry_auth property to provide credentials for a private Docker registry.

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service

# Any private Docker image
image: dstackai/base:py3.13-0.6-cuda-12.1
# Credentials of the private registry
registry_auth:
  username: peterschmidt85
  password: ghp_e49HcZ9oYwBzUbcSk2080gXZOU2hiT9AeSR5

# Commands of the service  
commands:
  - python3 -m http.server
# The port of the service
port: 8000

Docker and Docker Compose

All backends except runpod, vastai, and kubernetes also allow using Docker and Docker Compose inside dstack runs.

Models

If you are running a chat model with an OpenAI-compatible interface, set the model property to make the model accessible via the OpenAI-compatible endpoint provided by dstack.

type: service
# The name is optional, if not specified, generated randomly
name: llama31-service

python: "3.10"

# Required environment variables
env:
  - HF_TOKEN
commands:
  - pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
# Expose the port of the service
port: 8000

resources:
  # Change to what is required
  gpu: 24GB

# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Alternatively, use this syntax to set more model settings:
# model:
#   type: chat
#   name: meta-llama/Meta-Llama-3.1-8B-Instruct
#   format: openai
#   prefix: /v1

Once the service is up, the model will be available via the OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name> or at https://gateway.<gateway domain> if your project has a gateway.

Auto-scaling

By default, dstack runs a single replica of the service. You can configure the number of replicas as well as the auto-scaling rules.

type: service
# The name is optional, if not specified, generated randomly
name: llama31-service

python: "3.10"

# Required environment variables
env:
  - HF_TOKEN
commands:
  - pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
# Expose the port of the service
port: 8000

resources:
  # Change to what is required
  gpu: 24GB

# Minimum and maximum number of replicas
replicas: 1..4
scaling:
  # Requests per second
  metric: rps
  # Target metric value
  target: 10

The replicas property can be a number or a range.

The metric property of scaling only supports the rps metric (requests per second). In this case dstack adjusts the number of replicas (scales up or down) automatically based on the load.

Setting the minimum number of replicas to 0 allows the service to scale down to zero when there are no requests.
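As a minimal sketch of scale-to-zero, the auto-scaling example above can be adapted by lowering the lower bound of the range. The values 0..4 and 10 are illustrative, not recommendations.

```yaml
type: service
name: llama31-service

python: "3.10"

env:
  - HF_TOKEN
commands:
  - pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
port: 8000

resources:
  gpu: 24GB

# A minimum of 0 allows scaling down to zero when there are no requests
replicas: 0..4
scaling:
  metric: rps
  target: 10
```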

Gateway

Services with a fixed number of replicas are supported both with and without a gateway. Auto-scaling is currently only supported for services running with a gateway.
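A minimal sketch of a service that runs without a gateway, based on the gateway property described in the reference below. The replica count of 2 is an illustrative value; because there is no gateway, it is a fixed number rather than a range.

```yaml
type: service
name: http-server-service

python: "3.10"

commands:
  - python3 -m http.server
port: 8000

# Run without a gateway (omit this property to use the default gateway)
gateway: false
# A fixed number of replicas; a range would require a gateway
replicas: 2
```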

Resources

If you specify memory size, you can either specify an explicit size (e.g. 24GB) or a range (e.g. 24GB.., or 24GB..80GB, or ..80GB).

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service

python: "3.10"

# Commands of the service
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server
    --model mistralai/Mixtral-8X7B-Instruct-v0.1
    --host 0.0.0.0
    --tensor-parallel-size $DSTACK_GPUS_NUM
# Expose the port of the service
port: 8000

resources:
  # 2 GPUs of 80GB
  gpu: 80GB:2

  # Minimum disk size
  disk: 200GB

The gpu property allows specifying not only memory size but also GPU vendor, names and their quantity. Examples: nvidia (one NVIDIA GPU), A100 (one A100), A10G,A100 (either A10G or A100), A100:80GB (one A100 of 80GB), A100:2 (two A100), 24GB..40GB:2 (two GPUs between 24GB and 40GB), A100:40GB:2 (two A100 GPUs of 40GB).
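The same requirements can presumably also be written as an object, using the fields listed under resources.gpu in the reference below. This is a sketch with illustrative values, not a recommended setup.

```yaml
resources:
  gpu:
    # Either A10G or A100
    name: [A10G, A100]
    # At least 40GB of memory per GPU
    memory: 40GB..
    # Two GPUs
    count: 2
```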

Shared memory

If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure shm_size, e.g. set it to 16GB.
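A minimal sketch of the resources block with shm_size set. The GPU size is only there for context and is an illustrative value.

```yaml
resources:
  gpu: 24GB
  # Shared memory for parallel communicating processes
  shm_size: 16GB
```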

Authorization

By default, the service endpoint requires the Authorization header with "Bearer <dstack token>". Authorization can be disabled by setting auth to false.

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service

# Disable authorization
auth: false

python: "3.10"

# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000

Environment variables

type: service
# The name is optional, if not specified, generated randomly
name: llama-2-7b-service

python: "3.10"

# Environment variables
env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
# Commands of the service
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
# The port of the service
port: 8000

resources:
  # Required GPU vRAM
  gpu: 24GB

If you don't assign a value to an environment variable (see HF_TOKEN above), dstack will require the value to be passed via the CLI or set in the current process.

For instance, you can define environment variables in a .envrc file and utilize tools like direnv.
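The reference below notes that env accepts either a list or a mapping. A minimal sketch of the mapping form, assuming every variable is given an explicit value:

```yaml
type: service
name: llama-2-7b-service

python: "3.10"

# Environment variables as a mapping instead of a list
env:
  MODEL: NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```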

System environment variables

The following environment variables are available in any run and are passed by dstack by default:

Name             Description
DSTACK_RUN_NAME  The name of the run
DSTACK_REPO_ID   The ID of the repo
DSTACK_GPUS_NUM  The total number of GPUs in the run

Spot policy

You can choose whether to use spot instances, on-demand instances, or any available type.

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service

commands:
  - python3 -m http.server
# The port of the service
port: 8000

# Uncomment to leverage spot instances
#spot_policy: auto

The spot_policy property accepts spot, on-demand, and auto. The default for services is on-demand.

Backends

By default, dstack provisions instances in all configured backends. However, you can specify the list of backends:

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service

# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000

# Use only listed backends
backends: [aws, gcp]

Regions

By default, dstack uses all configured regions. However, you can specify the list of regions:

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service

# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000

# Use only listed regions
regions: [eu-west-1, eu-west-2]

Volumes

Volumes allow you to persist data between runs. To attach a volume, simply specify its name using the volumes property and specify where to mount its contents:

type: service
# The name is optional, if not specified, generated randomly
name: http-server-service

# Commands of the service
commands:
  - python3 -m http.server
# The port of the service
port: 8000

# Map the name of the volume to any path
volumes:
  - name: my-new-volume
    path: /volume_data

Once you run this configuration, the volume will be mounted at /volume_data inside the service, and its contents will persist across runs.

Instance volumes

If data persistence is not a strict requirement, you can also use ephemeral instance volumes.
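A minimal sketch of an instance volume, using the instance_path and path properties from the volumes[n] reference below. Both paths are hypothetical.

```yaml
type: service
name: http-server-service

commands:
  - python3 -m http.server
port: 8000

volumes:
  # Path on the instance (host); hypothetical
  - instance_path: /mnt/cache
    # Path inside the container; hypothetical
    path: /cache
```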

Limitations

When you're running a dev environment, task, or service with dstack, it automatically mounts the project folder contents to /workflow (and sets that as the current working directory). Right now, dstack doesn't allow you to attach volumes to /workflow or any of its subdirectories.

The service configuration type supports many other options. See below.

Root reference

port - The port that the application listens on, or the port mapping.

gateway - (Optional) The name of the gateway. Specify boolean false to run without a gateway. Omit to run with the default gateway.

model - (Optional) Mapping of the model for the OpenAI-compatible endpoint provided by dstack. Can be a full model format definition or just a model name. If it's a name, the service is expected to expose an OpenAI-compatible API at the /v1 path.

https - (Optional) Enable HTTPS if running with a gateway. Defaults to True.

auth - (Optional) Enable the authorization. Defaults to True.

replicas - (Optional) The number of replicas. Can be a number (e.g. 2) or a range (0..4 or 1..8). If it's a range, the scaling property is required. Defaults to 1.

scaling - (Optional) The auto-scaling rules. Required if replicas is set to a range.

name - (Optional) The run name.

image - (Optional) The name of the Docker image to run.

user - (Optional) The user inside the container, user_name_or_id[:group_name_or_id] (e.g., ubuntu, 1000:1000). Defaults to the default image user.

privileged - (Optional) Run the container in privileged mode.

entrypoint - (Optional) The Docker entrypoint.

working_dir - (Optional) The path to the working directory inside the container. It's specified relative to the repository directory (/workflow) and should be inside it. Defaults to "." .

registry_auth - (Optional) Credentials for pulling a private Docker image.

python - (Optional) The major version of Python. Mutually exclusive with image.

nvcc - (Optional) Use image with NVIDIA CUDA Compiler (NVCC) included. Mutually exclusive with image.

env - (Optional) The mapping or the list of environment variables.

resources - (Optional) The resources requirements to run the configuration.

volumes - (Optional) The volumes mount points.

commands - (Optional) The bash commands to run.

backends - (Optional) The backends to consider for provisioning (e.g., [aws, gcp]).

regions - (Optional) The regions to consider for provisioning (e.g., [eu-west-1, us-west4, westeurope]).

instance_types - (Optional) The cloud-specific instance types to consider for provisioning (e.g., [p3.8xlarge, n1-standard-4]).

reservation - (Optional) The existing reservation for the instances.

spot_policy - (Optional) The policy for provisioning spot or on-demand instances: spot, on-demand, or auto. Defaults to on-demand.

retry - (Optional) The policy for resubmitting the run. Defaults to false.

retry_policy - (Optional) The policy for resubmitting the run. Deprecated in favor of retry.

max_duration - (Optional) The maximum duration of a run (e.g., 2h, 1d, etc). After it elapses, the run is forced to stop. Defaults to off.

max_price - (Optional) The maximum instance price per hour, in dollars.

pool_name - (Optional) The name of the pool. If not set, dstack will use the default name.

instance_name - (Optional) The name of the instance.

creation_policy - (Optional) The policy for using instances from the pool. Defaults to reuse-or-create.

termination_policy - (Optional) The policy for instance termination. Defaults to destroy-after-idle.

termination_idle_time - (Optional) Time to wait before destroying the idle instance. Defaults to 5m for dstack run and to 3d for dstack pool add.
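To illustrate how several of the options above combine, here is a sketch that limits provisioning by instance type, price, and duration. All values are illustrative and taken from the examples in the property descriptions where available.

```yaml
type: service
name: http-server-service

python: "3.10"

commands:
  - python3 -m http.server
port: 8000

# Consider only these cloud-specific instance types
instance_types: [p3.8xlarge]
# Use spot or on-demand instances, whichever is available
spot_policy: auto
# Maximum instance price per hour, in dollars
max_price: 2.5
# Force the run to stop after two hours
max_duration: 2h
```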

model[format=openai]

type - The type of the model. Must be chat.

name - The name of the model.

format - The serving format.

prefix - (Optional) The base_url prefix (after hostname). Defaults to /v1.

model[format=tgi]

TGI provides an OpenAI-compatible API starting with version 1.4.0, so models served by TGI can be defined with format: openai too.

type - The type of the model. Must be chat.

name - The name of the model.

format - The serving format.

chat_template - (Optional) The custom prompt template for the model. If not specified, the default prompt template from the HuggingFace Hub configuration will be used.

eos_token - (Optional) The custom end of sentence token. If not specified, the default end of sentence token from the HuggingFace Hub configuration will be used.

Chat template

By default, dstack loads the chat template from the model's repository. If it is not present there, manual configuration is required.

type: service

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ
commands:
  - text-generation-launcher --port 8000 --trust-remote-code --quantize gptq
port: 8000

resources:
  gpu: 80GB

# Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: TheBloke/Llama-2-13B-chat-GPTQ
  format: tgi
  chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '<s>[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content.strip() + ' </s>' }}{% endif %}{% endfor %}"
  eos_token: "</s>"

Limitations

Please note that model mapping is an experimental feature with the following limitations:

  1. Doesn't work if your chat_template uses bos_token. As a workaround, replace bos_token inside chat_template with the token content itself.
  2. Doesn't work if eos_token is defined in the model repository as a dictionary. As a workaround, set eos_token manually, as shown in the example above (see Chat template).

If you encounter any other issues, please make sure to file a GitHub issue.

scaling

metric - The target metric to track. Currently, the only supported value is rps (meaning requests per second).

target - The target value of the metric. The number of replicas is calculated based on this number and automatically adjusts (scales up or down) as this metric changes.

scale_up_delay - (Optional) The delay in seconds before scaling up. Defaults to 300.

scale_down_delay - (Optional) The delay in seconds before scaling down. Defaults to 600.
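A sketch of a scaling block that overrides both delays. The values are illustrative and, per the descriptions above, are given in seconds.

```yaml
replicas: 1..4
scaling:
  metric: rps
  target: 10
  # Wait 2 minutes before scaling up (default: 300)
  scale_up_delay: 120
  # Wait 15 minutes before scaling down (default: 600)
  scale_down_delay: 900
```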

resources

cpu - (Optional) The number of CPU cores. Defaults to the range 2.. (two or more).

memory - (Optional) The RAM size (e.g., 8GB). Defaults to the range 8GB.. (8GB or more).

shm_size - (Optional) The size of shared memory (e.g., 8GB). If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure this.

gpu - (Optional) The GPU requirements. Can be set to a number, a string (e.g. A100, 80GB:2, etc.), or an object.

disk - (Optional) The disk resources.

resources.gpu

vendor - (Optional) The vendor of the GPU/accelerator, one of: nvidia, amd, google (alias: tpu).

name - (Optional) The GPU name or list of names.

count - (Optional) The number of GPUs. Defaults to 1.

memory - (Optional) The memory size of each GPU (e.g., 16GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).

total_memory - (Optional) The total memory size across all GPUs (e.g., 32GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).

compute_capability - (Optional) The minimum compute capability of the GPU (e.g., 7.5).

resources.disk

size - The disk size. Can be a string (e.g., 100GB or 100GB..) or an object.
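A sketch of the two forms described above, with an illustrative size. The object form is shown commented out, following the convention used for model earlier on this page.

```yaml
resources:
  # String form: at least 100GB of disk
  disk: 100GB..

  # Alternatively, the object form:
  # disk:
  #   size: 100GB..
```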

registry_auth

username - The username.

password - The password or access token.

volumes[n]

For network volumes:

name - The network volume name or the list of network volume names to mount. If a list is specified, one of the volumes in the list will be mounted. Specify volumes from different backends/regions to increase availability.

path - The absolute container path to mount the volume at.

For instance volumes:

instance_path - The absolute path on the instance (host).

path - The absolute path in the container.

Short syntax

The short syntax for volumes is a colon-separated string in the form of source:destination:

  • volume-name:/container/path for network volumes
  • /instance/path:/container/path for instance volumes
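A sketch of both short forms in one volumes list. The network volume name is the one used in the Volumes example earlier on this page; the instance paths are hypothetical.

```yaml
volumes:
  # Network volume: volume-name:/container/path
  - my-new-volume:/volume_data
  # Instance volume: /instance/path:/container/path (hypothetical paths)
  - /mnt/cache:/cache
```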