service

The service configuration type allows running services.

Configuration files must have a name ending with .dstack.yml (e.g., .dstack.yml or serve.dstack.yml are both acceptable) and can be located in the project's root directory or any nested folder. Any configuration can be run via dstack run . -f PATH.

Examples

Python version

If you don't specify image, dstack uses the default Docker image pre-configured with python, pip, conda (Miniforge), and essential CUDA drivers. The python property determines which default Docker image is used.

type: service

python: "3.11"

commands:
  - python3 -m http.server

port: 8000

nvcc

Note that the default Docker image doesn't bundle nvcc, which is required for building custom CUDA kernels. To install it, use conda install cuda.
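For instance, here's a minimal sketch of a configuration that installs it on startup (placing the install step in commands is an assumption; adjust it to your setup):

type: service

python: "3.11"

commands:
  - conda install -y cuda # Provides nvcc for building custom CUDA kernels
  - python3 -m http.server

port: 8000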

Docker image

type: service

image: dstackai/base:py3.11-0.4rc4-cuda-12.1

commands:
  - python3 -m http.server

port: 8000

Private Docker registry

Use the registry_auth property to provide credentials for a private Docker registry.

type: service

image: dstackai/base:py3.11-0.4rc4-cuda-12.1

commands:
  - python3 -m http.server
registry_auth:
  username: peterschmidt85
  password: ghp_e49HcZ9oYwBzUbcSk2080gXZOU2hiT9AeSR5

port: 8000

OpenAI-compatible interface

By default, if you run a service, its endpoint is accessible at https://<run name>.<gateway domain>.

If you run a model, you can optionally configure the mapping to make it accessible via the OpenAI-compatible interface.

type: service

python: "3.11"

env:
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

# Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: NousResearch/Llama-2-7b-chat-hf

With this configuration, once the service is up, the model is accessible at https://gateway.<gateway domain> via the OpenAI-compatible interface.

The format supports only tgi (Text Generation Inference) and openai (if you are using Text Generation Inference or vLLM with OpenAI-compatible mode).

Chat template

By default, dstack loads the chat template from the model's repository. If it is not present there, manual configuration is required.

type: service

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ
commands:
  - text-generation-launcher --port 8000 --trust-remote-code --quantize gptq
port: 8000

resources:
  gpu: 80GB

# Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: TheBloke/Llama-2-13B-chat-GPTQ
  format: tgi
  chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '<s>[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content.strip() + ' </s>' }}{% endif %}{% endfor %}"
  eos_token: "</s>"

Limitations

Please note that model mapping is an experimental feature with the following limitations:

  1. Doesn't work if your chat_template uses bos_token. As a workaround, replace bos_token inside chat_template with the token content itself.
  2. Doesn't work if eos_token is defined in the model repository as a dictionary. As a workaround, set eos_token manually, as shown in the example above (see Chat template).

If you encounter any other issues, please make sure to file a GitHub issue.

Replicas and auto-scaling

By default, dstack runs a single replica of the service. You can configure the number of replicas as well as the auto-scaling policy.

type: service

python: "3.11"

env:
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

# Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: NousResearch/Llama-2-7b-chat-hf

replicas: 1..4
scaling:
  metric: rps
  target: 10

If you specify the minimum number of replicas as 0, the service will scale down to zero when there are no requests.
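For instance, here's a sketch of the replicas and scaling fragment from the example above, adjusted to allow scaling down to zero:

replicas: 0..4
scaling:
  metric: rps
  target: 10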

Resources

When specifying memory size, you can use either an explicit size (e.g., 24GB) or a range (e.g., 24GB.., 24GB..80GB, or ..80GB).

type: service

python: "3.11"
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server
    --model mistralai/Mixtral-8X7B-Instruct-v0.1
    --host 0.0.0.0
    --tensor-parallel-size 2 # Match the number of GPUs
port: 8000

resources:
  # 2 GPUs of 80GB
  gpu: 80GB:2

  disk: 200GB

# Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
  format: openai

The gpu property allows specifying not only memory size but also GPU names and their quantity. Examples: A100 (one A100), A10G,A100 (either A10G or A100), A100:80GB (one A100 of 80GB), A100:2 (two A100), 24GB..40GB:2 (two GPUs between 24GB and 40GB), A100:40GB:2 (two A100 GPUs of 40GB).
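For instance, an illustrative fragment requesting two A100 GPUs of 40GB each:

resources:
  gpu: A100:40GB:2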

Shared memory

If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure shm_size, e.g. set it to 16GB.
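For example, an illustrative fragment (the 16GB value is just a starting point):

resources:
  gpu: 24GB
  shm_size: 16GB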

Authorization

By default, the service endpoint requires the Authorization header with "Bearer <dstack token>". Authorization can be disabled by setting auth to false.

type: service

python: "3.11"

commands:
  - python3 -m http.server

port: 8000

auth: false

Environment variables

type: service

python: "3.11"

env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

If you don't assign a value to an environment variable (see HUGGING_FACE_HUB_TOKEN above), dstack will require the value to be passed via the CLI or set in the current process.

For instance, you can define environment variables in a .env file and utilize tools like direnv.

Default environment variables

The following environment variables are available in any run and are passed by dstack by default:

DSTACK_RUN_NAME - The name of the run
DSTACK_REPO_ID - The ID of the repo
DSTACK_GPUS_NUM - The total number of GPUs in the run

Spot policy

You can choose whether to use spot instances, on-demand instances, or any available type.

type: service

commands:
  - python3 -m http.server

port: 8000

spot_policy: auto

The spot_policy accepts spot, on-demand, and auto. The default for services is auto.

Backends

By default, dstack provisions instances in all configured backends. However, you can specify the list of backends:

type: service

commands:
  - python3 -m http.server

port: 8000

backends: [aws, gcp]

Regions

By default, dstack uses all configured regions. However, you can specify the list of regions:

type: service

commands:
  - python3 -m http.server

port: 8000

regions: [eu-west-1, eu-west-2]

The service configuration type supports many other options. See below.

Root reference

port - The port the application listens on, or the port mapping.

model - (Optional) Mapping of the model for the OpenAI-compatible endpoint.

https - (Optional) Enable HTTPS. Defaults to True.

auth - (Optional) Enable authorization. Defaults to True.

replicas - (Optional) The number of replicas, or a range (e.g., 1..4). Defaults to 1.

scaling - (Optional) The auto-scaling configuration.

image - (Optional) The name of the Docker image to run.

entrypoint - (Optional) The Docker entrypoint.

home_dir - (Optional) The absolute path to the home directory inside the container. Defaults to /root.

registry_auth - (Optional) Credentials for pulling a private Docker image.

python - (Optional) The major version of Python. Mutually exclusive with image.

env - (Optional) The mapping or the list of environment variables.

setup - (Optional) The bash commands to run on boot.

resources - (Optional) The resource requirements to run the configuration.

commands - (Optional) The bash commands to run.

backends - (Optional) The backends to consider for provisioning (e.g., [aws, gcp]).

regions - (Optional) The regions to consider for provisioning (e.g., [eu-west-1, us-west4, westeurope]).

instance_types - (Optional) The cloud-specific instance types to consider for provisioning (e.g., [p3.8xlarge, n1-standard-4]).

spot_policy - (Optional) The policy for provisioning spot or on-demand instances: spot, on-demand, or auto.

retry_policy - (Optional) The policy for re-submitting the run.

max_duration - (Optional) The maximum duration of a run (e.g., 2h, 1d, etc). After it elapses, the run is forced to stop. Defaults to off.

max_price - (Optional) The maximum price per hour, in dollars.

pool_name - (Optional) The name of the pool. If not set, dstack will use the default name.

instance_name - (Optional) The name of the instance.

creation_policy - (Optional) The policy for using instances from the pool. Defaults to reuse-or-create.

termination_policy - (Optional) The policy for terminating instances. Defaults to destroy-after-idle.

termination_idle_time - (Optional) Time to wait before destroying the idle instance. Defaults to 5m for dstack run and to 3d for dstack pool add.

model

type - The type of the model.

name - The name of the model.

format - The serving format.

scaling

metric - The target metric to track.

target - The target value of the metric.

scale_up_delay - (Optional) The delay in seconds before scaling up. Defaults to 300.

scale_down_delay - (Optional) The delay in seconds before scaling down. Defaults to 600.

resources

cpu - (Optional) The number of CPU cores. Defaults to 2.. (at least 2).

memory - (Optional) The RAM size (e.g., 8GB). Defaults to 8GB.. (at least 8GB).

shm_size - (Optional) The size of shared memory (e.g., 8GB). If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure this.

gpu - (Optional) The GPU requirements. Can be set to a number, a string (e.g. A100, 80GB:2, etc.), or an object; see examples.

disk - (Optional) The disk resources.

resources.gpu

name - (Optional) The GPU name or list of names.

count - (Optional) The number of GPUs. Defaults to 1.

memory - (Optional) The VRAM size (e.g., 16GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).

total_memory - (Optional) The total VRAM size (e.g., 32GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).

compute_capability - (Optional) The minimum compute capability of the GPU (e.g., 7.5).
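
For illustration, here's a sketch of the object form of gpu using the properties above (the values are arbitrary):

resources:
  gpu:
    name: A100
    count: 2
    memory: 40GB..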

resources.disk

size - The disk size. Can be a string (e.g., 100GB or 100GB..) or an object; see examples.

registry_auth

username - The username.

password - The password or access token.