# service

The `service` configuration type allows running services.
**Filename**

Configuration files must have a name ending with `.dstack.yml` (e.g., `.dstack.yml` or `serve.dstack.yml` are both acceptable) and can be located in the project's root directory or any nested folder.
Any configuration can be run via `dstack run`.
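For example, assuming a configuration saved as `serve.dstack.yml` in the project's root directory, it can be run like this:

```shell
dstack run . -f serve.dstack.yml
```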
## Examples

### Python version
If you don't specify `image`, `dstack` uses the default Docker image pre-configured with `python`, `pip`, `conda` (Miniforge), and essential CUDA drivers. The `python` property determines which default Docker image is used.
```yaml
type: service
python: "3.11"
commands:
  - python3 -m http.server
port: 8000
```
**nvcc**

Note that the default Docker image doesn't bundle `nvcc`, which is required for building custom CUDA kernels. To install it, use `conda install cuda`.
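For illustration, a minimal sketch of a configuration that installs `nvcc` before starting (the server command is just a placeholder):

```yaml
type: service
python: "3.11"
commands:
  # Install nvcc into the default image (-y makes the install non-interactive)
  - conda install -y cuda
  - python3 -m http.server
port: 8000
```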
### Docker image
```yaml
type: service
image: dstackai/base:py3.11-0.4rc4-cuda-12.1
commands:
  - python3 -m http.server
port: 8000
```
**Private Docker registry**

Use the `registry_auth` property to provide credentials for a private Docker registry.
```yaml
type: service
image: dstackai/base:py3.11-0.4rc4-cuda-12.1
commands:
  - python3 -m http.server
registry_auth:
  username: peterschmidt85
  password: ghp_e49HcZ9oYwBzUbcSk2080gXZOU2hiT9AeSR5
port: 8000
```
### OpenAI-compatible interface
By default, if you run a service, its endpoint is accessible at `https://<run name>.<gateway domain>`. If you run a model, you can optionally configure the mapping to make it accessible via the OpenAI-compatible interface.
```yaml
type: service
python: "3.11"
env:
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000
resources:
  gpu: 24GB
# Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: NousResearch/Llama-2-7b-chat-hf
```
With this configuration, once the service is up, the model will be accessible at `https://gateway.<gateway domain>` via the OpenAI-compatible interface. See services for more details.
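For illustration, here is a minimal sketch of querying such a model with the OpenAI Python client; the gateway domain and token are placeholders for your own deployment:

```python
from openai import OpenAI

# Placeholders: replace with your gateway domain and dstack token
client = OpenAI(base_url="https://gateway.example.com", api_key="<dstack token>")

completion = client.chat.completions.create(
    model="NousResearch/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```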
### Replicas and auto-scaling
By default, `dstack` runs a single replica of the service. You can configure the number of replicas as well as the auto-scaling policy.
```yaml
type: service
python: "3.11"
env:
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000
resources:
  gpu: 24GB
# Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: NousResearch/Llama-2-7b-chat-hf
replicas: 1..4
scaling:
  metric: rps
  target: 10
```
If you specify the minimum number of replicas as `0`, the service will scale down to zero when there are no requests.
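For example, a sketch of a scale-to-zero configuration could set the lower bound of `replicas` to `0`:

```yaml
replicas: 0..4
scaling:
  metric: rps
  target: 10
```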
### Resources
If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a range (e.g. `24GB..`, `24GB..80GB`, or `..80GB`).
```yaml
type: service
python: "3.11"
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server
    --model mistralai/Mixtral-8X7B-Instruct-v0.1
    --host 0.0.0.0
    --tensor-parallel-size 2 # Match the number of GPUs
port: 8000
resources:
  # 2 GPUs of 80GB
  gpu: 80GB:2
  disk: 200GB
# Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
  format: openai
```
The `gpu` property allows specifying not only memory size but also GPU names and their quantity. Examples: `A100` (one A100), `A10G,A100` (either A10G or A100), `A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB), `A100:40GB:2` (two A100 GPUs of 40GB).
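For illustration, a `resources` block combining a named GPU spec with a memory range might look like this (the values are arbitrary examples):

```yaml
resources:
  # Two A100 GPUs with 40GB of memory each
  gpu: A100:40GB:2
  # At least 24GB of RAM
  memory: 24GB..
  disk: 200GB
```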
**Shared memory**

If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure `shm_size`, e.g. set it to `16GB`.
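A minimal sketch of such a `resources` block:

```yaml
resources:
  gpu: 24GB
  # Increase shared memory for parallel communicating processes
  shm_size: 16GB
```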
### Authorization
By default, the service endpoint requires the `Authorization` header with `"Bearer <dstack token>"`.
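For example, a request to the service endpoint might pass the token like this (the run name, gateway domain, and path are placeholders):

```shell
curl https://<run name>.<gateway domain>/ \
  -H 'Authorization: Bearer <dstack token>'
```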
Authorization can be disabled by setting `auth` to `false`.
```yaml
type: service
python: "3.11"
commands:
  - python3 -m http.server
port: 8000
auth: false
```
### Environment variables
```yaml
type: service
python: "3.11"
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000
resources:
  gpu: 24GB
```
If you don't assign a value to an environment variable (see `HUGGING_FACE_HUB_TOKEN` above), `dstack` will require the value to be passed via the CLI or set in the current process. For instance, you can define environment variables in a `.env` file and utilize tools like `direnv`.
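For illustration, a hypothetical `.env` file together with an `.envrc` that loads it via `direnv` could look like this:

```shell
# .env — values are loaded into the current shell by direnv
HUGGING_FACE_HUB_TOKEN=<your token>

# .envrc — tells direnv to load the .env file
dotenv
```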
### Default environment variables

The following environment variables are available in any run and are passed by `dstack` by default:
| Name | Description |
|---|---|
| `DSTACK_RUN_NAME` | The name of the run |
| `DSTACK_REPO_ID` | The ID of the repo |
| `DSTACK_GPUS_NUM` | The total number of GPUs in the run |
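For example, a minimal sketch that echoes these variables from the `commands` section (the server command is just a placeholder):

```yaml
type: service
commands:
  # These variables are set by dstack automatically
  - echo "Run $DSTACK_RUN_NAME uses $DSTACK_GPUS_NUM GPU(s)"
  - python3 -m http.server
port: 8000
```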
### Spot policy
You can choose whether to use spot instances, on-demand instances, or any available type.
```yaml
type: service
commands:
  - python3 -m http.server
port: 8000
spot_policy: auto
```
The `spot_policy` accepts `spot`, `on-demand`, and `auto`. The default for services is `auto`.
### Backends

By default, `dstack` provisions instances in all configured backends. However, you can specify the list of backends:
```yaml
type: service
commands:
  - python3 -m http.server
port: 8000
backends: [aws, gcp]
```
### Regions

By default, `dstack` uses all configured regions. However, you can specify the list of regions:
```yaml
type: service
commands:
  - python3 -m http.server
port: 8000
regions: [eu-west-1, eu-west-2]
```
The `service` configuration type supports many other options. See below.