Services¶
A service allows you to deploy a model or a web app as an endpoint. It lets you configure dependencies, resources, authorization, auto-scaling rules, etc.
Define a configuration¶
First, create a YAML file in your project folder. Its name must end with `.dstack.yml` (e.g. `.dstack.yml` or `service.dstack.yml` are both acceptable).
```yaml
type: service
name: llama31

# If `image` is not specified, dstack uses its default image
python: "3.11"

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB
```
If you don't specify your Docker image, `dstack` uses the base image (pre-configured with Python, Conda, and essential CUDA drivers).

Note that the `model` property is optional; it isn't needed when deploying a non-OpenAI-compatible model or a regular web app.
Gateway

To enable auto-scaling, or use a custom domain with HTTPS, set up a gateway before running the service. If you're using dstack Sky, a gateway is pre-configured for you.
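A gateway is itself defined with a YAML configuration and created with `dstack apply`. A minimal sketch (the name, backend, region, and domain below are placeholders for your own values):

```yaml
type: gateway
name: example-gateway

# Backend and region where the gateway instance is provisioned
backend: aws
region: eu-west-1

# The domain whose DNS you point at the gateway's address
domain: example.com
```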
Reference
See .dstack.yml for all the options supported by services, along with multiple examples.
Run a service¶
To run a configuration, use the `dstack apply` command.
```shell
$ HF_TOKEN=...
$ dstack apply -f service.dstack.yml

 #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33

Submit the run llama31? [y/n]: y

Provisioning...
---> 100%

Service is published at:
  http://localhost:3000/proxy/services/main/llama31/
```
`dstack apply` automatically uploads the code from the current repo, including your local uncommitted changes. To avoid uploading large files, ensure they are listed in `.gitignore`.
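For example, a `.gitignore` along these lines keeps typical large artifacts out of the upload (the entries are illustrative; adjust them to your project):

```text
.venv/
checkpoints/
*.safetensors
*.bin
```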
Access the endpoint¶
Service¶
If no gateway is created, the service's endpoint will be accessible at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell
$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming."
          }
        ]
      }'
```
When a gateway is configured, the service endpoint will be accessible at `https://<run name>.<gateway domain>`.

By default, the service endpoint requires the `Authorization` header with `Bearer <dstack token>`. Authorization can be disabled by setting `auth` to `false` in the service configuration file.
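For example, adding the property to the configuration above makes the endpoint public (a minimal sketch):

```yaml
type: service
name: llama31
# ...
# Disable the default token-based authorization
auth: false
```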
Model¶
If the service defines the `model` property, the model can be accessed with the OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`, or via the control plane UI's playground.

When a gateway is configured, the OpenAI-compatible endpoint is available at `https://gateway.<gateway domain>/`.
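For example, with the `openai` Python package you can query the registered model through this endpoint. A sketch, assuming a local dstack server and the `main` project (the base URL may need adjusting for your setup):

```python
from openai import OpenAI

# Point the client at dstack's OpenAI-compatible model endpoint.
# Replace the URL and token with your server URL and dstack token.
client = OpenAI(
    base_url="http://localhost:3000/proxy/models/main",
    api_key="<dstack token>",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```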
Manage runs¶
List runs¶
The `dstack ps` command lists all running jobs and their statuses. Use `--watch` (or `-w`) to monitor the live status of runs.
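For example:

```shell
$ dstack ps --watch
```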
Stop a run¶
Once the run exceeds the `max_duration`, or when you use `dstack stop`, the service is stopped. Use `--abort` or `-x` to stop the run abruptly.
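For example, to gracefully stop the service from this guide by its run name:

```shell
$ dstack stop llama31
```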
Manage fleets¶
By default, `dstack apply` reuses idle instances from one of the existing fleets, or creates a new fleet through backends.
Idle duration
To ensure the created fleets are deleted automatically, set `termination_idle_time`. By default, it's set to `5min`.
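For example, a fleet configuration might set it as follows (a sketch; the fleet name and resources are placeholders, and the duration value is just illustrative):

```yaml
type: fleet
name: my-fleet

nodes: 1
resources:
  gpu: 24GB

# Delete instances that stay idle longer than one hour
termination_idle_time: 1h
```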
Creation policy
To ensure `dstack apply` always reuses an existing fleet and doesn't create a new one, pass `--reuse` to `dstack apply` (or set `creation_policy` to `reuse` in the service configuration). The default policy is `reuse_or_create`.
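For example, to apply the service configuration above while only reusing existing fleet instances:

```shell
$ dstack apply -f service.dstack.yml --reuse
```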
What's next?¶
- Check the TGI, vLLM, and NIM examples
- See gateways on how to set up a gateway
- Browse examples
- See fleets on how to manage fleets