Services¶
Services allow you to deploy models or any web app as a secure and scalable endpoint.
When running models, services provide access through the unified OpenAI-compatible endpoint.
Define a configuration¶
First, create a YAML file in your project folder. Its name must end with .dstack.yml (e.g. .dstack.yml or service.dstack.yml are both acceptable).
type: service
name: llama31

# If `image` is not specified, dstack uses its default image
python: "3.11"

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB
If you don't specify your Docker image, dstack uses the base image (pre-configured with Python, Conda, and essential CUDA drivers). Note that the model property is optional and not needed when deploying a non-OpenAI-compatible model or a regular web app.
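If you prefer to pin your own image, the configuration might look like the sketch below. This is a hedged example: it assumes the public vllm/vllm-openai image (which already ships vLLM, so the pip install step and the python property are dropped); it is not the only way to set the image property.

type: service
name: llama31

# Use a custom Docker image instead of dstack's default
image: vllm/vllm-openai:latest

env:
  - HF_TOKEN
commands:
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
port: 8000

resources:
  gpu: 24GB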
Gateway
To enable auto-scaling, or use a custom domain with HTTPS, set up a gateway before running the service. If you're using dstack Sky, a gateway is pre-configured for you.
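A gateway is defined with its own configuration file and applied with dstack apply. The following is a minimal sketch; the backend, region, and domain values are illustrative and must match your own setup.

type: gateway
name: example-gateway

backend: aws
region: eu-west-1
domain: example.com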
Reference
See .dstack.yml for all the options supported by services, along with multiple examples.
Run a service¶
To run a configuration, use the dstack apply command.
$ HF_TOKEN=...
$ dstack apply -f service.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB:2 yes $0.22
2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB:2 yes $0.22
3 gcp us-west4 27xCPU, 150GB, A5000:24GB:3 yes $0.33
Submit the run llama31? [y/n]: y
Provisioning...
---> 100%
Service is published at:
http://localhost:3000/proxy/services/main/llama31/
Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
http://localhost:3000/proxy/models/main/
dstack apply automatically provisions instances, uploads the contents of the repo (incl. your local uncommitted changes), and runs the configuration.
Access the endpoint¶
Service¶
If no gateway is created, the service's endpoint will be accessible at <dstack server URL>/proxy/services/<project name>/<run name>/.
$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
          "messages": [
            {
              "role": "user",
              "content": "Compose a poem that explains the concept of recursion in programming."
            }
          ]
        }'
When a gateway is configured, the service endpoint will be accessible at https://<run name>.<gateway domain>.
By default, the service endpoint requires the Authorization header with Bearer <dstack token>. Authorization can be disabled by setting auth to false in the service configuration file.
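For example, a minimal sketch of disabling authorization for the service defined above (all other properties stay the same):

type: service
name: llama31
# Disable the Authorization header requirement for this endpoint
auth: false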
Model¶
If the service defines the model property, the model can be accessed with the OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/, or via the control plane UI's playground.
When a gateway is configured, the OpenAI-compatible endpoint is available at https://gateway.<gateway domain>/.
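For example, a chat completion request to the model endpoint of the run above might look as follows. This is a sketch that assumes the endpoint follows the standard OpenAI chat/completions path convention:

$ curl http://localhost:3000/proxy/models/main/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'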
Manage runs¶
List runs¶
The dstack ps command lists all running jobs and their statuses. Use --watch (or -w) to monitor the live status of runs.
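For example:

$ dstack ps --watch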
Stop a run¶
Once the run exceeds the max_duration, or when you use dstack stop, the service is stopped. Use --abort or -x to stop the run abruptly.
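For example, to stop the llama31 run from the configuration above:

$ dstack stop llama31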
Manage fleets¶
Creation policy¶
By default, when you run dstack apply with a dev environment, task, or service, dstack reuses idle instances from an existing fleet. If no idle instances match the requirements, it automatically creates a new fleet using the configured backends.
To ensure dstack apply doesn't create a new fleet but reuses an existing one, pass -R (or --reuse) to dstack apply.
$ dstack apply -R -f examples/.dstack.yml
Alternatively, set creation_policy to reuse in the run configuration.
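For example, a minimal sketch of the property added to the service configuration from above:

type: service
name: llama31
# Only reuse idle instances from existing fleets; never provision a new one
creation_policy: reuse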
Termination policy¶
If a fleet is created automatically, it remains idle for 5 minutes and can be reused within that time. To change the default idle duration, set termination_idle_time in the run configuration (e.g., to 0 or a longer duration).
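For example, a minimal sketch that terminates the auto-created fleet as soon as it becomes idle:

type: service
name: llama31
# Don't keep the auto-created fleet idle; terminate it right away
termination_idle_time: 0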
Fleets
For greater control over fleet provisioning, configuration, and lifecycle management, it is recommended to use fleets directly.
What's next?¶
- Read about dev environments, tasks, and repos
- Learn how to manage fleets
- See how to set up gateways
- Check the TGI, vLLM, and NIM examples