Services¶
A service allows you to deploy a web app or a model as a scalable endpoint. It lets you configure dependencies, resources, authorization, auto-scaling rules, etc.
Services are provisioned behind a gateway which provides an HTTPS endpoint mapped to your domain, handles authentication, distributes load, and performs auto-scaling.
Gateways
If you're using the open-source server, you must set up a gateway before you can run a service.
If you're using dstack Sky, the gateway is already set up for you.
Define a configuration¶
First, create a YAML file in your project folder. Its name must end with .dstack.yml (e.g. .dstack.yml or service.dstack.yml are both acceptable).
type: service
# The name is optional; if not specified, it's generated randomly
name: llama31-service
# If `image` is not specified, dstack uses its default image
python: "3.10"
# Required environment variables
env:
- HF_TOKEN
commands:
- pip install vllm
- vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
# Expose the vllm server port
port: 8000
# Use either spot or on-demand instances
spot_policy: auto
resources:
# Change to what is required
gpu: 24GB
# Comment out if you don't want to access the model via https://gateway.<gateway domain>
model:
type: chat
name: meta-llama/Meta-Llama-3.1-8B-Instruct
format: openai
If you don't specify your Docker image, dstack uses the base image (pre-configured with Python, Conda, and essential CUDA drivers).
Auto-scaling
By default, the service is deployed to a single instance. However, you can specify the
number of replicas and scaling policy.
In this case, dstack
auto-scales it based on the load.
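As a sketch, a replica range and scaling policy might be declared like this in the service configuration (check the .dstack.yml reference for the exact options; the values below are illustrative):

```yaml
type: service
name: llama31-service

# Allow dstack to scale between 1 and 4 replicas
replicas: 1..4
scaling:
  # Scale based on requests per second
  metric: rps
  target: 10
```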
Reference
See .dstack.yml for all the options supported by services, along with multiple examples.
Run a service¶
To run a configuration, use the dstack apply
command.
$ export HF_TOKEN=...
$ dstack apply -f service.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB:2 yes $0.22
2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB:2 yes $0.22
3 gcp us-west4 27xCPU, 150GB, A5000:24GB:3 yes $0.33
Submit the run llama31-service? [y/n]: y
Provisioning...
---> 100%
Service is published at https://llama31-service.example.com
dstack apply automatically uploads the code from the current repo, including your local uncommitted changes. To avoid uploading large files, ensure they are listed in .gitignore.
Access the endpoint¶
Once the service is up, its endpoint is accessible at https://<run name>.<gateway domain>.

By default, the service endpoint requires the Authorization header with Bearer <dstack token>.
$ curl https://llama31-service.example.com/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <dstack token>' \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'
Authorization can be disabled by setting auth to false in the service configuration file.
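Sketched as a configuration fragment (assuming the llama31-service example above):

```yaml
type: service
name: llama31-service

# Disable the gateway's token-based authorization
auth: false
```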
Gateway endpoint¶
If the service has the model mapping configured, you can also access the model at https://gateway.<gateway domain> via the OpenAI-compatible interface.
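As a minimal sketch, the OpenAI-compatible endpoint can be called from Python with only the standard library. The gateway domain below is a placeholder; substitute your own domain and dstack token:

```python
import json
import urllib.request

# Placeholder values -- substitute your gateway domain and dstack token
GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"
DSTACK_TOKEN = "<dstack token>"

# OpenAI-compatible chat completion request body
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Say hello."},
    ],
}

request = urllib.request.Request(
    GATEWAY_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {DSTACK_TOKEN}",
    },
)

# To actually send the request:
# response = urllib.request.urlopen(request)
# print(json.load(response))
```

The request mirrors the curl example above; any OpenAI-compatible client library pointed at the gateway URL would work the same way.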
Manage runs¶
List runs¶
The dstack ps command lists all running jobs and their statuses. Use --watch (or -w) to monitor the live status of runs.
Stop a run¶
Once the run exceeds the max_duration, or when you use dstack stop, the service is stopped. Use --abort or -x to stop the run abruptly.
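As an illustration, a maximum duration can be set in the service configuration (the 1h value below is an arbitrary example):

```yaml
type: service
name: llama31-service

# Stop the run automatically after one hour
max_duration: 1h
```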
Manage fleets¶
By default, dstack apply reuses idle instances from one of the existing fleets, or creates a new fleet through backends.
Idle duration
To ensure the created fleets are deleted automatically, set termination_idle_time. By default, it's set to 5min.
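A sketch of overriding the default (the 30min value is an arbitrary example):

```yaml
# Delete idle instances after 30 minutes instead of the default 5min
termination_idle_time: 30min
```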
Creation policy
To ensure dstack apply always reuses an existing fleet and doesn't create a new one, pass --reuse to dstack apply (or set creation_policy to reuse in the service configuration). The default policy is reuse_or_create.
What's next?¶
- Check the TGI and vLLM examples
- See gateways on how to set up a gateway
- Browse examples
- See fleets on how to manage fleets