Services¶
Services make it very easy to deploy any kind of model or application as a public, secure, and scalable endpoint.
Gateway¶
If you're using the open-source server, you first have to set up a gateway.
For example, if your domain is example.com, go ahead and run the dstack gateway create command:
$ dstack gateway create --domain example.com --region eu-west-1 --backend aws
Creating gateway...
---> 100%
 BACKEND  REGION     NAME          ADDRESS        DOMAIN       DEFAULT
 aws      eu-west-1  sour-fireant  52.148.254.14  example.com  ✓
Afterward, in your domain's DNS settings, add an A record for *.example.com pointing to the IP address of the gateway.
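Once the record has propagated, you can double-check it with any DNS lookup tool. For example (an illustrative check, not a dstack command; the address is the one from the gateway output above), any name under the wildcard should resolve to the gateway's IP:
$ dig +short my-service.example.com
52.148.254.14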
Now, if you run a service, dstack will make its endpoint available at https://<run name>.<gateway domain>.
If your service has model mapping configured, dstack will automatically make your model available at https://gateway.<gateway domain> via the OpenAI-compatible interface.
If you're using dstack Sky, the gateway is set up for you.
Configuration¶
First, create a YAML file in your project folder. Its name must end with .dstack.yml (e.g. .dstack.yml or train.dstack.yml are both acceptable).
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1
commands:
  - text-generation-launcher --port 8000 --trust-remote-code
port: 8000
resources:
  gpu: 80GB
The YAML file allows you to specify your own Docker image, environment variables, resource requirements, etc.
If image is not specified, dstack uses its own (pre-configured with Python, Conda, and essential CUDA drivers).
.dstack.yml
For more details on the file syntax, refer to the .dstack.yml reference.
Environment variables¶
Environment variables can be set either within the configuration file or passed via the CLI.
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1
commands:
  - text-generation-launcher --port 8000 --trust-remote-code
port: 8000
resources:
  gpu: 80GB
If you don't assign a value to an environment variable (see HUGGING_FACE_HUB_TOKEN above), dstack will require the value to be passed via the CLI or set in the current process.
For instance, you can define environment variables in a .env file and use tools like direnv to load them.
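As a sketch of that approach (the file names and the direnv dotenv directive here are just one possible setup, not something dstack requires), you could keep the token in a .env file and let direnv export it whenever you enter the project folder:
$ echo 'HUGGING_FACE_HUB_TOKEN=<your token>' > .env
$ echo 'dotenv' > .envrc
$ direnv allow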
Model mapping¶
By default, if you run a service, its endpoint is accessible at https://<run name>.<gateway domain>.
If you run a model, you can optionally configure the mapping to make it accessible via the OpenAI-compatible interface.
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1
commands:
  - text-generation-launcher --port 8000 --trust-remote-code
port: 8000
resources:
  gpu: 80GB
# Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: mistralai/Mistral-7B-Instruct-v0.1
  format: tgi
With this configuration, once the service is up, you'll be able to access the model at https://gateway.<gateway domain> via the OpenAI-compatible interface.
The format supports only tgi (Text Generation Inference) and openai (if you are using Text Generation Inference or vLLM in OpenAI-compatible mode).
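For illustration, once such a service is running behind a gateway at example.com, the model can be queried with plain curl or any OpenAI-compatible client. The exact path below (/chat/completions under the gateway domain) is an assumption based on the standard OpenAI API layout rather than something taken from this page:
$ curl https://gateway.example.com/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.1", "messages": [{"role": "user", "content": "What is your favourite condiment?"}]}'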
Chat template
By default, dstack loads the chat template from the model's repository. If it is not present there, manual configuration is required.
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ
commands:
  - text-generation-launcher --port 8000 --trust-remote-code --quantize gptq
port: 8000
resources:
  gpu: 80GB
# Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: TheBloke/Llama-2-13B-chat-GPTQ
  format: tgi
  chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '<s>[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' </s>' }}{% endif %}{% endfor %}"
  eos_token: "</s>"
Limitations¶
Please note that model mapping is an experimental feature with the following limitations:
- Doesn't work if your chat_template uses bos_token. As a workaround, replace bos_token inside chat_template with the token content itself.
- Doesn't work if eos_token is defined in the model repository as a dictionary. As a workaround, set eos_token manually, as shown in the example above (see Chat template).
If you encounter any other issues, please make sure to file a GitHub issue.
Replicas and scaling¶
By default, dstack runs a single replica of the service.
You can configure the number of replicas as well as the auto-scaling policy.
type: service
python: "3.11"
env:
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000
replicas: 1..4
scaling:
  metric: rps
  target: 10
# Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: NousResearch/Llama-2-7b-chat-hf
If you specify the minimum number of replicas as 0, the service will scale down to zero when there are no requests.
Profiles
If you'd like to reuse certain parameters (such as spot policy, retry and max duration, max price, regions, instance types, etc.) across runs, you can define them via .dstack/profiles.yml.
Running¶
To run a configuration, use the dstack run command followed by the working directory path, configuration file path, and any other options.
$ dstack run . -f serve.dstack.yml
 BACKEND     REGION         RESOURCES                     SPOT  PRICE
 tensordock  unitedkingdom  10xCPU, 80GB, 1xA100 (80GB)   no    $1.595
 azure       westus3        24xCPU, 220GB, 1xA100 (80GB)  no    $3.673
 azure       westus2        24xCPU, 220GB, 1xA100 (80GB)  no    $3.673
Continue? [y/n]: y
Provisioning...
---> 100%
Service is published at https://yellow-cat-1.example.com
When dstack submits the task, it uses the current folder contents.
.gitignore
If there are large files or folders you'd like to avoid uploading, you can list them in .gitignore.
The dstack run command allows specifying many things, including spot policy, retry and max duration, max price, regions, instance types, and much more.
Service endpoint¶
Once the service is up, its endpoint is accessible at https://<run name>.<gateway domain>.
Model endpoint
If the service has model mapping configured, you will also be able to access the model at https://gateway.<gateway domain> via the OpenAI-compatible interface.
Authorization¶
By default, the service endpoint requires the Authorization header with "Bearer <dstack token>".
$ curl https://yellow-cat-1.example.com/generate \
    -X POST \
    -d '{"inputs":"<s>[INST] What is your favourite condiment?[/INST]"}' \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>'
Authorization can be disabled by setting auth to false in the service configuration file.
Managing runs¶
Stopping runs
When you use dstack stop, the service and its cloud resources are deleted.
Listing runs
The dstack ps command lists all running runs and their status.
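For example, a minimal clean-up sketch using the run name from the provisioning output above (assuming the run name is passed directly to dstack stop):
$ dstack ps
$ dstack stop yellow-cat-1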
What's next?¶
- Check the Text Generation Inference and vLLM examples
- Check the .dstack.yml reference for more details and examples
- Browse examples