# service

The `service` configuration type allows running services.
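For example, a minimal service that deploys an LLM with vLLM could look like this (the model name, commands, and resource values are illustrative assumptions, not part of the reference):

```yaml
type: service
# A random run name is generated if `name` is omitted
name: llama-service

python: "3.11"
env:
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
# The port the application listens on
port: 8000

resources:
  gpu: 24GB
```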
## Root reference
- `port` - The port that the application listens on, or a port mapping.
- `gateway` - (Optional) The name of the gateway. Specify boolean `false` to run without a gateway. Omit to run with the default gateway.
- `model` - (Optional) Mapping of the model for the OpenAI-compatible endpoint provided by `dstack`. Can be a full model format definition or just a model name. If it's a name, the service is expected to expose an OpenAI-compatible API at the `/v1` path.
- `https` - (Optional) Enable HTTPS if running with a gateway. Defaults to `True`.
- `auth` - (Optional) Enable authorization. Defaults to `True`.
- `replicas` - (Optional) The number of replicas. Can be a number (e.g. `2`) or a range (`0..4` or `1..8`). If it's a range, the `scaling` property is required. Defaults to `1`.
- `scaling` - (Optional) The auto-scaling rules. Required if `replicas` is set to a range.
- `name` - (Optional) The run name. If not specified, a random name is generated.
- `image` - (Optional) The name of the Docker image to run.
- `user` - (Optional) The user inside the container, `user_name_or_id[:group_name_or_id]` (e.g., `ubuntu`, `1000:1000`). Defaults to the default `image` user.
- `privileged` - (Optional) Run the container in privileged mode.
- `entrypoint` - (Optional) The Docker entrypoint.
- `working_dir` - (Optional) The path to the working directory inside the container. It's specified relative to the repository directory (`/workflow`) and should be inside it. Defaults to `"."`.
- `registry_auth` - (Optional) Credentials for pulling a private Docker image.
- `python` - (Optional) The major version of Python. Mutually exclusive with `image`.
- `nvcc` - (Optional) Use an image with the NVIDIA CUDA Compiler (NVCC) included. Mutually exclusive with `image`.
- `env` - (Optional) The mapping or the list of environment variables.
- `resources` - (Optional) The resource requirements to run the configuration.
- `volumes` - (Optional) The volume mount points.
- `commands` - (Optional) The bash commands to run.
- `backends` - (Optional) The backends to consider for provisioning (e.g., `[aws, gcp]`).
- `regions` - (Optional) The regions to consider for provisioning (e.g., `[eu-west-1, us-west4, westeurope]`).
- `instance_types` - (Optional) The cloud-specific instance types to consider for provisioning (e.g., `[p3.8xlarge, n1-standard-4]`).
- `reservation` - (Optional) The existing reservation to use for instance provisioning. Supports AWS Capacity Reservations and Capacity Blocks.
- `spot_policy` - (Optional) The policy for provisioning spot or on-demand instances: `spot`, `on-demand`, or `auto`. Defaults to `on-demand`.
- `retry` - (Optional) The policy for resubmitting the run. Defaults to `false`.
- `max_duration` - (Optional) The maximum duration of a run (e.g., `2h`, `1d`, etc.). After it elapses, the run is forced to stop. Defaults to `off`.
- `max_price` - (Optional) The maximum instance price per hour, in dollars.
- `creation_policy` - (Optional) The policy for using instances from the pool. Defaults to `reuse-or-create`.
- `idle_duration` - (Optional) Time to wait before terminating idle instances. Defaults to `5m` for runs and `3d` for fleets. Use `off` for unlimited duration.
- `termination_policy` - (Optional) Deprecated in favor of `idle_duration`.
- `termination_idle_time` - (Optional) Deprecated in favor of `idle_duration`.
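A sketch of how several of these provisioning-related properties can combine in one configuration (the image, regions, and price cap below are hypothetical values, not recommendations):

```yaml
type: service
name: llm-service                  # optional; otherwise a random name is generated
image: my-org/my-llm-server:latest # hypothetical private image
port: 8000

resources:
  gpu: 24GB

# Provisioning preferences
backends: [aws, gcp]
regions: [us-east-1, us-central1]
spot_policy: auto                  # allow either spot or on-demand instances
max_price: 2.5                     # dollars per hour
max_duration: off                  # never force-stop the run
```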
## model

**OpenAI format**

- `type` - The type of the model. Must be `chat`.
- `name` - The name of the model.
- `format` - The serving format. Must be set to `openai`.
- `prefix` - (Optional) The `base_url` prefix (after the hostname). Defaults to `/v1`.
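For instance, if the service runs a server that already exposes an OpenAI-compatible API at `/v1`, such as vLLM, the mapping could be as simple as (the model name is illustrative):

```yaml
model:
  type: chat
  name: NousResearch/Llama-2-7b-chat-hf
  format: openai
```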
TGI provides an OpenAI-compatible API starting with version 1.4.0, so models served by TGI can be defined with `format: openai` too.
**TGI format**

- `type` - The type of the model. Must be `chat`.
- `name` - The name of the model.
- `format` - The serving format. Must be set to `tgi`.
- `chat_template` - (Optional) The custom prompt template for the model. If not specified, the default prompt template from the HuggingFace Hub configuration will be used.
- `eos_token` - (Optional) The custom end of sentence token. If not specified, the default end of sentence token from the HuggingFace Hub configuration will be used.
**Chat template**

By default, `dstack` loads the chat template from the model's repository. If it is not present there, manual configuration is required:
```yaml
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ
commands:
  - text-generation-launcher --port 8000 --trust-remote-code --quantize gptq
port: 8000
resources:
  gpu: 80GB
# Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: TheBloke/Llama-2-13B-chat-GPTQ
  format: tgi
  chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '<s>[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' </s>' }}{% endif %}{% endfor %}"
  eos_token: "</s>"
```
Please note that model mapping is an experimental feature with the following limitations:

- Doesn't work if your `chat_template` uses `bos_token`. As a workaround, replace `bos_token` inside `chat_template` with the token content itself.
- Doesn't work if `eos_token` is defined in the model repository as a dictionary. As a workaround, set `eos_token` manually, as shown in the example above (see Chat template).
If you encounter any other issues, please make sure to file a GitHub issue.
## scaling

- `metric` - The target metric to track. Currently, the only supported value is `rps` (requests per second).
- `target` - The target value of the metric. The number of replicas is calculated based on this number and automatically adjusts (scales up or down) as the metric changes.
- `scale_up_delay` - (Optional) The delay in seconds before scaling up. Defaults to `300`.
- `scale_down_delay` - (Optional) The delay in seconds before scaling down. Defaults to `600`.
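For example, to let the service scale between one and four replicas based on request rate (the target value is illustrative):

```yaml
replicas: 1..4
scaling:
  metric: rps   # requests per second
  target: 10    # aim for roughly 10 rps per replica
```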
## retry

- `on_events` - The list of events that should be handled with retry. Supported events are `no-capacity`, `interruption`, and `error`.
- `duration` - (Optional) The maximum period of retrying the run, e.g., `4h` or `1d`.
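For example, to retry for up to one day when the run fails due to missing capacity or a spot interruption:

```yaml
retry:
  on_events: [no-capacity, interruption]
  duration: 1d
```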
## resources

- `cpu` - (Optional) The number of CPU cores. Defaults to `2..`.
- `memory` - (Optional) The RAM size (e.g., `8GB`). Defaults to `8GB..`.
- `shm_size` - (Optional) The size of shared memory (e.g., `8GB`). If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure this.
- `gpu` - (Optional) The GPU requirements. Can be set to a number, a string (e.g. `A100`, `80GB:2`, etc.), or an object.
- `disk` - (Optional) The disk resources.
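A sketch combining these properties (all values are illustrative):

```yaml
resources:
  cpu: 4..          # at least 4 CPU cores
  memory: 16GB..    # at least 16GB of RAM
  shm_size: 8GB     # shared memory, e.g. for PyTorch dataloaders
  gpu: 80GB:2       # two GPUs with 80GB of memory each
  disk:
    size: 100GB..
```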
## resources.gpu

- `vendor` - (Optional) The vendor of the GPU/accelerator, one of: `nvidia`, `amd`, `google` (alias: `tpu`).
- `name` - (Optional) The GPU name or list of names.
- `count` - (Optional) The number of GPUs. Defaults to `1`.
- `memory` - (Optional) The memory size per GPU (e.g., `16GB`). Can be set to a range (e.g. `16GB..` or `16GB..80GB`).
- `total_memory` - (Optional) The total memory size across all GPUs (e.g., `32GB`). Can be set to a range (e.g. `16GB..` or `16GB..80GB`).
- `compute_capability` - (Optional) The minimum compute capability of the GPU (e.g., `7.5`).
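The object form makes the same requirements explicit, e.g. (names and values are illustrative):

```yaml
resources:
  gpu:
    vendor: nvidia
    name: [A100, H100]   # accept either GPU
    memory: 40GB..       # at least 40GB per GPU
    count: 2
```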
## resources.disk

- `size` - The disk size. Can be set to a range (e.g., `100GB..` or `100GB..200GB`).
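For example, to request at least 100GB of disk while capping it at 200GB:

```yaml
resources:
  disk:
    size: 100GB..200GB
```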
## registry_auth

- `username` - The username.
- `password` - The password or access token.
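For example, to pull a private image (the image name and credentials below are placeholders):

```yaml
image: ghcr.io/my-org/my-llm-server:latest
registry_auth:
  username: my-username
  password: my-access-token
```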
## volumes[n]

**Short syntax**

The short syntax for volumes is a colon-separated string in the form of `source:destination`:

- `volume-name:/container/path` for network volumes
- `/instance/path:/container/path` for instance volumes
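For example, both forms of the short syntax in one configuration (the volume name and paths are illustrative):

```yaml
volumes:
  - my-volume:/volume_data      # network volume
  - /mnt/data:/container_data   # instance volume
```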