service

The service configuration type allows running services.

Root reference

port - (Required) int | str | object The port the application listens on.
gateway - (Optional) bool | str The name of the gateway. Specify boolean false to run without a gateway. Specify boolean true to run with the default gateway. Omit to run with the default gateway if there is one, or without a gateway otherwise.
strip_prefix - (Optional) bool Strip the /proxy/services/<project name>/<run name>/ path prefix when forwarding requests to the service. Only takes effect when running the service without a gateway. Defaults to True.
model - (Optional) str | object Mapping of the model for the OpenAI-compatible endpoint provided by dstack. Can be a full model format definition or just a model name. If it's a name, the service is expected to expose an OpenAI-compatible API at the /v1 path.
https - (Optional) bool Enable HTTPS if running with a gateway. Defaults to True.
auth - (Optional) bool Enable authorization. Defaults to True.
scaling - (Optional) object The auto-scaling rules. Required if replicas is set to a range.
rate_limits - (Optional) list[object] Rate limiting rules.
probes - (Optional) list[object] The list of probes to determine service health. If model is set, defaults to a /v1/chat/completions probe. Set explicitly to override.
replicas - (Optional) int | str | list[object] The number of replicas or a list of replica groups. Can be an integer (e.g., 2), a range (e.g., 0..4), or a list of replica groups. Each replica group defines replicas with shared configuration (commands, resources, scaling). When replicas is a list of replica groups, top-level scaling, commands, and resources are not allowed and must be specified in each replica group instead.
commands - (Optional) list[str] The shell commands to run.
name - (Optional) str The run name. If not specified, a random name is generated.
image - (Optional) str The name of the Docker image to run.
user - (Optional) str The user inside the container, user_name_or_id[:group_name_or_id] (e.g., ubuntu, 1000:1000). Defaults to the default user from the image.
privileged - (Optional) bool Run the container in privileged mode.
entrypoint - (Optional) str The Docker entrypoint.
working_dir - (Optional) str The absolute path to the working directory inside the container. Defaults to the image's default working directory.
registry_auth - (Optional) object Credentials for pulling a private Docker image.
python - (Optional) "3.10" | "3.11" | "3.12" | "3.13" | "3.9" The major version of Python. Mutually exclusive with image and docker.
nvcc - (Optional) bool Use image with NVIDIA CUDA Compiler (NVCC) included. Mutually exclusive with image and docker.
single_branch - (Optional) bool Whether to clone and track only the current branch or all remote branches. Relevant only when using remote Git repos. Defaults to false for dev environments and to true for tasks and services.
env - (Optional) list[str] | dict The mapping or the list of environment variables.
shell - (Optional) str The shell used to run commands. Allowed values are sh, bash, or an absolute path, e.g., /usr/bin/zsh. Defaults to /bin/sh if the image is specified, /bin/bash otherwise.
resources - (Optional) object The resource requirements to run the configuration.
priority - (Optional) int The priority of the run, an integer between 0 and 100. dstack tries to provision runs with higher priority first. Defaults to 0.
volumes - (Optional) list[object] The volume mount points.
docker - (Optional) bool Use Docker inside the container. Mutually exclusive with image, python, and nvcc. Overrides privileged.
repos - (Optional) list[object] The list of Git repos.
files - (Optional) list[object] The local-to-container file path mappings.
backends - (Optional) list["amddevcloud" | "aws" | "azure" | "cloudrift" | "cudo" | "datacrunch" | "digitalocean" | "dstack" | "gcp" | "hotaisle" | "kubernetes" | "lambda" | "local" | "remote" | "nebius" | "oci" | "runpod" | "tensordock" | "vastai" | "verda" | "vultr"] The backends to consider for provisioning (e.g., [aws, gcp]).
regions - (Optional) list[str] The regions to consider for provisioning (e.g., [eu-west-1, us-west4, westeurope]).
availability_zones - (Optional) list[str] The availability zones to consider for provisioning (e.g., [eu-west-1a, us-west4-a]).
instance_types - (Optional) list[str] The cloud-specific instance types to consider for provisioning (e.g., [p3.8xlarge, n1-standard-4]).
reservation - (Optional) str The existing reservation to use for instance provisioning. Supports AWS Capacity Reservations, AWS Capacity Blocks, and GCP reservations.
spot_policy - (Optional) "auto" | "on-demand" | "spot" The policy for provisioning spot or on-demand instances: spot, on-demand, auto. Defaults to on-demand.
retry - (Optional) bool | object The policy for resubmitting the run. Defaults to false.
max_duration - (Optional) int | str | "off" The maximum duration of a run (e.g., 2h, 1d, etc.) in a running state, excluding provisioning and pulling. After it elapses, the run is automatically stopped. Use off for unlimited duration. Defaults to off.
stop_duration - (Optional) int | str | "off" The maximum duration of a run's graceful stop. After it elapses, the run is forcibly stopped; this includes force-detaching volumes used by the run. Use off for unlimited duration. Defaults to 5m.
max_price - (Optional) float The maximum instance price per hour, in dollars.
creation_policy - (Optional) "reuse" | "reuse-or-create" The policy for using instances from fleets: reuse, reuse-or-create. Defaults to reuse-or-create.
idle_duration - (Optional) int | str Time to wait before terminating idle instances. Instances are not terminated if the fleet is already at nodes.min. Defaults to 5m for runs and 3d for fleets. Use off for unlimited duration.
utilization_policy - (Optional) object Run termination policy based on utilization.
startup_order - (Optional) "any" | "master-first" | "workers-first" The order in which master and workers jobs are started: any, master-first, workers-first. Defaults to any.
stop_criteria - (Optional) "all-done" | "master-done" The criteria determining when a multi-node run should be considered finished: all-done, master-done. Defaults to all-done.
schedule - (Optional) object The schedule for starting the run at a specified time.
fleets - (Optional) list[str] The fleets considered for reuse.
tags - (Optional) dict The custom tags to associate with the resource. The tags are also propagated to the underlying backend resources. If there is a conflict with backend-level tags, does not override them.
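Putting the core properties together, a minimal service configuration might look like this (the image, model name, commands, and resource values are illustrative, not recommendations):

```yaml
type: service
name: llama-service          # optional; a random name is generated if omitted

image: vllm/vllm-openai:latest
env:
  - MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
commands:
  - vllm serve $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

# A range requires the scaling property
replicas: 1..4
scaling:
  metric: rps
  target: 10
```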

model

type - (Required) "chat" The type of the model. Must be chat.
name - (Required) str The name of the model.
format - (Required) "openai" The serving format. Must be set to openai.
prefix - (Optional) str The base_url prefix (after hostname). Defaults to /v1.

TGI provides an OpenAI-compatible API starting with version 1.4.0, so models served by TGI can be defined with format: openai too.
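For instance, a sketch of an openai format mapping (the model name is illustrative):

```yaml
model:
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
  format: openai
  prefix: /v1   # the default
```

Since model also accepts a plain model name, the same mapping can be shortened to model: meta-llama/Meta-Llama-3.1-8B-Instruct when the service exposes an OpenAI-compatible API at the /v1 path.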

The tgi format has the following properties:

type - (Required) "chat" The type of the model. Must be chat.
name - (Required) str The name of the model.
format - (Required) "tgi" The serving format. Must be set to tgi.
chat_template - (Optional) str The custom prompt template for the model. If not specified, the default prompt template from the HuggingFace Hub configuration will be used.
eos_token - (Optional) str The custom end of sentence token. If not specified, the default end of sentence token from the HuggingFace Hub configuration will be used.
Chat template

By default, dstack loads the chat template from the model's repository. If it is not present there, manual configuration is required.

type: service

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ
commands:
  - text-generation-launcher --port 8000 --trust-remote-code --quantize gptq
port: 8000

resources:
  gpu: 80GB

# Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: TheBloke/Llama-2-13B-chat-GPTQ
  format: tgi
  chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '<s>[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content.strip() + ' </s>' }}{% endif %}{% endfor %}"
  eos_token: "</s>"

Please note that model mapping is an experimental feature with the following limitations:

  1. Doesn't work if your chat_template uses bos_token. As a workaround, replace bos_token inside chat_template with the token content itself.
  2. Doesn't work if eos_token is defined in the model repository as a dictionary. As a workaround, set eos_token manually, as shown in the example above (see Chat template).

If you encounter any other issues, please file a GitHub issue.

scaling

metric - (Required) "rps" The target metric to track. Currently, the only supported value is rps (meaning requests per second).
target - (Required) float The target value of the metric. The number of replicas is calculated based on this number and automatically adjusts (scales up or down) as this metric changes.
scale_up_delay - (Optional) int | str The delay in seconds before scaling up. Defaults to 300.
scale_down_delay - (Optional) int | str The delay in seconds before scaling down. Defaults to 600.
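For example, a sketch that scales between zero and four replicas based on requests per second (the target value is illustrative):

```yaml
replicas: 0..4
scaling:
  metric: rps
  target: 10
  scale_up_delay: 120    # scale up sooner than the 300s default
  scale_down_delay: 600  # the default
```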

rate_limits

rate_limits[n]

prefix - (Optional) str URL path prefix to which this limit is applied. If an incoming request matches several prefixes, the longest prefix is applied. Defaults to /.
key - (Optional) object The partitioning key. Each incoming request belongs to a partition and rate limits are applied per partition. Defaults to partitioning by client IP address.
rps - (Required) float Max allowed number of requests per second. Requests are tracked at millisecond granularity. For example, rps: 10 means at most 1 request per 100ms.
burst - (Optional) int Max number of requests that can be passed to the service ahead of the rate limit.
rate_limits[n].key

Partition requests by client IP address.

type - (Required) "ip_address" Partitioning type. Must be ip_address.

Partition requests by the value of a header.

type - (Required) "header" Partitioning type. Must be header.
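As a sketch, the rules below apply a stricter limit to the completions endpoint and a per-IP limit to everything else (the values are illustrative):

```yaml
rate_limits:
  # The longest matching prefix wins for /v1/chat/completions requests
  - prefix: /v1/chat/completions
    rps: 2
    burst: 4
  # Applies to all other paths; partitioned by client IP (the default key)
  - rps: 10
    key:
      type: ip_address
```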

probes

probes[n]

type - (Required) "http" The probe type. Must be http.
url - (Optional) str The URL to request. Defaults to /.
method - (Optional) "delete" | "get" | "head" | "patch" | "post" | "put" The HTTP method to use for the probe (e.g., get, post, etc.). Defaults to get.
headers - (Optional) list A list of HTTP headers to include in the request.
body - (Optional) str The HTTP request body to send with the probe.
timeout - (Optional) int | str Maximum amount of time the HTTP request is allowed to take. Defaults to 10s.
interval - (Optional) int | str Minimum amount of time between the end of one probe execution and the start of the next. Defaults to 15s.
ready_after - (Optional) int The number of consecutive successful probe executions required for the replica to be considered ready. Used during rolling deployments. Defaults to 1.
until_ready - (Optional) bool If true, the probe will stop being executed as soon as it reaches the ready_after threshold of successful executions. Defaults to false.
probes[n].headers
probes[n].headers[m]
name - (Required) str The name of the HTTP header.
value - (Required) str The value of the HTTP header.
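A sketch of an HTTP probe against a hypothetical /health endpoint (the URL and header are illustrative):

```yaml
probes:
  - type: http
    url: /health
    method: get
    timeout: 5s
    interval: 10s
    ready_after: 3   # require 3 consecutive successes before the replica is ready
    headers:
      - name: X-Probe
        value: dstack
```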

replicas

replicas[n]

name - (Optional) str The name of the replica group. If not provided, defaults to '0', '1', etc., based on position.
count - (Required) int | str The number of replicas. Can be a number (e.g. 2) or a range (0..4 or 1..8). If it's a range, the scaling property is required.
scaling - (Optional) object The auto-scaling rules. Required if count is set to a range.
resources - (Optional) object The resource requirements for replicas in this group.
commands - (Optional) list[str] The shell commands to run for replicas in this group.
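A sketch with two replica groups; note that scaling, commands, and resources move into the groups (the GPU sizes and commands are illustrative):

```yaml
replicas:
  - name: small
    count: 1
    resources:
      gpu: 24GB
    commands:
      - vllm serve $MODEL --port 8000
  - name: large
    count: 0..2
    scaling:           # required because count is a range
      metric: rps
      target: 5
    resources:
      gpu: 80GB
    commands:
      - vllm serve $MODEL --port 8000 --tensor-parallel-size 2
```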

retry

on_events - (Optional) list["no-capacity" | "interruption" | "error"] The list of events that should be handled with retry. Supported events are no-capacity, interruption, error. Omit to retry on all events.
duration - (Optional) int | str The maximum period of retrying the run, e.g., 4h or 1d. The period is calculated as the run age for the no-capacity event, and as the time passed since the last interruption or error for the interruption and error events.
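For example, to retry for up to four hours on capacity shortages and spot interruptions, but not on errors:

```yaml
retry:
  on_events: [no-capacity, interruption]
  duration: 4h
```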

utilization_policy

min_gpu_utilization - (Required) int Minimum required GPU utilization, percent. If any GPU has utilization below specified value during the whole time window, the run is terminated.
time_window - (Required) int | str The time window of metric samples taken into account to measure utilization (e.g., 30m, 1h). Minimum is 5m.
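For example, to terminate the run if any GPU stays below 10% utilization for a full hour:

```yaml
utilization_policy:
  min_gpu_utilization: 10
  time_window: 1h
```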

schedule

cron - (Required) str | list[str] A cron expression or a list of cron expressions specifying the UTC time when the run needs to be started.
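For example, to start the run every weekday at 08:00 UTC:

```yaml
schedule:
  cron: "0 8 * * 1-5"
```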

resources

cpu - (Optional) int | str | object The CPU requirements.
memory - (Optional) int | str The RAM size (e.g., 8GB). Defaults to 8GB.. (at least 8GB).
shm_size - (Optional) int | str The size of shared memory (e.g., 8GB). If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure this.
gpu - (Optional) int | str | object The GPU requirements.
disk - (Optional) int | str | object The disk resources.

resources.cpu

arch - (Optional) "arm" | "x86" The CPU architecture, one of: x86, arm.
count - (Optional) int | str The number of CPU cores. Defaults to 2.. (at least 2).

resources.gpu

vendor - (Optional) "amd" | "google" | "intel" | "nvidia" | "tenstorrent" The vendor of the GPU/accelerator, one of: nvidia, amd, google (alias: tpu), intel, tenstorrent.
name - (Optional) str | list[str] The name of the GPU (e.g., A100 or H100).
count - (Optional) int | str The number of GPUs. Defaults to 1.. (at least 1).
memory - (Optional) int | str The RAM size (e.g., 16GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).
total_memory - (Optional) int | str The total RAM size (e.g., 32GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).
compute_capability - (Optional) float | str The minimum compute capability of the GPU (e.g., 7.5).

resources.disk

size - (Required) int | str Disk size.
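A sketch combining the resource properties above (the values and the GPU name are illustrative):

```yaml
resources:
  cpu: 4..        # at least 4 cores
  memory: 32GB..  # at least 32 GB of RAM
  shm_size: 16GB
  gpu:
    name: A100
    memory: 40GB
    count: 1
  disk: 200GB..
```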

registry_auth

username - (Required) str The username.
password - (Required) str The password or access token.

volumes[n]

name - (Required) str | list[str] The network volume name or the list of network volume names to mount. If a list is specified, one of the volumes in the list will be mounted. Specify volumes from different backends/regions to increase availability.
path - (Required) str The absolute container path to mount the volume at.
instance_path - (Required) str The absolute path on the instance (host).
path - (Required) str The absolute path in the container.
optional - (Optional) bool Allow running without this volume in backends that do not support instance volumes.
Short syntax

The short syntax for volumes is a colon-separated string in the form of source:destination.

  • volume-name:/container/path for network volumes
  • /instance/path:/container/path for instance volumes
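For example, mounting one network volume in full syntax and one instance volume in short syntax (the names and paths are illustrative):

```yaml
volumes:
  # Network volume mounted by name
  - name: my-data-volume
    path: /data
  # Instance volume in source:destination form
  - /mnt/cache:/cache
```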

repos[n]

Currently, a maximum of one repo is supported.

Either local_path or url must be specified.

local_path - (Optional) str The path to the Git repo on the user's machine. Relative paths are resolved relative to the parent directory of the configuration file. Mutually exclusive with url.
url - (Optional) str The Git repo URL. Mutually exclusive with local_path.
branch - (Optional) str The repo branch. Defaults to the active branch for local paths and the default branch for URLs.
hash - (Optional) str The commit hash.
path - (Optional) str The repo path inside the run container. Relative paths are resolved relative to the working directory. Defaults to ..
if_exists - (Optional) "error" | "skip" The action to be taken if path exists and is not empty. One of: error, skip. Defaults to error.
if_exists action

If the path already exists and is a non-empty directory, by default the run is terminated with an error. This can be changed with the if_exists option:

  • error – do not try to check out, terminate the run with an error (the default action since 0.20.0)
  • skip – do not try to check out, skip the repo (the only action available before 0.20.0)

Note, if the path exists and is not a directory (e.g., a regular file), this is always an error that cannot be ignored with the skip action.
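A full-syntax sketch that checks out a specific branch and skips checkout if the path is already non-empty (the URL is illustrative):

```yaml
repos:
  - url: https://github.com/org/repo
    branch: main
    path: repo
    if_exists: skip
```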

Short syntax

The short syntax for repos is a colon-separated string in the form of local_path_or_url:path.

  • .:/repo
  • ..:repo
  • ~/repos/demo:~/repo
  • https://github.com/org/repo:~/data/repo
  • git@github.com:org/repo.git:data/repo

files[n]

local_path - (Required) str The path on the user's machine. Relative paths are resolved relative to the parent directory of the configuration file.
path - (Required) str The path in the container. Relative paths are resolved relative to the working directory.
Short syntax

The short syntax for files is a colon-separated string in the form of local_path[:path] where path is optional and can be omitted if it's equal to local_path.

  • ~/.bashrc, same as ~/.bashrc:~/.bashrc
  • /opt/myorg, same as /opt/myorg/ and /opt/myorg:/opt/myorg
  • libs/patched_libibverbs.so.1:/lib/x86_64-linux-gnu/libibverbs.so.1
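The same mappings in full syntax:

```yaml
files:
  - local_path: ~/.bashrc
    path: ~/.bashrc        # same as the short form ~/.bashrc
  - local_path: libs/patched_libibverbs.so.1
    path: /lib/x86_64-linux-gnu/libibverbs.so.1
```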