fleet

The fleet configuration type allows creating and updating fleets.

Root reference

name - (Optional) str The fleet name.
env - (Optional) list[str] | dict The mapping or the list of environment variables.
ssh_config - (Optional) object The parameters for adding instances via SSH.
nodes - (Optional) int | str | object The number of instances in the cloud fleet.
placement - (Optional) "any" | "cluster" The placement of instances: any or cluster.
reservation - (Optional) str The existing reservation to use for instance provisioning. Supports AWS Capacity Reservations, AWS Capacity Blocks, and GCP reservations.
resources - (Optional) object The resources requirements.
blocks - (Optional) int | "auto" The number of blocks to split the instance into: a number or auto. auto means as many as possible. The number of GPUs and CPUs must be divisible by the number of blocks. Defaults to 1, i.e. do not split.
backends - (Optional) list["amddevcloud" | "aws" | "azure" | "cloudrift" | "cudo" | "datacrunch" | "digitalocean" | "dstack" | "gcp" | "hotaisle" | "kubernetes" | "lambda" | "local" | "remote" | "nebius" | "oci" | "runpod" | "tensordock" | "vastai" | "verda" | "vultr"] The backends to consider for provisioning (e.g., [aws, gcp]).
regions - (Optional) list[str] The regions to consider for provisioning (e.g., [eu-west-1, us-west4, westeurope]).
availability_zones - (Optional) list[str] The availability zones to consider for provisioning (e.g., [eu-west-1a, us-west4-a]).
instance_types - (Optional) list[str] The cloud-specific instance types to consider for provisioning (e.g., [p3.8xlarge, n1-standard-4]).
spot_policy - (Optional) "auto" | "on-demand" | "spot" The policy for provisioning spot or on-demand instances: spot, on-demand, auto. Defaults to on-demand.
retry - (Optional) bool | object The policy for provisioning retry. Defaults to false.
max_price - (Optional) float The maximum instance price per hour, in dollars.
idle_duration - (Optional) int | str Time to wait before terminating idle instances. Instances are not terminated if the fleet is already at nodes.min. Defaults to 5m for runs and 3d for fleets. Use off for unlimited duration.
tags - (Optional) dict The custom tags to associate with the resource. The tags are also propagated to the underlying backend resources. If a tag conflicts with a backend-level tag, the backend-level tag is not overridden.
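
As an illustration of how the root properties above combine, here is a minimal sketch of a cloud fleet configuration. The fleet name, backend, region, price cap, and tag values are placeholders chosen for the example, and the `type: fleet` key is assumed from the configuration type named at the top of this page.

```yaml
type: fleet
# Placeholder fleet name
name: my-cloud-fleet

# Number of instances and their placement
nodes: 2
placement: cluster

# Restrict provisioning to specific backends and regions (placeholder values)
backends: [aws]
regions: [eu-west-1]

# Provision on-demand instances only, capped at an hourly price in dollars
spot_policy: on-demand
max_price: 10.0

# Terminate instances after one hour of idleness
idle_duration: 1h

# Custom tags, also propagated to backend resources (placeholder values)
tags:
  team: research
```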

ssh_config

user - (Optional) str The user to log in with on all hosts.
port - (Optional) int The SSH port to connect to.
identity_file - (Optional) str The private key to use for all hosts.
proxy_jump - (Optional) object The SSH proxy configuration for all hosts.
hosts - (Required) list[str | object] The per-host connection parameters: a hostname or an object that overrides the default SSH parameters.
network - (Optional) str The network address for cluster setup in the format <ip>/<netmask>. dstack will use IP addresses from this network for communication between hosts. If not specified, dstack will use IPs from the first found internal network.
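
The ssh_config properties above can be sketched as a minimal SSH fleet. The user, key path, host addresses, and network are hypothetical values for illustration only, and `type: fleet` is assumed from the configuration type this page describes.

```yaml
type: fleet
# Placeholder fleet name
name: my-ssh-fleet

ssh_config:
  # Defaults applied to every host below
  user: ubuntu
  port: 22
  identity_file: ~/.ssh/id_rsa
  # Hosts given as plain hostnames (placeholder addresses)
  hosts:
    - 192.168.100.10
    - 192.168.100.11
  # Network used for communication between hosts, as <ip>/<netmask>
  network: 192.168.100.0/24
```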

ssh_config.proxy_jump

hostname - (Required) str The IP address or domain of the proxy host.
port - (Optional) int The SSH port of the proxy host.
user - (Required) str The user to log in with on the proxy host.
identity_file - (Required) str The private key to use for the proxy host.
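
A fleet-wide proxy might be configured roughly as follows. This is a sketch only: the bastion hostname, users, key paths, and host addresses are hypothetical.

```yaml
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  # All hosts are reached through this proxy host (placeholder values)
  proxy_jump:
    hostname: bastion.example.com
    port: 22
    user: jump-user
    identity_file: ~/.ssh/bastion_key
  hosts:
    - 10.0.0.5
    - 10.0.0.6
```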

ssh_config.hosts[n]

hostname - (Required) str The IP address or domain to connect to.
port - (Optional) int The SSH port to connect to for this host.
user - (Optional) str The user to log in with for this host.
identity_file - (Optional) str The private key to use for this host.
proxy_jump - (Optional) object The SSH proxy configuration for this host.
internal_ip - (Optional) str The internal IP of the host used for communication inside the cluster. If not specified, dstack will use the IP address from network or from the first found internal network.
blocks - (Optional) int | "auto" The number of blocks to split the instance into: a number or auto. auto means as many as possible. The number of GPUs and CPUs must be divisible by the number of blocks. Defaults to 1, i.e. do not split.

ssh_config.hosts[n].proxy_jump

hostname - (Required) str The IP address or domain of the proxy host.
port - (Optional) int The SSH port of the proxy host.
user - (Required) str The user to log in with on the proxy host.
identity_file - (Required) str The private key to use for the proxy host.
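
To show how per-host entries can override the fleet-wide defaults, the following sketch mixes a plain hostname with an object entry. All addresses, users, ports, and key paths are hypothetical.

```yaml
ssh_config:
  # Fleet-wide defaults
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    # Plain hostname: uses the defaults above
    - 10.0.0.5
    # Object form: these values apply to this host only
    - hostname: 10.0.0.6
      port: 2222
      user: admin
      identity_file: ~/.ssh/other_key
      internal_ip: 10.0.1.6
      # Split this host into as many blocks as possible
      blocks: auto
      # Proxy used for this host only
      proxy_jump:
        hostname: bastion.example.com
        user: jump-user
        identity_file: ~/.ssh/bastion_key
```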

resources

cpu - (Optional) int | str | object The CPU requirements.
memory - (Optional) int | str The RAM size (e.g., 8GB). Defaults to 8GB.. (a range meaning 8GB or more).
shm_size - (Optional) int | str The size of shared memory (e.g., 8GB). If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure this.
gpu - (Optional) int | str | object The GPU requirements.
disk - (Optional) int | str | object The disk resources.
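
A short-form resources block, using only the scalar forms of each property, might look like this. The sizes and counts are arbitrary example values.

```yaml
resources:
  # Scalar forms: an integer count or a size string
  cpu: 8
  memory: 32GB
  # Shared memory, e.g. for parallel communicating processes
  shm_size: 8GB
  gpu: 1
  disk: 200GB
```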

resources.cpu

arch - (Optional) "arm" | "x86" The CPU architecture, one of: x86, arm.
count - (Optional) int | str The number of CPU cores. Defaults to 2.. (a range meaning 2 or more).
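
In object form, the CPU requirements could be written as in the sketch below. The range string is quoted here as a precaution so that YAML parsers treat it as a string; whether quoting is required is not stated in this reference.

```yaml
resources:
  cpu:
    # Request ARM CPUs with 4 or more cores (example values)
    arch: arm
    count: "4.."
```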

resources.gpu

vendor - (Optional) "amd" | "google" | "intel" | "nvidia" | "tenstorrent" The vendor of the GPU/accelerator, one of: nvidia, amd, google (alias: tpu), intel, tenstorrent.
name - (Optional) str | list[str] The name of the GPU (e.g., A100 or H100).
count - (Optional) int | str The number of GPUs. Defaults to 1.. (a range meaning 1 or more).
memory - (Optional) int | str The memory size of each GPU (e.g., 16GB). Can be set to a range (e.g., 16GB.. or 16GB..80GB).
total_memory - (Optional) int | str The total memory size across all GPUs (e.g., 32GB). Can be set to a range (e.g., 16GB.. or 16GB..80GB).
compute_capability - (Optional) float | str The minimum compute capability of the GPU (e.g., 7.5).
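
The GPU properties above can be combined in object form, as in this sketch. The GPU names, count, memory range, and compute capability are example values, and the range string is quoted as a precaution.

```yaml
resources:
  gpu:
    vendor: nvidia
    # Accept either of these GPU models
    name: [A100, H100]
    count: 2
    # Open-ended range: 40GB or more
    memory: "40GB.."
    # Minimum compute capability
    compute_capability: 8.0
```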

resources.disk

size - (Required) int | str Disk size.

retry

on_events - (Optional) list["no-capacity" | "interruption" | "error"] The list of events that should be handled with retry. Supported events are no-capacity, interruption, error. Omit to retry on all events.
duration - (Optional) int | str The maximum period of retrying the run, e.g., 4h or 1d. For no-capacity events, the period is counted as the run age; for interruption and error events, it is counted from the last interruption or error.
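
A retry policy in object form might look like the following sketch, which retries only on the listed events for up to four hours. The chosen events and duration are example values.

```yaml
retry:
  # Retry only when there is no capacity or the instance is interrupted
  on_events: [no-capacity, interruption]
  # Stop retrying after this period
  duration: 4h
```

Since retry also accepts a bool, the short form `retry: false` (the default) is valid as well.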