fleet

The fleet configuration type allows creating and updating fleets.

Configuration files must be inside the project repo, and their names must end with .dstack.yml (e.g. .dstack.yml or fleet.dstack.yml are both acceptable). Any configuration can be run via dstack apply.

Examples

Cloud

type: fleet
# The name is optional; if not specified, it is generated randomly
name: my-fleet

# The number of instances
nodes: 4
# Ensure the instances are interconnected
placement: cluster

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu:
    # 24GB or more vRAM
    memory: 24GB..
    # One or more GPUs
    count: 1..

SSH

type: fleet
# The name is optional; if not specified, it is generated randomly
name: my-ssh-fleet

# Ensure instances are interconnected
placement: cluster

# The user, private SSH key, and hostnames of the on-prem servers
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 3.255.177.51
    - 3.255.177.52

Root reference

name - (Optional) The fleet name.

env - (Optional) The mapping or the list of environment variables.

ssh_config - (Optional) The parameters for adding instances via SSH.

nodes - (Optional) The number of instances.

placement - (Optional) The placement of instances: any or cluster.

resources - (Optional) The resource requirements.

backends - (Optional) The backends to consider for provisioning (e.g., [aws, gcp]).

regions - (Optional) The regions to consider for provisioning (e.g., [eu-west-1, us-west4, westeurope]).

instance_types - (Optional) The cloud-specific instance types to consider for provisioning (e.g., [p3.8xlarge, n1-standard-4]).

spot_policy - (Optional) The policy for provisioning spot or on-demand instances: spot, on-demand, or auto.

retry - (Optional) The policy for provisioning retry. Defaults to false.

max_price - (Optional) The maximum instance price per hour, in dollars.

termination_policy - (Optional) The policy for instance termination. Defaults to destroy-after-idle.

termination_idle_time - (Optional) Time to wait before destroying idle instances. Defaults to 3d.
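
As a rough illustration of how the root properties above combine, here is a sketch of a cloud fleet that restricts provisioning and caps the price. The fleet name, the price, the idle time, and the environment variable are hypothetical values chosen for the example. It also assumes that termination_idle_time accepts the same duration format as its 3d default.

```yaml
type: fleet
name: my-constrained-fleet   # hypothetical name

nodes: 2

# Only consider these backends and regions when provisioning
backends: [aws, gcp]
regions: [eu-west-1, us-west4]

# One of: spot, on-demand, auto
spot_policy: auto

# Maximum instance price per hour, in dollars (hypothetical value)
max_price: 2.5

# Destroy instances once they have been idle for a day
termination_policy: destroy-after-idle
termination_idle_time: 1d    # assumed to use the same format as the 3d default

# Environment variables given as a mapping (hypothetical variable)
env:
  LOG_LEVEL: debug
```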

ssh_config

user - (Optional) The user to log in with on all hosts.

port - (Optional) The SSH port to connect to.

identity_file - (Optional) The private key to use for all hosts.

hosts - The per-host connection parameters: each item is either a hostname or an object that overrides the default SSH parameters.

network - (Optional) The network address for cluster setup in the format <ip>/<netmask>.

ssh_config.hosts[n]

hostname - The IP address or domain to connect to.

port - (Optional) The SSH port to connect to for this host.

user - (Optional) The user to log in with for this host.

identity_file - (Optional) The private key to use for this host.
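
The sketch below shows one way the shared ssh_config defaults and the per-host overrides might be combined. The second host is written as an object so it can use its own port, user, and key. The alternate port, user name, key path, and network value are hypothetical.

```yaml
type: fleet
name: my-ssh-fleet

placement: cluster

ssh_config:
  # Defaults applied to every host unless overridden
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  # Network address for cluster setup, in <ip>/<netmask> form (hypothetical value)
  network: 10.0.0.0/24
  hosts:
    # A plain hostname uses the defaults above
    - 3.255.177.51
    # An object overrides the defaults for this host only
    - hostname: 3.255.177.52
      port: 2222                         # hypothetical port
      user: admin                        # hypothetical user
      identity_file: ~/.ssh/other_key    # hypothetical key path
```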

resources

cpu - (Optional) The number of CPU cores. Defaults to 2.. (2 or more).

memory - (Optional) The RAM size (e.g., 8GB). Defaults to 8GB.. (8GB or more).

shm_size - (Optional) The size of shared memory (e.g., 8GB). If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure this.

gpu - (Optional) The GPU requirements. Can be set to a number, a string (e.g. A100, 80GB:2, etc.), or an object.

disk - (Optional) The disk resources.
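
For illustration, a resources block that sets each of the properties above could look like the following sketch. The sizes are hypothetical. The gpu value uses the string shorthand 80GB:2 mentioned above, and the cpu and memory values use the same open-ended range syntax as their defaults.

```yaml
type: fleet
name: my-gpu-fleet   # hypothetical name

nodes: 2

resources:
  # 8 or more CPU cores
  cpu: 8..
  # 32GB or more RAM
  memory: 32GB..
  # Shared memory, e.g. for PyTorch dataloaders
  shm_size: 16GB
  # String shorthand for the GPU requirements
  gpu: 80GB:2
  disk:
    # 200GB or more disk space
    size: 200GB..
```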

resources.gpu

vendor - (Optional) The vendor of the GPU/accelerator, one of: nvidia, amd, google (alias: tpu).

name - (Optional) The GPU name or list of names.

count - (Optional) The number of GPUs. Defaults to 1.

memory - (Optional) The GPU memory (vRAM) size (e.g., 16GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).

total_memory - (Optional) The total GPU memory (vRAM) size (e.g., 32GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).

compute_capability - (Optional) The minimum compute capability of the GPU (e.g., 7.5).
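
When the string shorthand is not specific enough, gpu can be written as an object. The sketch below combines several of the properties above. The GPU names and ranges are hypothetical, and the count range assumes the same .. syntax shown in the Cloud example.

```yaml
type: fleet
name: my-gpu-fleet   # hypothetical name

nodes: 1

resources:
  gpu:
    vendor: nvidia
    # Accept either of these GPU models
    name: [A100, H100]
    # Between 40GB and 80GB of GPU memory
    memory: 40GB..80GB
    # Between 2 and 4 GPUs
    count: 2..4
    # Minimum compute capability
    compute_capability: 7.5
```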

resources.disk

size - The disk size. Can be a string (e.g., 100GB or 100GB..) or an object.

retry

on_events - The list of events that should be handled with retry. Supported events are no-capacity, interruption, and error.

duration - (Optional) The maximum period of retrying the run, e.g., 4h or 1d.
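
As a sketch of the object form of retry, the configuration below keeps retrying provisioning for up to four hours when no capacity is available or an instance is interrupted. The fleet name and the choice of events and duration are hypothetical.

```yaml
type: fleet
name: my-fleet

nodes: 4

retry:
  # Events that trigger a retry
  on_events: [no-capacity, interruption]
  # Stop retrying after this period
  duration: 4h
```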