
Clusters

A cluster is a fleet with its placement set to cluster. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication—crucial for tasks such as efficient distributed training.
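For illustration, a minimal fleet configuration with cluster placement might look like the sketch below (the fleet name and node count are placeholders):

```yaml
type: fleet
name: my-cluster-fleet

# Ensures the instances within the fleet are interconnected
placement: cluster

nodes: 2
```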

Fleets

Ensure a fleet is created before you run any distributed task. This can be either an SSH fleet or a cloud fleet.

SSH fleets

SSH fleets let you create a fleet out of existing bare-metal servers or VMs, e.g. machines that are already pre-provisioned or set up on-premises.

For SSH fleets, fast interconnect is supported provided that the hosts are pre-configured with the appropriate interconnect drivers.
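For illustration, here is a hedged sketch of an SSH fleet configuration with cluster placement, assuming the hosts share a low-latency network; the user, identity file, and host IPs are placeholders:

```yaml
type: fleet
name: my-ssh-fleet

placement: cluster

# SSH credentials and host IPs are placeholders
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 10.0.0.1
    - 10.0.0.2
```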

Cloud fleets

Cloud fleets let you provision interconnected clusters across supported backends. For cloud fleets, fast interconnect is currently supported only on the aws, gcp, and nebius backends.

When you create a cloud fleet with AWS, Elastic Fabric Adapter networking is automatically configured if it’s supported for the corresponding instance type.
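As a hedged sketch, a cloud fleet targeting AWS with an EFA-capable instance type might look like this (the fleet name, GPU spec, and node count are placeholders):

```yaml
type: fleet
name: my-efa-fleet

placement: cluster
nodes: 2

# Limit provisioning to AWS; EFA is configured automatically
# if the selected instance type supports it
backends: [aws]
resources:
  gpu: H100:8
```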

Backend configuration

Note: EFA requires public_ips to be set to false in the aws backend configuration. Refer to the EFA example for more details.
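A hedged sketch of the corresponding aws backend entry in the dstack server configuration (typically ~/.dstack/server/config.yml); the project name is a placeholder:

```yaml
projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default
        # Required for EFA-enabled cluster fleets
        public_ips: false
```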

When you create a cloud fleet with GCP, GPUDirect-TCPXO and GPUDirect-TCPX networking is automatically configured for the A3 Mega and A3 High instance types, respectively.

Backend configuration

Note: GPUDirect-TCPXO and GPUDirect-TCPX require extra_vpcs to be configured in the gcp backend configuration. Refer to the A3 Mega and A3 High examples for more details.
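A hedged sketch of a gcp backend entry with extra_vpcs; the project ID and VPC names are placeholders, and the number of extra VPCs required depends on the instance type (see the linked examples):

```yaml
projects:
  - name: main
    backends:
      - type: gcp
        project_id: my-project   # placeholder
        creds:
          type: default
        # VPC names are placeholders; add one entry per extra VPC
        extra_vpcs:
          - my-extra-vpc-1
          - my-extra-vpc-2
```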

When you create a cloud fleet with Nebius, InfiniBand networking is automatically configured if it’s supported for the corresponding instance type.

To request fast interconnect support for other backends, file an issue.

NCCL/RCCL tests

To verify the interconnect of a newly created fleet, run NCCL (for NVIDIA) or RCCL (for AMD) tests.

Distributed tasks

A distributed task is a task with nodes set to a value greater than 1. In this case, dstack first ensures a suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up, dstack starts the rest of the nodes and runs the task container on each of them.

Within the task's commands, it's possible to use DSTACK_MASTER_NODE_IP, DSTACK_NODES_IPS, DSTACK_NODE_RANK, and other system environment variables for inter-node communication.
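For illustration, here is a hedged sketch of a distributed task that launches PyTorch via torchrun using these variables; train.py, the GPU spec, the node count, and the port are placeholders:

```yaml
type: task
name: train-distrib

# Two or more nodes make the task distributed
nodes: 2

python: "3.12"
commands:
  - pip install torch
  # train.py is a placeholder for your own training script
  - torchrun
    --nnodes=2
    --nproc_per_node=8
    --node_rank=$DSTACK_NODE_RANK
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=29500
    train.py

resources:
  gpu: H100:8
```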

Refer to distributed tasks for an example.

Retry policy

By default, if any node fails, dstack terminates the entire run. To restart the run instead, configure a retry policy.
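A hedged sketch of a retry block added to the task configuration, assuming the on_events and duration fields of recent dstack versions:

```yaml
retry:
  # Restart the run if a node fails, retrying for up to one hour
  on_events: [error]
  duration: 1h
```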

Volumes

Network volumes

Currently, no backend supports multi-attach network volumes for distributed tasks. However, single-attach volumes can be used by leveraging volume name interpolation syntax. This approach mounts a separate single-attach volume to each node.
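For example, a hedged sketch of per-node network volumes, assuming volumes named my-volume-0 and my-volume-1 already exist and that ${{ dstack.node_rank }} is available for interpolation:

```yaml
volumes:
  # Each node mounts its own single-attach volume,
  # e.g. my-volume-0 on node 0 and my-volume-1 on node 1
  - name: my-volume-${{ dstack.node_rank }}
    path: /checkpoints
```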

Instance volumes

Instance volumes enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.

Instance volumes can be used to mount:

  • Regular folders (data persists only while the fleet exists)
  • Folders that are mount points of shared filesystems (e.g., a manually mounted NFS share)
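For example, a hedged sketch of an instance volume that mounts a host directory into the container; /mnt/shared (e.g., an NFS mount point) and /data are placeholders:

```yaml
volumes:
  # Mounts the host folder /mnt/shared into the container at /data
  - /mnt/shared:/data
```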

Refer to instance volumes for an example.

What's next?

  1. Read about distributed tasks, fleets, and volumes
  2. Browse the Clusters examples