
Clusters

A cluster is a fleet with its placement set to cluster. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication—crucial for tasks such as efficient distributed training.
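For illustration, a minimal fleet configuration with cluster placement might look like the sketch below (the fleet name and node count are placeholders):

```yaml
type: fleet
name: my-cluster-fleet

# Ensures the instances within the fleet are interconnected
placement: cluster

nodes: 2
```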

Fleets

Ensure a fleet is created before you run any distributed task. This can be either an SSH fleet or a cloud fleet.

SSH fleets

SSH fleets let you create a fleet out of existing bare-metal servers or VMs, e.g. machines that are already pre-provisioned or set up on-premises.

For SSH fleets, fast interconnect is supported provided that the hosts are pre-configured with the appropriate interconnect drivers.
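For illustration, here is a hedged sketch of an SSH fleet configuration with cluster placement, assuming the hosts share a low-latency network; the user, identity file, and host IPs are placeholders:

```yaml
type: fleet
name: my-ssh-fleet

placement: cluster

# SSH credentials and host IPs are placeholders
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 10.0.0.1
    - 10.0.0.2
```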

Cloud fleets

Cloud fleets let you provision interconnected clusters across supported backends. For cloud fleets, fast interconnect is currently supported only on the aws, gcp, and nebius backends.

When you create a cloud fleet with AWS, Elastic Fabric Adapter networking is automatically configured if it’s supported for the corresponding instance type.
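As a hedged sketch, a cloud fleet targeting AWS with an EFA-capable instance type might look like this (the fleet name, GPU spec, and node count are placeholders):

```yaml
type: fleet
name: my-efa-fleet

placement: cluster
nodes: 2

# Limit provisioning to AWS; EFA is configured automatically
# if the selected instance type supports it
backends: [aws]
resources:
  gpu: H100:8
```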

Backend configuration

Note: EFA requires public_ips to be set to false in the aws backend configuration. Refer to the EFA example for more details.
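A hedged sketch of the corresponding aws backend entry in the dstack server configuration (typically ~/.dstack/server/config.yml); the project name is a placeholder:

```yaml
projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default
        # Required for EFA-enabled cluster fleets
        public_ips: false
```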

When you create a cloud fleet with GCP, GPUDirect-TCPXO and GPUDirect-TCPX networking is automatically configured for the A3 Mega and A3 High instance types, respectively.

Backend configuration

Note: GPUDirect-TCPXO and GPUDirect-TCPX require extra_vpcs to be configured in the gcp backend configuration. Refer to the A3 Mega and A3 High examples for more details.
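A hedged sketch of a gcp backend entry with extra_vpcs; the project ID and VPC names are placeholders, and the number of extra VPCs required depends on the instance type (see the linked examples):

```yaml
projects:
  - name: main
    backends:
      - type: gcp
        project_id: my-project   # placeholder
        creds:
          type: default
        # VPC names are placeholders; add one entry per extra VPC
        extra_vpcs:
          - my-extra-vpc-1
          - my-extra-vpc-2
```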

When you create a cloud fleet with Nebius, InfiniBand networking is automatically configured if it’s supported for the corresponding instance type.

To request fast interconnect support for other backends, file an issue.

NCCL/RCCL tests

To verify the interconnect of a newly created fleet, run NCCL (for NVIDIA) or RCCL (for AMD) tests.

Distributed tasks

A distributed task is a task with nodes set to a value greater than 1. In this case, dstack first ensures a suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up, dstack starts the rest of the nodes and runs the task container on each of them.

Within the task's commands, it's possible to use DSTACK_MASTER_NODE_IP, DSTACK_NODES_IPS, DSTACK_NODE_RANK, and other system environment variables for inter-node communication.
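For illustration, here is a hedged sketch of a distributed task that launches PyTorch via torchrun using these variables; train.py, the GPU spec, the node count, and the port are placeholders:

```yaml
type: task
name: train-distrib

# Two or more nodes make the task distributed
nodes: 2

python: "3.12"
commands:
  - pip install torch
  # train.py is a placeholder for your own training script
  - torchrun
    --nnodes=2
    --nproc_per_node=8
    --node_rank=$DSTACK_NODE_RANK
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=29500
    train.py

resources:
  gpu: H100:8
```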

Refer to distributed tasks for an example.

Retry policy

By default, if any node fails, dstack terminates the entire run. To restart the run instead, configure a retry policy.
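A hedged sketch of a retry block added to the task configuration, assuming the on_events and duration fields of recent dstack versions:

```yaml
retry:
  # Restart the run if a node fails, retrying for up to one hour
  on_events: [error]
  duration: 1h
```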

Volumes

Network volumes

Currently, no backend supports multi-attach network volumes for distributed tasks. However, single-attach volumes can be used by leveraging volume name interpolation syntax. This approach mounts a separate single-attach volume to each node.
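For example, a hedged sketch of per-node network volumes, assuming volumes named my-volume-0 and my-volume-1 already exist and that ${{ dstack.node_rank }} is available for interpolation:

```yaml
volumes:
  # Each node mounts its own single-attach volume,
  # e.g. my-volume-0 on node 0 and my-volume-1 on node 1
  - name: my-volume-${{ dstack.node_rank }}
    path: /checkpoints
```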

Instance volumes

Instance volumes enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.

Instance volumes can be used to mount:

  • Regular folders (data persists only while the fleet exists)
  • Folders that are mount points of shared filesystems (e.g., a manually mounted NFS share)
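For example, a hedged sketch of an instance volume that mounts a host directory into the container; /mnt/shared (e.g., an NFS mount point) and /data are placeholders:

```yaml
volumes:
  # Mounts the host folder /mnt/shared into the container at /data
  - /mnt/shared:/data
```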

Refer to instance volumes for an example.

What's next?

  1. Read about distributed tasks, fleets, and volumes
  2. Browse the Clusters examples