Clusters¶
A cluster is a fleet with `placement` set to `cluster`. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication, which is crucial for tasks such as efficient distributed training.
Fleets¶
Ensure a fleet is created before you run any distributed task. This can be either an SSH fleet or a cloud fleet.
SSH fleets¶
SSH fleets can be used to create a fleet out of existing bare-metal servers or VMs, e.g. if they are already pre-provisioned or set up on-premises.
For SSH fleets, fast interconnect is supported provided that the hosts are pre-configured with the appropriate interconnect drivers.
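As a minimal sketch, an SSH fleet configuration with `placement` set to `cluster` might look like this (the fleet name, user, identity file path, and host IPs are placeholders):

```yaml
type: fleet
# The fleet name is a placeholder
name: my-ssh-fleet

# Ensure the hosts are treated as an interconnected cluster
placement: cluster

ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 3.255.177.51
    - 3.255.177.52
```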
Cloud fleets¶
Cloud fleets allow you to provision interconnected clusters across supported backends.
For cloud fleets, fast interconnect is currently supported only on the `aws`, `gcp`, and `nebius` backends.
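For illustration, a cloud fleet configuration for an interconnected cluster might look like this (the fleet name, backend, and GPU resources are placeholders):

```yaml
type: fleet
# The fleet name is a placeholder
name: my-gpu-cluster

# Number of cluster nodes to provision
nodes: 2
# Ensure the instances are interconnected
placement: cluster

backends: [aws]

resources:
  gpu: H100:8
```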
When you create a cloud fleet with AWS, Elastic Fabric Adapter networking is automatically configured if it’s supported for the corresponding instance type.
Backend configuration
Note that EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
Refer to the EFA example for more details.
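As a sketch, the relevant part of the `aws` backend configuration in the server config might look like this (assuming default credentials; the project name is a placeholder):

```yaml
projects:
- name: main
  backends:
  - type: aws
    creds:
      type: default
    # Required for EFA networking
    public_ips: false
```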
When you create a cloud fleet with GCP, GPUDirect-TCPXO and GPUDirect-TCPX networking are automatically configured for the A3 Mega and A3 High instance types, respectively.
When you create a cloud fleet with Nebius, InfiniBand networking is automatically configured if it’s supported for the corresponding instance type.
To request fast interconnect support for other backends, file an issue.
NCCL/RCCL tests¶
To verify the interconnect of a newly created fleet, run NCCL (for NVIDIA) or RCCL (for AMD) tests.
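As a hedged sketch, a task running the NCCL `all_reduce_perf` benchmark across two nodes might look like the following. It assumes an image with OpenMPI and pre-built `nccl-tests` binaries under `/opt/nccl-tests/build` (both the image contents and the path are assumptions, not guaranteed by any default image):

```yaml
type: task
name: nccl-tests

nodes: 2

commands:
  # Assumes nccl-tests is pre-built at this path in the image
  - mpirun --hostfile $DSTACK_MPI_HOSTFILE -n $DSTACK_GPUS_NUM
      /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

resources:
  gpu: nvidia:8
```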
Distributed tasks¶
A distributed task is a task with `nodes` set to a value greater than `1`. In this case, `dstack` first ensures a suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up, `dstack` starts the rest of the nodes and runs the task container on each of them.
Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other system environment variables for inter-node communication.
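For example, a PyTorch task might pass these variables to `torchrun` (the script name, port, and resources are placeholders; `DSTACK_NODES_NUM` and `DSTACK_GPUS_PER_NODE` are assumed alongside the variables named above):

```yaml
type: task
name: train-distrib

nodes: 2

commands:
  # torchrun rendezvous is driven by dstack's system environment variables
  - torchrun
      --nnodes=$DSTACK_NODES_NUM
      --nproc-per-node=$DSTACK_GPUS_PER_NODE
      --node-rank=$DSTACK_NODE_RANK
      --master-addr=$DSTACK_MASTER_NODE_IP
      --master-port=29500
      train.py

resources:
  gpu: 24GB
```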
Refer to distributed tasks for an example.
Retry policy
By default, if any node fails, `dstack` terminates the entire run. To restart the run automatically instead, configure a retry policy.
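As a sketch, a retry policy in the task configuration might look like this (the event list and duration are illustrative):

```yaml
retry:
  # Retry the run on failure for up to one hour
  on_events: [error]
  duration: 1h
```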
Volumes¶
Network volumes¶
Currently, no backend supports multi-attach network volumes for distributed tasks. However, single-attach volumes can be used by leveraging volume name interpolation syntax. This approach mounts a separate single-attach volume to each node.
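For illustration, with name interpolation each node mounts its own single-attach volume (the volume name prefix and mount path are placeholders):

```yaml
volumes:
  # Each node attaches the volume matching its rank,
  # e.g. my-volume-0, my-volume-1, ...
  - name: my-volume-${{ dstack.node_rank }}
    path: /volume_data
```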
Instance volumes¶
Instance volumes enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.
Instance volumes can be used to mount:
- Regular folders (data persists only while the fleet exists)
- Folders that are mounts of shared filesystems (e.g., manually mounted shared filesystems).
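As a minimal sketch, an instance volume maps a host path to a container path (both paths below are placeholders; `/mnt/nfs/data` assumes a shared filesystem already mounted on each host):

```yaml
volumes:
  # host_path:container_path
  - /mnt/nfs/data:/data
```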
Refer to instance volumes for an example.
What's next?
- Read about distributed tasks, fleets, and volumes
- Browse the Clusters examples