Supporting MPI and NCCL/RCCL tests¶
As AI models grow in complexity, efficient orchestration tools become increasingly important.
Fleets introduced by dstack last year streamline
task execution on both cloud and
on-prem clusters, whether it's pre-training, fine-tuning, or batch processing.
The strength of dstack lies in its flexibility. Users can leverage distributed framework like
torchrun, accelerate, or others. dstack handles node provisioning, job execution, and automatically propagates
system environment variables—such as DSTACK_NODE_RANK, DSTACK_MASTER_NODE_IP,
DSTACK_GPUS_PER_NODE and others—to containers.

One use case dstack hasn’t supported until now is MPI, as it requires a scheduled environment or
direct SSH connections between containers. Since mpirun is essential for running NCCL/RCCL tests—crucial for large-scale
cluster usage—we’ve added support for it.
Below is an example of a task that runs AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total).
type: task
name: nccl-tests
nodes: 2
image: dstackai/efa
env:
- NCCL_DEBUG=INFO
commands:
- |
# We use FIFO for inter-node communication
FIFO=/tmp/dstack_job
if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
cd /root/nccl-tests/build
# Generate hostfile for mpirun
: > hostfile
for ip in ${DSTACK_NODES_IPS}; do
echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
done
MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
# Wait for other nodes
while true; do
if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
break
fi
echo 'Waiting for nodes...'
sleep 5
done
# Run NCCL tests
${MPIRUN} \
-n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
--mca pml ^cm \
--mca btl tcp,self \
--mca btl_tcp_if_exclude lo,docker0 \
--bind-to none \
./all_reduce_perf -b 8 -e 8G -f 2 -g 1
# Notify nodes the job is done
${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
else
mkfifo ${FIFO}
# Wait for a message from the first node
cat ${FIFO}
fi
resources:
gpu: nvidia:4:16GB
shm_size: 16GB
The master node (DSTACK_NODE_RANK=0) generates a hostfile listing all node IPs and waits until all nodes are
reachable via MPI. Once confirmed, it launches the /root/nccl-tests/build/all_reduce_perf benchmark across all available GPUs in the cluster.
Non-master nodes remain blocked until they receive a termination signal from the master node via a FIFO pipe.
With this, now you can use such a task to run both NCCL or RCCL tests on both cloud and SSH fleets, as well as use MPI for other tasks.
The
dstackai/efaimage used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for AWS EFA , it can also be used with regular TCP/IP network adapters and InfiniBand. See the source code for the image.
What's next?
- Learn more about dev environments, tasks, services, and fleets
- Check the NCCL tests example
- Join Discord