# RCCL tests

This example shows how to run distributed RCCL tests with MPI using `dstack`.
## Running as a task

Here's an example of a task that runs the AllReduce test on 2 nodes, each with 8 MI300X GPUs (16 processes in total).
```yaml
type: task
name: rccl-tests
nodes: 2

# Mount the system libraries folder from the host
volumes:
  - /usr/local/lib:/mnt/lib

image: rocm/dev-ubuntu-22.04:6.4-complete

env:
  - NCCL_DEBUG=INFO
  - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi

commands:
  # Set up MPI and build RCCL tests
  - apt-get install -y git libopenmpi-dev openmpi-bin
  - git clone https://github.com/ROCm/rccl-tests.git
  - cd rccl-tests
  - make MPI=1 MPI_HOME=${OPEN_MPI_HOME}
  # Preload the RoCE driver library from the host (for Broadcom driver compatibility)
  - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
  # Run RCCL tests via MPI
  - |
    FIFO=/tmp/${DSTACK_RUN_NAME}
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      sleep 10
      echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile
      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
      # Wait for the worker nodes to become reachable
      while true; do
        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
          break
        fi
        echo 'Waiting for worker nodes...'
        sleep 5
      done
      # Run the RCCL tests
      ${MPIRUN} \
        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_include ens41np0 \
        -x LD_PRELOAD \
        -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
        -x NCCL_IB_GID_INDEX=3 \
        -x NCCL_IB_DISABLE=0 \
        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0
      # Notify the worker nodes that the MPI run is finished
      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
    else
      mkfifo ${FIFO}
      # Wait for a message from the master node
      cat ${FIFO}
    fi

resources:
  gpu: MI300X:8
```
**MPI**

RCCL tests rely on MPI to launch and coordinate processes across nodes. The master node (`DSTACK_NODE_RANK=0`) generates a `hostfile` from `DSTACK_NODES_IPS` and waits until all worker nodes are reachable via MPI. It then runs `./build/all_reduce_perf` across all GPUs. Worker nodes block on a FIFO pipe until the master signals that the MPI run is finished.
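For illustration, the rendezvous boils down to the following (a minimal sketch; the IP addresses are hypothetical, and `true` is used only as a no-op probe):

```shell
$ echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile   # one IP per line
$ cat hostfile
10.0.0.1
10.0.0.2
# `true` is launched once per host; mpirun fails until every host is reachable
$ mpirun --allow-run-as-root --hostfile hostfile -n 2 -N 1 true && echo ready
```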
There is an open issue to simplify the use of MPI with distributed tasks.
**RoCE library**

Broadcom RoCE drivers require the `libbnxt_re` userspace library inside the container to be compatible with the host's Broadcom kernel driver, `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it via `LD_PRELOAD` when running MPI.
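If you need to verify the setup manually, you can check that the Broadcom devices are visible through the preloaded library (a sketch; it assumes the `ibverbs-utils` package, which the task above doesn't install, is available in the container):

```shell
$ apt-get install -y ibverbs-utils
$ export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
$ ibv_devices              # should list bnxt_re0 ... bnxt_re7
$ ibv_devinfo -d bnxt_re0  # port state, GIDs, firmware version
```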
## Creating a fleet
Define an SSH fleet configuration by listing the IP addresses of each node in the cluster, along with the SSH user and SSH key configured for each host.
```yaml
type: fleet
# The name is optional; if not specified, it's generated randomly
name: mi300x-fleet

# SSH credentials for the on-prem servers
ssh_config:
  user: root
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 144.202.58.28
    - 137.220.58.52
```
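Before running the task, provision the fleet by applying its configuration with `dstack apply` (the filename below is illustrative):

```shell
$ dstack apply -f fleet.dstack.yml
```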
## Apply a configuration

To run a configuration, use the `dstack apply` command.
```shell
$ dstack apply -f examples/distributed-training/rccl-tests/.dstack.yml

 #  BACKEND       RESOURCES                      INSTANCE TYPE  PRICE
 1  ssh (remote)  cpu=256 mem=2268GB disk=752GB  instance       $0 idle
                  MI300X:192GB:8
 2  ssh (remote)  cpu=256 mem=2268GB disk=752GB  instance       $0 idle
                  MI300X:192GB:8

Submit the run rccl-tests? [y/n]: y
```
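After the run is submitted, `dstack` streams its output to your terminal. If you detach, you can fetch the output again by run name (using the name set in the task configuration):

```shell
$ dstack logs rccl-tests
```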
## Source code

The source code for this example can be found in `examples/distributed-training/rccl-tests`.
## What's next?
- Check dev environments, tasks, services, and fleets.