Crusoe¶
dstack allows using Crusoe clusters with fast interconnect in two ways:
- VMs – If you configure a `crusoe` backend in `dstack` by providing your Crusoe credentials, `dstack` lets you fully provision and use clusters through `dstack`.
- Kubernetes – If you create a Kubernetes cluster on Crusoe, configure a `kubernetes` backend, and create a backend fleet in `dstack`, `dstack` lets you fully use this cluster through `dstack`.
VMs¶
dstack offers a VM-based backend that natively integrates with Crusoe, so you only need to provide your Crusoe credentials, and dstack can fully provision and use clusters on Crusoe.
Configure a backend¶
Log into your Crusoe console, create an API key under your account settings, and note your project ID.
```yaml
projects:
  - name: main
    backends:
      - type: crusoe
        project_id: your-project-id
        creds:
          type: access_key
          access_key: your-access-key
          secret_key: your-secret-key
```
Create a fleet¶
Once the backend is configured, you can create a fleet:
```yaml
type: fleet
name: crusoe-fleet
nodes: 2
placement: cluster
backends: [crusoe]

resources:
  gpu: A100:80GB:8
```
Pass the fleet configuration to `dstack apply`:

```shell
$ dstack apply -f crusoe-fleet.dstack.yml
```
This will automatically create an IB partition and provision instances with InfiniBand networking.
Once the fleet is created, you can run dev environments, tasks, and services.
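For example, a minimal dev environment configuration that could target such a fleet might look like this (a sketch; the `name` and `ide` values are illustrative):

```yaml
type: dev-environment
name: vscode
ide: vscode

resources:
  gpu: A100:80GB:8
```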
If you want instances to be provisioned on demand, you can set `nodes` to `0..2`. In this case, `dstack` will create instances only when you run workloads.
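For instance, an elastic variant of the fleet configuration could look like this (a sketch; only the node range differs from a fixed-size fleet):

```yaml
type: fleet
name: crusoe-fleet
# Provision up to two nodes on demand, as workloads require them
nodes: 0..2
placement: cluster
backends: [crusoe]

resources:
  gpu: A100:80GB:8
```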
Kubernetes¶
Create a cluster¶
- Go to `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
- Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
- Go to the cluster and click `Create Node Pool`. Select the right instance type and `Desired Number of Nodes`.
- Wait until the nodes are provisioned.
Even if you enable autoscaling, `dstack` can use only the nodes that are already provisioned.
Configure the backend¶
Follow the standard instructions for setting up a `kubernetes` backend:
```yaml
projects:
  - name: main
    backends:
      - type: kubernetes
        kubeconfig:
          filename: <kubeconfig path>
        proxy_jump:
          port: 30022
```
Create a fleet¶
Once the Crusoe Managed Kubernetes cluster and the dstack server are running, you can create a fleet:
```yaml
type: fleet
name: crusoe-fleet
placement: cluster
nodes: 0..
backends: [kubernetes]

resources:
  # Specify requirements to filter nodes
  gpu: 8
```
Pass the fleet configuration to `dstack apply`:

```shell
$ dstack apply -f crusoe-fleet.dstack.yml
```
Once the fleet is created, you can run dev environments, tasks, and services.
NCCL tests¶
Use a distributed task that runs NCCL tests to validate cluster network bandwidth.
With the Crusoe backend, HPC-X and NCCL topology files are pre-installed on the host VM image. Mount them into the container via instance volumes.
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done

volumes:
  - /opt/hpcx:/opt/hpcx
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo

commands:
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_HCA=^mlx5_0:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi

backends: [crusoe]

resources:
  gpu: A100:80GB:8
  shm_size: 16GB
```
Update `NCCL_TOPO_FILE` to match your instance type. Topology files for all supported types are available at `/etc/crusoe/nccl_topo/` on the host.
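To sanity-check the results, you can pull the peak bus bandwidth out of the `all_reduce_perf` log with a small helper. This is a sketch: it assumes the standard nccl-tests column layout, where the out-of-place `busbw` value is the eighth column of each data row.

```python
def max_busbw(output: str) -> float:
    """Return the peak out-of-place bus bandwidth (GB/s) reported
    by all_reduce_perf. Assumes the standard nccl-tests row layout:
    size, count, type, redop, root, time, algbw, busbw, ...
    """
    best = 0.0
    for line in output.splitlines():
        # Skip header/comment lines and blanks
        if line.lstrip().startswith("#") or not line.strip():
            continue
        fields = line.split()
        # Data rows start with the message size in bytes
        if len(fields) >= 8 and fields[0].isdigit():
            try:
                best = max(best, float(fields[7]))
            except ValueError:
                continue
    return best

# Example log excerpt (values are illustrative, not a benchmark result)
sample = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong
   134217728       33554432     float     sum      -1   1463.3   91.73  171.99      0
   268435456       67108864     float     sum      -1   2876.5   93.32  174.98      0
"""
print(max_busbw(sample))  # 174.98
```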
If you're running on Crusoe Managed Kubernetes, make sure to install HPC-X and provide an up-to-date topology file.
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done

commands:
  # Install NCCL topology files
  - curl -sSL https://gist.github.com/un-def/48df8eea222fa9547ad4441986eb15af/archive/df51d56285c5396a0e82bb42f4f970e7bb0a9b65.tar.gz -o nccl_topo.tar.gz
  - mkdir -p /etc/crusoe/nccl_topo
  - tar -C /etc/crusoe/nccl_topo -xf nccl_topo.tar.gz --strip-components=1
  # Install and initialize HPC-X
  - curl -sSL https://content.mellanox.com/hpc/hpc-x/v2.21.3/hpcx-v2.21.3-gcc-doca_ofed-ubuntu22.04-cuda12-x86_64.tbz -o hpcx.tar.bz
  - mkdir -p /opt/hpcx
  - tar -C /opt/hpcx -xf hpcx.tar.bz --strip-components=1 --checkpoint=10000
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  # Run NCCL tests
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_AR_THRESHOLD=0 \
        -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
        -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
        -x NCCL_IB_QPS_PER_CONNECTION=2 \
        -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
        -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi

# Required for IB
privileged: true

resources:
  gpu: A100:8
  shm_size: 16GB
```
The task above downloads an A100 topology file from a Gist. The most reliable way to obtain the latest topology is to copy it from a Crusoe-provisioned VM (see VMs).
Privileged
When running on Crusoe Managed Kubernetes, set `privileged` to `true` to ensure access to InfiniBand.
Pass the configuration to `dstack apply`:

```shell
$ dstack apply -f crusoe-nccl-tests.dstack.yml
```
What's next¶
- Learn about dev environments, tasks, services
- Check out backends and fleets
- Check the docs on Crusoe's networking and Crusoe Managed Kubernetes