GCP¶
This example shows how to create and use clusters on GCP.
dstack supports the following instance types:
| Instance type | GPU | Maximum bandwidth | Fabric |
|---|---|---|---|
| A3 Edge | H100:8 | 0.8 Tbps | GPUDirect-TCPX |
| A3 High | H100:8 | 1 Tbps | GPUDirect-TCPX |
| A3 Mega | H100:8 | 1.8 Tbps | GPUDirect-TCPXO |
| A4 | B200:8 | 3.2 Tbps | RoCE |
Configure the backend¶
Although dstack hides most of the complexity, each instance type still requires some backend configuration:
A4 requires one VPC in extra_vpcs for inter-node traffic (a regular VPC with one subnet) and one VPC in roce_vpcs for GPU-to-GPU communication (a VPC with the RoCE profile and eight subnets).
projects:
- name: main
  backends:
  - type: gcp
    # Specify your GCP project ID
    project_id: <project id>
    extra_vpcs: [dstack-gvnic-net-1]
    roce_vpcs: [dstack-mrdma]
    # Specify the regions you intend to use
    regions: [us-west2]
    creds:
      type: default
Create extra and RoCE VPCs
See GCP's RoCE network setup guide for the commands to create the VPCs and firewall rules.
Ensure the VPCs allow internal traffic between nodes so that MPI and NCCL can function.
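For instance, an allow-internal rule for the extra VPC could look like the following sketch. The network name matches the configuration above, but the 10.0.0.0/8 source range is an assumption; substitute your actual subnet ranges:

# Sketch: allow all internal TCP/UDP/ICMP traffic between nodes in the extra VPC
# (the source range below is an assumption; use your actual subnet ranges)
gcloud compute firewall-rules create dstack-gvnic-net-1-internal \
  --network=dstack-gvnic-net-1 \
  --action=ALLOW \
  --rules=tcp:0-65535,udp:0-65535,icmp \
  --source-ranges=10.0.0.0/8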
A3 Mega requires at least 8 extra_vpcs for data NICs.
projects:
- name: main
  backends:
  - type: gcp
    # Specify your GCP project ID
    project_id: <project id>
    extra_vpcs:
      - dstack-gpu-data-net-1
      - dstack-gpu-data-net-2
      - dstack-gpu-data-net-3
      - dstack-gpu-data-net-4
      - dstack-gpu-data-net-5
      - dstack-gpu-data-net-6
      - dstack-gpu-data-net-7
      - dstack-gpu-data-net-8
    # Specify the regions you intend to use
    regions: [europe-west4]
    creds:
      type: default
Create extra VPCs
Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule:
# Specify the region where you intend to deploy the cluster
REGION="europe-west4"

for N in $(seq 1 8); do
  gcloud compute networks create dstack-gpu-data-net-$N \
    --subnet-mode=custom \
    --mtu=8244

  gcloud compute networks subnets create dstack-gpu-data-sub-$N \
    --network=dstack-gpu-data-net-$N \
    --region=$REGION \
    --range=192.168.$N.0/24

  gcloud compute firewall-rules create dstack-gpu-data-internal-$N \
    --network=dstack-gpu-data-net-$N \
    --action=ALLOW \
    --rules=tcp:0-65535,udp:0-65535,icmp \
    --source-ranges=192.168.0.0/16
done
A3 Edge/High require at least 4 extra_vpcs for data NICs and a vm_service_account authorized to pull GPUDirect Docker images.
projects:
- name: main
  backends:
  - type: gcp
    # Specify your GCP project ID
    project_id: <project id>
    extra_vpcs:
      - dstack-gpu-data-net-1
      - dstack-gpu-data-net-2
      - dstack-gpu-data-net-3
      - dstack-gpu-data-net-4
    # Specify the regions you intend to use
    regions: [europe-west4]
    # Specify the service account authorized to pull GPUDirect images
    vm_service_account: a3cluster-sa@<project id>.iam.gserviceaccount.com
    creds:
      type: default
Create extra VPCs
Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule:
# Specify the region where you intend to deploy the cluster
REGION="europe-west4"

for N in $(seq 1 4); do
  gcloud compute networks create dstack-gpu-data-net-$N \
    --subnet-mode=custom \
    --mtu=8244

  gcloud compute networks subnets create dstack-gpu-data-sub-$N \
    --network=dstack-gpu-data-net-$N \
    --region=$REGION \
    --range=192.168.$N.0/24

  gcloud compute firewall-rules create dstack-gpu-data-internal-$N \
    --network=dstack-gpu-data-net-$N \
    --action=ALLOW \
    --rules=tcp:0-65535,udp:0-65535,icmp \
    --source-ranges=192.168.0.0/16
done
Create a service account
Create a VM service account that allows VMs to access the pkg.dev registry:
PROJECT_ID=$(gcloud config get-value project)

gcloud iam service-accounts create a3cluster-sa \
  --display-name="Service Account for pulling GCR images"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:a3cluster-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
Default VPC
If you set a non-default vpc_name in the backend configuration, ensure it allows all inter-node traffic. This is required for MPI and NCCL. The default VPC already allows this.
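For example, a rule that opens all internal traffic within a custom VPC might look like the following sketch (my-custom-vpc and the source range are placeholders for your own values):

# Sketch: allow all traffic between instances inside a custom VPC
# (my-custom-vpc and the source range are placeholders)
gcloud compute firewall-rules create my-custom-vpc-allow-internal \
  --network=my-custom-vpc \
  --action=ALLOW \
  --rules=all \
  --source-ranges=10.128.0.0/9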
Create a fleet¶
Once you've configured the gcp backend, create the fleet configuration:
type: fleet
name: a4-fleet
placement: cluster
# Can be a range or a fixed number
nodes: 2
# Specify the zone where you have configured the RoCE VPC
availability_zones: [us-west2-c]
backends: [gcp]
# Uncomment to allow spot instances
#spot_policy: auto

resources:
  gpu: B200:8
Then apply it with dstack apply:
$ dstack apply -f examples/clusters/gcp/a4-fleet.dstack.yml

Provisioning...
---> 100%

 FLEET     INSTANCE  BACKEND         GPU                  PRICE    STATUS  CREATED
 a4-fleet  0         gcp (us-west2)  B200:180GB:8 (spot)  $51.552  idle    9 mins ago
           1         gcp (us-west2)  B200:180GB:8 (spot)  $51.552  idle    9 mins ago
type: fleet
name: a3mega-fleet
placement: cluster
# Can be a range or a fixed number
nodes: 2
instance_types:
  - a3-megagpu-8g
# Uncomment to allow spot instances
#spot_policy: auto
Pass the configuration to dstack apply:
$ dstack apply -f examples/clusters/gcp/a3mega-fleet.dstack.yml

 FLEET         INSTANCE  BACKEND             GPU          PRICE            STATUS  CREATED
 a3mega-fleet  1         gcp (europe-west4)  H100:80GB:8  $22.1525 (spot)  idle    9 mins ago
 a3mega-fleet  2         gcp (europe-west4)  H100:80GB:8  $64.2718         idle    9 mins ago

Create the fleet? [y/n]: y

Provisioning...
---> 100%
type: fleet
name: a3high-fleet
placement: cluster
nodes: 2
instance_types:
  - a3-highgpu-8g
# Uncomment to allow spot instances
#spot_policy: auto
Pass the configuration to dstack apply:
$ dstack apply -f examples/clusters/gcp/a3high-fleet.dstack.yml

 FLEET         INSTANCE  BACKEND             GPU          PRICE            STATUS  CREATED
 a3high-fleet  1         gcp (europe-west4)  H100:80GB:8  $20.5688 (spot)  idle    9 mins ago
 a3high-fleet  2         gcp (europe-west4)  H100:80GB:8  $58.5419         idle    9 mins ago

Create the fleet? [y/n]: y

Provisioning...
---> 100%
Once the fleet is created, you can run distributed tasks, in addition to dev environments, services, and regular tasks.
Run tasks¶
NCCL tests¶
Use a distributed task that runs NCCL tests to validate cluster network bandwidth.
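The overall shape of such a task is sketched below. This is a simplified illustration, not the actual example (which is linked under "Source code" below): the image is assumed to ship MPI and the nccl-tests binaries, passwordless SSH between nodes is assumed, and DSTACK_NODES_IPS is assumed to list one node IP per line.

type: task
name: nccl-tests
nodes: 2
# Assumed image; the real example may build nccl-tests itself
image: nvcr.io/nvidia/pytorch:24.04-py3
commands:
  - |
    if [ "$DSTACK_NODE_RANK" = "0" ]; then
      # Master node: build an MPI hostfile from the node IPs dstack injects
      echo "$DSTACK_NODES_IPS" | awk -v s=$DSTACK_GPUS_PER_NODE '{print $0" slots="s}' > hostfile
      mpirun --allow-run-as-root --hostfile hostfile -np $DSTACK_GPUS_NUM \
        all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      # Worker nodes stay up while the master drives MPI
      sleep infinity
    fi
resources:
  gpu: H100:8
  shm_size: 16GB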
Pass the configuration to dstack apply:
$ dstack apply -f examples/clusters/nccl-tests/.dstack.yml
Provisioning...
---> 100%
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

                                                     out-of-place                       in-place
        size         count      type   redop    root     time   algbw   busbw  wrong     time   algbw   busbw  wrong
         (B)    (elements)                               (us)  (GB/s)  (GB/s)           (us)  (GB/s)  (GB/s)
     8388608       2097152     float     sum      -1    156.9   53.47  100.25      0    167.6   50.06   93.86      0
    16777216       4194304     float     sum      -1    196.3   85.49  160.29      0    206.2   81.37  152.57      0
    33554432       8388608     float     sum      -1    258.5  129.82  243.42      0    261.8  128.18  240.33      0
    67108864      16777216     float     sum      -1    369.4  181.69  340.67      0    371.2  180.79  338.98      0
   134217728      33554432     float     sum      -1    638.5  210.22  394.17      0    587.2  228.57  428.56      0
   268435456      67108864     float     sum      -1    940.3  285.49  535.29      0    950.7  282.36  529.43      0
   536870912     134217728     float     sum      -1   1695.2  316.70  593.81      0   1666.9  322.08  603.89      0
  1073741824     268435456     float     sum      -1   3229.9  332.44  623.33      0   3201.8  335.35  628.78      0
  2147483648     536870912     float     sum      -1   6107.7  351.61  659.26      0   6157.1  348.78  653.97      0
  4294967296    1073741824     float     sum      -1    11952  359.36  673.79      0    11942  359.65  674.34      0
  8589934592    2147483648     float     sum      -1    23563  364.55  683.52      0    23702  362.42  679.54      0
Out of bounds values : 0 OK
Avg bus bandwidth    : 165.789
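As a sanity check, for all_reduce the reported bus bandwidth relates to the algorithm bandwidth as busbw = algbw × 2(n−1)/n. With n = 16 GPUs (2 nodes × 8 GPUs), the first row gives 53.47 × 30/16 ≈ 100.25 GB/s, matching the busbw column.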
Source code
The source code of the task can be found at examples/clusters/nccl-tests/.dstack.yml.
Pass the configuration to dstack apply:
$ dstack apply -f examples/clusters/gcp/a3mega-nccl-tests.dstack.yml
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0

                                                     out-of-place                        in-place
        size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     8388608        131072     float    none      -1    166.6   50.34   47.19     N/A    164.1   51.11   47.92     N/A
    16777216        262144     float    none      -1    204.6   82.01   76.89     N/A    203.8   82.30   77.16     N/A
    33554432        524288     float    none      -1    284.0  118.17  110.78     N/A    281.7  119.12  111.67     N/A
    67108864       1048576     float    none      -1    447.4  150.00  140.62     N/A    443.5  151.31  141.86     N/A
   134217728       2097152     float    none      -1    808.3  166.05  155.67     N/A    801.9  167.38  156.92     N/A
   268435456       4194304     float    none      -1   1522.1  176.36  165.34     N/A   1518.7  176.76  165.71     N/A
   536870912       8388608     float    none      -1   2892.3  185.62  174.02     N/A   2894.4  185.49  173.89     N/A
  1073741824      16777216     float    none      -1   5532.7  194.07  181.94     N/A   5530.7  194.14  182.01     N/A
  2147483648      33554432     float    none      -1    10863  197.69  185.34     N/A    10837  198.17  185.78     N/A
  4294967296      67108864     float    none      -1    21481  199.94  187.45     N/A    21466  200.08  187.58     N/A
  8589934592     134217728     float    none      -1    42713  201.11  188.54     N/A    42701  201.16  188.59     N/A
Out of bounds values : 0 OK
Avg bus bandwidth    : 146.948
Source code
The source code of the task can be found at examples/clusters/gcp/a3mega-nccl-tests.dstack.yml.
Pass the configuration to dstack apply:
$ dstack apply -f examples/clusters/gcp/a3high-nccl-tests.dstack.yml
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0

                                                     out-of-place                        in-place
        size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     8388608        131072     float    none      -1    784.9   10.69   10.02       0    775.9   10.81   10.14       0
    16777216        262144     float    none      -1   1010.3   16.61   15.57       0    999.3   16.79   15.74       0
    33554432        524288     float    none      -1   1161.6   28.89   27.08       0   1152.9   29.10   27.28       0
    67108864       1048576     float    none      -1   1432.6   46.84   43.92       0   1437.8   46.67   43.76       0
   134217728       2097152     float    none      -1   2516.9   53.33   49.99       0   2491.7   53.87   50.50       0
   268435456       4194304     float    none      -1   5066.8   52.98   49.67       0   5131.4   52.31   49.04       0
   536870912       8388608     float    none      -1    10028   53.54   50.19       0    10149   52.90   49.60       0
  1073741824      16777216     float    none      -1    20431   52.55   49.27       0    20214   53.12   49.80       0
  2147483648      33554432     float    none      -1    40254   53.35   50.01       0    39923   53.79   50.43       0
  4294967296      67108864     float    none      -1    80896   53.09   49.77       0    78875   54.45   51.05       0
  8589934592     134217728     float    none      -1   160505   53.52   50.17       0   160117   53.65   50.29       0
Out of bounds values : 0 OK
Avg bus bandwidth    : 40.6043
Source code
The source code of the task can be found at examples/clusters/gcp/a3high-nccl-tests.dstack.yml.
Distributed training¶
You can use the standard distributed task example to run distributed training on A4 instances.
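A minimal sketch of such a task is shown below; train.py is a placeholder for your own training script, and the DSTACK_* variables are injected by dstack on every node of the fleet:

type: task
name: train-distrib
nodes: 2
python: "3.12"
commands:
  - pip install torch
  - |
    # torchrun picks up the cluster topology from dstack-injected variables
    torchrun \
      --nnodes=$DSTACK_NODES_NUM \
      --nproc_per_node=$DSTACK_GPUS_PER_NODE \
      --node_rank=$DSTACK_NODE_RANK \
      --master_addr=$DSTACK_MASTER_NODE_IP \
      --master_port=29500 \
      train.py  # placeholder for your training script
resources:
  gpu: B200:8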
You can use the standard distributed task example to run distributed training on A3 Mega instances. To enable GPUDirect-TCPXO, make sure the required NCCL environment variables are properly set, for example by adding the following commands at the beginning:
# ...
commands:
  - |
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile-ll128.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
# ...
You can use the standard distributed task example to run distributed training on A3 High/Edge instances. To enable GPUDirect-TCPX, make sure the required NCCL environment variables are properly set, for example by adding the following commands at the beginning:
# ...
commands:
  - |
    export NCCL_DEBUG=INFO
    NCCL_LIB_DIR="/usr/local/tcpx/lib64"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_CROSS_NIC=0
    export NCCL_ALGO=Ring
    export NCCL_PROTO=Simple
    export NCCL_NSOCKS_PERTHREAD=4
    export NCCL_SOCKET_NTHREADS=1
    export NCCL_NET_GDR_LEVEL=PIX
    export NCCL_P2P_PXN_LEVEL=0
    export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
    export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
    export NCCL_DYNAMIC_CHUNK_SIZE=524288
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_P2P_PCI_CHUNKSIZE=524288
    export NCCL_P2P_NVL_CHUNKSIZE=1048576
    export NCCL_BUFFSIZE=4194304
    export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
    export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191"
    export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=50000
    export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX="/run/tcpx"
# ...
In addition to distributed training, you can of course run regular tasks, dev environments, and services.
What's next¶
- Learn about dev environments, tasks, services
- Read the Clusters guide
- Check GCP's docs on using A4 and A3 Mega/High/Edge instances