GCP A3 High clusters

This example shows how to set up a GCP A3 High cluster with GPUDirect-TCPX optimized NCCL communication and run NCCL Tests on it using dstack.

Overview

GCP's A3 High instances are 8xH100 VMs with a maximum network bandwidth of 1,000 Gbps, the second best among GCP H100 instances after A3 Mega. To get that network performance, you need to set up GPUDirect-TCPX – the GCP technology for GPU RDMA over TCP. This involves:

  • Setting up four extra data NICs on every node, each NIC in a separate VPC.
  • Configuring a VM image with the GPUDirect-TCPX support.
  • Launching an RXDM service container.
  • Installing the GPUDirect-TCPX NCCL plugin.

dstack hides most of the setup complexity and provides optimized A3 High clusters out-of-the-box.

A3 Edge

This guide also applies to A3 Edge instances.

A3 Mega

A3 Mega instances use GPUDirect-TCPXO, which is an extension of GPUDirect-TCPX. See the A3 Mega guide for more details.

Configure GCP backend

First, configure the gcp backend for GPUDirect-TCPX support. You need to specify at least four extra_vpcs to use for the data NICs, as well as a vm_service_account that's authorized to pull GPUDirect-related Docker images:

projects:
  - name: main
    backends:
    - type: gcp
      project_id: $MYPROJECT # Replace $MYPROJECT
      extra_vpcs:
        - dstack-gpu-data-net-1
        - dstack-gpu-data-net-2
        - dstack-gpu-data-net-3
        - dstack-gpu-data-net-4
      regions: [europe-west4]
      vm_service_account: a3cluster-sa@$MYPROJECT.iam.gserviceaccount.com # Replace $MYPROJECT
      creds:
        type: default
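
This configuration goes into the dstack server config file (by default, ~/.dstack/server/config.yml). A minimal sketch of applying it, assuming the server runs locally:

# Paste the projects/backends configuration above into the server config
mkdir -p ~/.dstack/server
nano ~/.dstack/server/config.yml

# (Re)start the server so it picks up the gcp backend changes
dstack server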

Custom VPC

If you specify a non-default primary VPC, ensure it has a firewall rule allowing all traffic within the VPC. This is needed for MPI and NCCL to work. The default VPC already permits traffic within the VPC.
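
For reference, a rule like the following typically works; the VPC name and source range here are placeholders that you should replace with your own network's values:

# Hypothetical example: replace my-custom-vpc and the source range with your own
gcloud compute firewall-rules create my-custom-vpc-allow-internal \
    --network=my-custom-vpc \
    --action=ALLOW \
    --rules=all \
    --source-ranges=10.128.0.0/9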

Create extra VPCs

Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule:

# Specify the region where you intend to deploy the cluster
REGION="europe-west4"

for N in $(seq 1 4); do
  gcloud compute networks create dstack-gpu-data-net-$N \
      --subnet-mode=custom \
      --mtu=8244

  gcloud compute networks subnets create dstack-gpu-data-sub-$N \
      --network=dstack-gpu-data-net-$N \
      --region=$REGION \
      --range=192.168.$N.0/24

  gcloud compute firewall-rules create dstack-gpu-data-internal-$N \
      --network=dstack-gpu-data-net-$N \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=192.168.0.0/16
done
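
To double-check that the networks, subnets, and firewall rules were created before provisioning, you can list them (the filters simply match the names used above):

gcloud compute networks list --filter="name~dstack-gpu-data-net"
gcloud compute networks subnets list --regions=$REGION --filter="name~dstack-gpu-data-sub"
gcloud compute firewall-rules list --filter="name~dstack-gpu-data-internal"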

Create Service Account

Create a VM service account that allows VMs to access the pkg.dev registry:

PROJECT_ID=$(gcloud config get-value project)
gcloud iam service-accounts create a3cluster-sa \
    --display-name "Service Account for pulling GCR images"
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:a3cluster-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role="roles/artifactregistry.reader"
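
To confirm the binding took effect, you can inspect the project's IAM policy; this is just a sanity check and can be skipped:

gcloud projects get-iam-policy $PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:a3cluster-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
    --format="table(bindings.role)"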

Create A3 High fleet

Once you've configured the gcp backend, create the fleet configuration:

type: fleet
name: a3high-cluster
nodes: 2
placement: cluster
instance_types:
  - a3-highgpu-8g
spot_policy: auto

and apply the configuration:

$ dstack apply -f fleet.dstack.yml
 Project        main                           
 User           admin                          
 Configuration  fleet.dstack.yml               
 Type           fleet                          
 Fleet type     cloud                          
 Nodes          2                              
 Placement      cluster                        
 Resources      2..xCPU, 8GB.., 100GB.. (disk) 
 Spot policy    auto                           

 #  BACKEND  REGION        INSTANCE       RESOURCES           SPOT  PRICE      
 1  gcp      europe-west4  a3-highgpu-8g  208xCPU, 1872GB,    yes   $20.5688   
                                          8xH100 (80GB),                       
                                          100.0GB (disk)                       
 2  gcp      europe-west4  a3-highgpu-8g  208xCPU, 1872GB,    no    $58.5419   
                                          8xH100 (80GB),                       
                                          100.0GB (disk)                       

Fleet a3high-cluster does not exist yet.
Create the fleet? [y/n]: y

Provisioning...
---> 100%                    

dstack will provision two A3 High nodes with GPUDirect-TCPX configured.
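
You can check the provisioned instances at any point by listing the fleet:

$ dstack fleet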

Run NCCL Tests with GPUDirect-TCPX support

Once the nodes are provisioned, let's test the network by running NCCL Tests:

$ dstack apply -f examples/misc/a3high-clusters/nccl-tests.dstack.yml 

nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0

                                                              out-of-place                       in-place          
       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        131072     float    none      -1    784.9   10.69   10.02      0    775.9   10.81   10.14      0
    16777216        262144     float    none      -1   1010.3   16.61   15.57      0    999.3   16.79   15.74      0
    33554432        524288     float    none      -1   1161.6   28.89   27.08      0   1152.9   29.10   27.28      0
    67108864       1048576     float    none      -1   1432.6   46.84   43.92      0   1437.8   46.67   43.76      0
   134217728       2097152     float    none      -1   2516.9   53.33   49.99      0   2491.7   53.87   50.50      0
   268435456       4194304     float    none      -1   5066.8   52.98   49.67      0   5131.4   52.31   49.04      0
   536870912       8388608     float    none      -1    10028   53.54   50.19      0    10149   52.90   49.60      0
  1073741824      16777216     float    none      -1    20431   52.55   49.27      0    20214   53.12   49.80      0
  2147483648      33554432     float    none      -1    40254   53.35   50.01      0    39923   53.79   50.43      0
  4294967296      67108864     float    none      -1    80896   53.09   49.77      0    78875   54.45   51.05      0
  8589934592     134217728     float    none      -1   160505   53.52   50.17      0   160117   53.65   50.29      0
Out of bounds values : 0 OK
Avg bus bandwidth    : 40.6043

Done

Run NCCL workloads with GPUDirect-TCPX support

To take full advantage of GPUDirect-TCPX in your workloads, you need to properly set up the NCCL environment variables. This can be done with the following commands in your run configuration:

type: task
nodes: 2
commands:
  - |
    export NCCL_DEBUG=INFO
    NCCL_LIB_DIR="/usr/local/tcpx/lib64"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_CROSS_NIC=0
    export NCCL_ALGO=Ring
    export NCCL_PROTO=Simple
    export NCCL_NSOCKS_PERTHREAD=4
    export NCCL_SOCKET_NTHREADS=1
    export NCCL_NET_GDR_LEVEL=PIX
    export NCCL_P2P_PXN_LEVEL=0
    export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
    export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
    export NCCL_DYNAMIC_CHUNK_SIZE=524288
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_P2P_PCI_CHUNKSIZE=524288
    export NCCL_P2P_NVL_CHUNKSIZE=1048576
    export NCCL_BUFFSIZE=4194304
    export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
    export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191"
    export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=50000
    export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX="/run/tcpx"
    # run NCCL
resources:
  # Allocate some shared memory for NCCL
  shm_size: 16GB
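
As an illustration only, the "# run NCCL" placeholder above could be replaced with a distributed launch that uses the environment variables dstack injects into multi-node tasks (DSTACK_MASTER_NODE_IP, DSTACK_NODE_RANK, DSTACK_NODES_NUM, DSTACK_GPUS_PER_NODE). A minimal sketch with torchrun, where train.py is a hypothetical training script:

# Launch one process per GPU on every node; the processes inherit
# the NCCL settings exported above
torchrun \
    --nnodes=$DSTACK_NODES_NUM \
    --node_rank=$DSTACK_NODE_RANK \
    --nproc_per_node=$DSTACK_GPUS_PER_NODE \
    --master_addr=$DSTACK_MASTER_NODE_IP \
    --master_port=29500 \
    train.py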

Future plans

We're working on improving support for A3 High and A3 Edge by pre-building a dstack VM image optimized for GPUDirect-TCPX instead of relying on the COS image used now, similar to dstack's support for A3 Mega. This will make configuration easier, reduce provisioning time, and improve performance. We're in contact with GCP on this issue.

Source code

The source code for this example can be found in examples/misc/a3high-clusters.