GCP A4

This example shows how to set up a GCP A4 cluster with optimized RoCE networking and run NCCL Tests on it using dstack.

GCP A4 instances provide eight NVIDIA B200 GPUs per VM, each with 180 GB of memory. Each VM also has eight NVIDIA ConnectX-7 (CX-7) NICs that use RDMA over Converged Ethernet (RoCE), which makes A4 well suited to large-scale distributed deep learning.

Configure the GCP backend

First, configure the gcp backend for A4 RoCE support. Specify one VPC in extra_vpcs for general traffic between nodes (in addition to the main VPC), and one VPC in roce_vpcs for GPU-to-GPU communication.

projects:
- name: main
  backends:
  - type: gcp
    project_id: my-project
    creds:
      type: default
    vpc_name: my-vpc-0  # Main VPC (1 subnet, omit to use the default VPC)
    extra_vpcs:
    - my-vpc-1          # Extra VPC (1 subnet)
    roce_vpcs:
    - my-vpc-mrdma      # RoCE VPC (8 subnets, RoCE profile)

RoCE VPC setup

The VPC listed in roce_vpcs must be created with the RoCE profile and have eight subnets (one per GPU). Follow GCP's RoCE setup guide for details.
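
The eight subnets need non-overlapping IP ranges. As a rough planning aid, the sketch below splits one parent range into eight /24 blocks with Python's standard `ipaddress` module. The parent range `192.168.0.0/21` and the `-sub-N` naming scheme are assumptions chosen for illustration, not values required by dstack or GCP; substitute ranges that do not collide with your other VPCs.

```python
import ipaddress

# Hypothetical values: adjust to your own address plan.
PARENT_RANGE = "192.168.0.0/21"  # assumed free range, large enough for eight /24 blocks
ROCE_VPC = "my-vpc-mrdma"        # VPC name used in the backend config above
SUBNET_COUNT = 8                 # the RoCE VPC needs eight subnets

parent = ipaddress.ip_network(PARENT_RANGE)
subnets = list(parent.subnets(new_prefix=24))
if len(subnets) != SUBNET_COUNT:
    raise ValueError(f"{PARENT_RANGE} yields {len(subnets)} /24 subnets, expected {SUBNET_COUNT}")

# Map an illustrative subnet name to its CIDR range.
plan = {f"{ROCE_VPC}-sub-{i}": str(net) for i, net in enumerate(subnets)}

for name, cidr in plan.items():
    print(f"{name}  {cidr}")
```

The printed ranges can then be used when creating the subnets by following GCP's RoCE setup guide.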

Firewall rules

Ensure all VPCs allow internal traffic between nodes for MPI/NCCL to function.

Create a fleet

Define your fleet configuration:

type: fleet
name: a4-cluster

nodes: 2
placement: cluster

# Specify the zone where you have configured the RoCE VPC
availability_zones: [us-west2-c]
backends: [gcp]
spot_policy: auto

resources:
  gpu: B200:8

Then apply it with dstack apply:

$ dstack apply -f examples/clusters/a4/fleet.dstack.yml

Provisioning...
---> 100%

 FLEET       INSTANCE  BACKEND         GPU                  PRICE    STATUS  CREATED
 a4-cluster  0         gcp (us-west2)  B200:180GB:8 (spot)  $51.552  idle    9 mins ago
             1         gcp (us-west2)  B200:180GB:8 (spot)  $51.552  idle    9 mins ago

dstack will provision the instances and set up ten network interfaces on each instance:

1 regular network interface in the main VPC (vpc_name)
1 regular interface in an extra VPC (extra_vpcs)
8 RoCE-enabled interfaces in a dedicated VPC (roce_vpcs)

Spot instances

Currently, the gcp backend can provision A4 instances only as spot instances.

Run NCCL tests

To validate networking and GPU performance, you can run NCCL tests:

$ dstack apply -f examples/clusters/nccl-tests/.dstack.yml

Provisioning...
---> 100%

  nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
        size         count      type   redop    root     time   algbw   busbw  wrong     time   algbw   busbw  wrong
         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608       2097152     float     sum      -1    156.9   53.47  100.25      0    167.6   50.06   93.86      0
    16777216       4194304     float     sum      -1    196.3   85.49  160.29      0    206.2   81.37  152.57      0
    33554432       8388608     float     sum      -1    258.5  129.82  243.42      0    261.8  128.18  240.33      0
    67108864      16777216     float     sum      -1    369.4  181.69  340.67      0    371.2  180.79  338.98      0
   134217728      33554432     float     sum      -1    638.5  210.22  394.17      0    587.2  228.57  428.56      0
   268435456      67108864     float     sum      -1    940.3  285.49  535.29      0    950.7  282.36  529.43      0
   536870912     134217728     float     sum      -1   1695.2  316.70  593.81      0   1666.9  322.08  603.89      0
  1073741824     268435456     float     sum      -1   3229.9  332.44  623.33      0   3201.8  335.35  628.78      0
  2147483648     536870912     float     sum      -1   6107.7  351.61  659.26      0   6157.1  348.78  653.97      0
  4294967296    1073741824     float     sum      -1    11952  359.36  673.79      0    11942  359.65  674.34      0
  8589934592    2147483648     float     sum      -1    23563  364.55  683.52      0    23702  362.42  679.54      0
  Out of bounds values : 0 OK
  Avg bus bandwidth    : 165.789
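
To read the table, it helps to know how the two bandwidth columns relate. In NCCL tests, algbw is the message size divided by the elapsed time, and for all-reduce busbw is algbw scaled by 2(n-1)/n, where n is the number of ranks. The sketch below rechecks three out-of-place rows copied from the output above. It assumes one rank per GPU, so n = 2 nodes x 8 GPUs = 16; the helper names are illustrative.

```python
# Rows copied from the out-of-place columns of the output above:
# (message size in bytes, time in microseconds, reported algbw, reported busbw)
ROWS = [
    (8388608, 156.9, 53.47, 100.25),
    (1073741824, 3229.9, 332.44, 623.33),
    (8589934592, 23563.0, 364.55, 683.52),
]

NODES = 2           # from the fleet configuration
GPUS_PER_NODE = 8   # B200:8
RANKS = NODES * GPUS_PER_NODE  # assumption: one rank per GPU


def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(algbw, ranks):
    """All-reduce bus bandwidth: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (ranks - 1) / ranks


for size, time_us, reported_alg, reported_bus in ROWS:
    alg = algbw_gbps(size, time_us)
    bus = allreduce_busbw_gbps(alg, RANKS)
    print(
        f"{size:>11} B  algbw {alg:7.2f} (reported {reported_alg:7.2f})  "
        f"busbw {bus:7.2f} (reported {reported_bus:7.2f})"
    )
```

The recomputed values agree with the reported ones to within rounding of the printed times. The average bus bandwidth on the last line is much lower than the peak, most likely because it is taken over every message size in the run, including small sizes not listed in the excerpt above.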

What's next

Learn more about distributed tasks
Check dev environments, services, and fleets
Read the Clusters guide