AWS EFA

In this guide, we’ll walk through how to run high-performance distributed training on AWS using Amazon Elastic Fabric Adapter (EFA) with dstack.

Overview

EFA is a network interface for Amazon EC2 that enables low-latency, high-bandwidth inter-node communication — essential for scaling distributed deep learning. With dstack, EFA is automatically enabled when you create fleets with supported instance types.

Prerequisite

Before you start, make sure the aws backend is properly configured.

projects:
- name: main
  backends:
    - type: aws
      creds:
        type: default
      regions: ["us-west-2"]

      public_ips: false
      vpc_name: my-custom-vpc

Multiple network interfaces

To use P4, P5, or P6 instances, set public_ips to false so that AWS can attach multiple network interfaces for EFA. In this case, make sure the dstack server can reach your VPC’s private subnets.

VPC

If you use a custom VPC, make sure it allows all internal traffic between nodes so that EFA can function properly.
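
For example, assuming the nodes share a single self-referencing security group (the group ID below is a placeholder), you could allow all inbound traffic between them with a rule like this; --protocol -1 means all protocols. All outbound traffic to the group must be allowed as well, which the default egress rule already covers:

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol -1 \
    --source-group sg-0123456789abcdef0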

Create a fleet

Once your backend is ready, define a fleet configuration.

type: fleet
name: my-efa-fleet

nodes: 2
placement: cluster

resources:
  gpu: H100:8

Provision the fleet with dstack apply:

$ dstack apply -f examples/clusters/efa/fleet.dstack.yml

Provisioning...
---> 100%

 FLEET         INSTANCE  BACKEND          INSTANCE TYPE  GPU          PRICE   STATUS  CREATED 
 my-efa-fleet  0         aws (us-west-2)  p5.48xlarge    H100:8:80GB  $98.32  idle    3 mins ago
               1         aws (us-west-2)  p5.48xlarge    H100:8:80GB  $98.32  idle    3 mins ago

Instance types

dstack selects suitable instances automatically, but not all instance types support EFA. To guarantee that EFA is used, you can specify instance_types explicitly:

type: fleet
name: my-efa-fleet

nodes: 2
placement: cluster

resources:
  gpu: L4

instance_types: ["g6.8xlarge"] # If not specified, g6.xlarge is used (won't have EFA)

Run NCCL tests

To confirm that EFA is working, run NCCL tests:

type: task
name: nccl-tests

nodes: 2

startup_order: workers-first
stop_criteria: master-done

env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

resources:
  gpu: 1..8
  shm_size: 16GB

Run it with dstack apply:

$ dstack apply -f examples/clusters/nccl-tests/.dstack.yml

Provisioning...
---> 100%
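
With NCCL_DEBUG=INFO set, the run logs should show which network provider NCCL picked up. If the libfabric utilities are available in the image (they may not be, depending on the image you use), you can also list the EFA interfaces directly from a node:

$ fi_info -p efa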

Docker image

You can use your own Docker image by setting image. If omitted, dstack uses its default image, which comes with drivers, NCCL tests, and related tools pre-installed.
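
For example, here is a minimal sketch of the same task pointing to a custom image; the image name is a placeholder, and the image is assumed to bundle EFA-enabled NCCL, the aws-ofi-nccl plugin, and the NCCL tests binaries:

type: task
name: nccl-tests

nodes: 2

# Placeholder image name; must include EFA-ready NCCL, aws-ofi-nccl, and the NCCL tests
image: ghcr.io/my-org/nccl-tests-efa:latest

# The remaining options (startup_order, env, commands, resources) stay the same as above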

Run distributed training

Here’s an example using torchrun for a simple multi-node PyTorch job:

type: task
name: train-distrib

nodes: 2

python: 3.12
env:
  - NCCL_DEBUG=INFO
commands:
  - git clone https://github.com/pytorch/examples.git pytorch-examples
  - cd pytorch-examples/distributed/ddp-tutorial-series
  - uv pip install -r requirements.txt
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      --master-port=12345 \
      multinode.py 50 10

resources:
  gpu: 1..8
  shm_size: 16GB

Provision and launch it via dstack apply:

$ dstack apply -f examples/distributed-training/torchrun/.dstack.yml

Provisioning...
---> 100%

Instead of setting python, you can specify your own Docker image using image. Make sure that the image is properly configured for EFA.

What's next

  1. Learn more about distributed tasks
  2. Check dev environments, services, and fleets
  3. Read the Clusters guide