Axolotl

This example walks you through how to run distributed fine-tuning using Axolotl with dstack.

Prerequisites

Once dstack is installed, clone the repo and run dstack init.

$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init

Create a fleet

Before submitting distributed training runs, make sure to create a fleet with placement set to cluster.
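
Below is a minimal sketch of a cluster fleet configuration; the name, node count, and GPU resources are placeholders, so adjust them to your setup:

type: fleet
name: my-fleet

nodes: 2
placement: cluster

resources:
  gpu: 80GB:8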

For more details on how to use clusters with dstack, check the Clusters guide.

Define a configuration

Once the fleet is created, define a distributed task configuration. Here's an example of a distributed QLoRA task using FSDP.

type: task
name: axolotl-multi-node-qlora-llama3-70b

nodes: 2

image: nvcr.io/nvidia/pytorch:25.01-py3

env:
  - HF_TOKEN
  - WANDB_API_KEY
  - WANDB_PROJECT
  - HUB_MODEL_ID
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  - NCCL_DEBUG=INFO
  - ACCELERATE_LOG_LEVEL=info

commands:
  # Replace the default Torch and FlashAttention in the NGC container with
  # Axolotl-compatible versions; the preinstalled versions are incompatible with Axolotl.
  - pip uninstall -y torch flash-attn
  - pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
  - pip install --no-build-isolation axolotl[flash-attn,deepspeed]
  - wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
  # Axolotl includes hf-xet version 1.1.0, which fails during downloads. Replacing it with the latest version (1.1.2).
  - pip uninstall -y hf-xet
  - pip install hf-xet --no-cache-dir
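  # dstack automatically passes DSTACK_RUN_NAME, DSTACK_MASTER_NODE_IP, DSTACK_NODE_RANK,
  # DSTACK_NODES_NUM, and DSTACK_GPUS_NUM to each node, so the same launch command works on every node.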
  - |
    accelerate launch \
      --config_file=fsdp1.yaml \
      -m axolotl.cli.train qlora-fsdp-70b.yaml \
      --hub-model-id $HUB_MODEL_ID \
      --output-dir /checkpoints/qlora-llama3-70b \
      --wandb-project $WANDB_PROJECT \
      --wandb-name $DSTACK_RUN_NAME \
      --main_process_ip=$DSTACK_MASTER_NODE_IP \
      --main_process_port=8008 \
      --machine_rank=$DSTACK_NODE_RANK \
      --num_processes=$DSTACK_GPUS_NUM \
      --num_machines=$DSTACK_NODES_NUM

resources:
  gpu: 80GB:8
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints

Docker image

We are using nvcr.io/nvidia/pytorch:25.01-py3 from NGC because it includes the necessary libraries and packages for RDMA and InfiniBand support.

Apply the configuration

To run a configuration, use the dstack apply command.

$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ HUB_MODEL_ID=...
$ dstack apply -f examples/distributed-training/axolotl/.dstack.yml

 #  BACKEND       RESOURCES                       INSTANCE TYPE  PRICE       
 1  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0     idle 
 2  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0     idle  

Submit the run axolotl-multi-node-qlora-llama3-70b? [y/n]: y

Provisioning...
---> 100%
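
dstack attaches to the run and streams its logs as the nodes start training. If you need them later, you can fetch the logs again with the dstack logs command, using the run name from the configuration above:

$ dstack logs axolotl-multi-node-qlora-llama3-70b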

Source code

The source code of this example can be found in examples/distributed-training/axolotl.

What's next?

  1. Read the clusters guide
  2. Check dev environments, tasks, services, and fleets