`.
Each component is optional.
Ranges can be:
* **Closed** (e.g. `24GB..80GB` or `1..8`)
* **Open** (e.g. `24GB..` or `1..`)
* **Single values** (e.g. `1` or `24GB`).
Examples:
- `1` (any GPU)
- `amd:2` (two AMD GPUs)
- `A100` (A100)
- `24GB..` (any GPU with 24GB of memory or more)
- `24GB..40GB:2` (two GPUs between 24GB and 40GB)
- `A10G,A100` (either A10G or A100)
- `A100:80GB` (one A100 of 80GB)
- `A100:2` (two A100)
- `MI300X:4` (four MI300X)
- `A100:40GB:2` (two A100 40GB)
- `tpu:v2-8` (`v2` Google Cloud TPU with 8 cores)
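To make the range forms concrete, here is a minimal Python sketch of how a single range component could be interpreted. It is an illustration only, not `dstack`'s actual parser: the `parse_range` helper is hypothetical, and it assumes memory values are always written with a `GB` suffix.

```python
from typing import Optional, Tuple


def parse_range(spec: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (min, max) for a range such as '24GB..80GB', '24GB..' or '1'.

    None means the bound is open. Hypothetical helper, for illustration only.
    """

    def to_number(token: str) -> Optional[float]:
        token = token.strip().upper()
        if not token:
            return None  # open bound
        if token.endswith("GB"):
            token = token[:-2]  # assumption: only the GB suffix is handled
        return float(token)

    if ".." in spec:
        low, high = spec.split("..", 1)
        return to_number(low), to_number(high)
    value = to_number(spec)
    return value, value  # a single value is a range with equal bounds


print(parse_range("24GB..80GB"))  # closed range
print(parse_range("24GB.."))      # open range
print(parse_range("1"))           # single value
```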
The GPU vendor is indicated by one of the following case-insensitive values:
- `nvidia` (NVIDIA GPUs)
- `amd` (AMD GPUs)
- `tpu` (Google Cloud TPUs)
??? info "AMD"
Currently, when an AMD GPU is specified, either by name or by vendor, the `image` property must be specified as well.
??? info "TPU"
Currently, only 8 TPU cores can be specified, which means only single-host workloads are supported.
Support for multiple hosts is coming soon.
## Offers
If you're not sure which offers (hardware configurations) are available with the configured backends, use the
[`dstack offer`](../reference/cli/dstack/offer.md#list-gpu-offers) command.
```shell
$ dstack offer --gpu H100 --max-offers 10
Getting offers...
---> 100%
# BACKEND REGION INSTANCE TYPE RESOURCES SPOT PRICE
1 verda FIN-01 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
2 verda FIN-02 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
3 verda FIN-02 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
4 verda ICE-01 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
5 runpod US-KS-2 NVIDIA H100 PCIe 16xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.39
6 runpod CA NVIDIA H100 80GB HBM3 24xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.69
7 nebius eu-north1 gpu-h100-sxm 16xCPU, 200GB, 1xH100 (80GB), 100.0GB (disk) no $2.95
8 runpod AP-JP-1 NVIDIA H100 80GB HBM3 20xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.99
9 runpod CA-MTL-1 NVIDIA H100 80GB HBM3 28xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.99
10 runpod CA-MTL-2 NVIDIA H100 80GB HBM3 26xCPU, 125GB, 1xH100 (80GB), 100.0GB (disk) no $2.99
...
Shown 10 of 99 offers, $127.816 max
```
By default, `dstack offer` ignores fleet configurations and shows all available offers that match the request.
To inspect offers available through a specific fleet, pass `--fleet NAME`.
??? info "Grouping offers"
Use `--group-by` to aggregate offers. Accepted values: `gpu`, `backend`, `region`, and `count`.
```shell
dstack offer --gpu b200 --group-by gpu,backend,region
Project main
User admin
Resources cpu=2.. mem=8GB.. disk=100GB.. b200:1..
Spot policy auto
Max price -
Reservation -
Group by gpu, backend, region
# GPU SPOT $/GPU BACKEND REGION
1 B200:180GB:1..8 spot, on-demand 3.59..5.99 runpod EU-RO-1
2 B200:180GB:1..8 spot, on-demand 3.59..5.99 runpod US-CA-2
3 B200:180GB:8 on-demand 4.99 lambda us-east-1
4 B200:180GB:8 on-demand 5.5 nebius us-central1
```
When using `--group-by`, `gpu` must always be included.
The `region` value can only be used together with `backend`.
The `offer` command allows you to filter and group offers with various [advanced options](../reference/cli/dstack/offer.md#usage).
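The `--group-by` rules described above can be summarized as a small validation routine. The Python sketch below is hypothetical (the `validate_group_by` helper is not part of `dstack`) and encodes only the constraints stated in this section.

```python
from typing import List

# Accepted values, as listed in this section
ACCEPTED = ("gpu", "backend", "region", "count")


def validate_group_by(values: List[str]) -> List[str]:
    """Return a list of rule violations for a --group-by value list."""
    errors = []
    for value in values:
        if value not in ACCEPTED:
            errors.append(f"unknown value: {value}")
    if "gpu" not in values:
        errors.append("gpu must always be included")
    if "region" in values and "backend" not in values:
        errors.append("region can only be used together with backend")
    return errors


print(validate_group_by(["gpu", "backend", "region"]))  # no violations
print(validate_group_by(["gpu", "region"]))             # region without backend
```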
## Metrics
`dstack` tracks essential metrics accessible via the CLI and UI. To access advanced metrics like DCGM, configure the server to export metrics to Prometheus. See [Metrics](../concepts/metrics.md) for details.
## Service quotas
If you're using your own AWS, GCP, Azure, or OCI accounts, before you can use GPUs or spot instances, you have to request the
corresponding service quotas for each type of instance in each region.
??? info "AWS"
Check this [guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) on EC2 service quotas.
The relevant service quotas include:
- `Running On-Demand P instances` (on-demand V100, A100 80GB x8)
- `All P4, P3 and P2 Spot Instance Requests` (spot V100, A100 80GB x8)
- `Running On-Demand G and VT instances` (on-demand T4, A10G, L4)
- `All G and VT Spot Instance Requests` (spot T4, A10G, L4)
- `Running Dedicated p5 Hosts` (on-demand H100)
- `All P5 Spot Instance Requests` (spot H100)
??? info "GCP"
Check this [guide](https://cloud.google.com/compute/resource-usage) on Compute Engine service quotas.
The relevant service quotas include:
- `NVIDIA V100 GPUs` (on-demand V100)
- `Preemptible V100 GPUs` (spot V100)
- `NVIDIA T4 GPUs` (on-demand T4)
- `Preemptible T4 GPUs` (spot T4)
- `NVIDIA L4 GPUs` (on-demand L4)
- `Preemptible L4 GPUs` (spot L4)
- `NVIDIA A100 GPUs` (on-demand A100)
- `Preemptible A100 GPUs` (spot A100)
- `NVIDIA A100 80GB GPUs` (on-demand A100 80GB)
- `Preemptible A100 80GB GPUs` (spot A100 80GB)
- `NVIDIA H100 GPUs` (on-demand H100)
- `Preemptible H100 GPUs` (spot H100)
??? info "Azure"
Check this [guide](https://learn.microsoft.com/en-us/azure/quotas/quickstart-increase-quota-portal) on Azure service quotas.
The relevant service quotas include:
- `Total Regional Spot vCPUs` (any spot instances)
- `Standard NCASv3_T4 Family vCPUs` (on-demand T4)
- `Standard NVADSA10v5 Family vCPUs` (on-demand A10)
- `Standard NCADS_A100_v4 Family vCPUs` (on-demand A100 80GB)
- `Standard NDASv4_A100 Family vCPUs` (on-demand A100 40GB x8)
- `Standard NDAMSv4_A100 Family vCPUs` (on-demand A100 80GB x8)
- `Standard NCadsH100v5 Family vCPUs` (on-demand H100)
- `Standard NDSH100v5 Family vCPUs` (on-demand H100 x8)
??? info "OCI"
Check this [guide](https://docs.oracle.com/en-us/iaas/Content/General/Concepts/servicelimits.htm#Requesti) on requesting OCI service limits increase.
The relevant service category is compute. The relevant resources include:
- `GPUs for GPU.A10 based VM and BM instances` (on-demand A10)
- `GPUs for GPU2 based VM and BM instances` (on-demand P100)
- `GPUs for GPU3 based VM and BM instances` (on-demand V100)
Note that for AWS, GCP, and Azure, service quota values are measured in the number of CPUs rather than GPUs.
[//]: # (TODO: Mention spot policy)
# docs/guides/upgrade.md
---
title: Upgrade
description: Upgrading to newer versions of dstack
---
# Upgrade guide
## 0.20.* { #0_20 }
### CLI compatibility
- CLI versions `0.19.*` and earlier remain backward compatible with the `0.20.*` `dstack` server.
- CLI versions `0.20.*` are not compatible with server versions prior to `0.20.*`.
> Do not upgrade the CLI to `0.20.*` until the server has been upgraded.
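The two compatibility rules above can be expressed as a simple check. This is a hypothetical Python sketch, not part of the `dstack` CLI: `cli_works_with_server` is an illustrative name, and the function covers only the `0.19.*`/`0.20.*` combinations described in this guide.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.20.3'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def cli_works_with_server(cli: str, server: str) -> bool:
    """Encode only the two rules stated in this guide (hypothetical helper)."""
    if parse_version(cli) >= (0, 20) and parse_version(server) < (0, 20):
        return False  # a 0.20.* CLI cannot talk to a pre-0.20 server
    return True  # 0.19.* and earlier CLIs keep working with a 0.20.* server


print(cli_works_with_server("0.19.40", "0.20.0"))  # older CLI, upgraded server
print(cli_works_with_server("0.20.1", "0.19.40"))  # upgraded CLI, older server
```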
### Fleets
* Prior to `0.20`, `dstack` automatically provisioned a fleet if one did not exist at run time.
Beginning with `0.20`, `dstack` will only use existing fleets.
> Create fleets before submitting runs. To enable on-demand instance provisioning, configure `nodes` as a range in the [backend fleet](../concepts/fleets.md#backend-fleets) configuration.
### Working directory
- Previously, when `working_dir` was not specified, `dstack` defaulted to `/workflow`. As of `0.20`, `dstack` uses the working directory defined in the Docker image. If the image does not define a working directory, `dstack` falls back to `/`.
- The default image introduced in `0.20` uses `/dstack/run` as its default working directory.
> To override the directory defined in the Docker image, specify [`working_dir`](../concepts/dev-environments.md#working-directory) explicitly.
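The resolution order described above can be sketched in a few lines of Python. The `resolve_working_dir` function is hypothetical and only mirrors the `0.20` behavior as stated in this section; it is not `dstack`'s implementation.

```python
from typing import Optional


def resolve_working_dir(configured: Optional[str], image_workdir: Optional[str]) -> str:
    """Pick the working directory the way this section describes for 0.20."""
    if configured:
        return configured      # an explicit `working_dir` always wins
    if image_workdir:
        return image_workdir   # otherwise use the directory defined in the image
    return "/"                 # fall back when the image defines none


print(resolve_working_dir(None, "/dstack/run"))
print(resolve_working_dir(None, None))
```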
### Repo directory
- Previously, if no [repo directory](../concepts/dev-environments.md#repos) was specified, `dstack` cloned the repository into `/workflow`. With `0.20`, the working directory becomes the default repo directory.
- In earlier versions, cloning was skipped if the repo directory was non-empty. Starting with `0.20`, this results in a `runner error` unless `if_exists` is set to `skip` in the repo configuration.
> Ensure repo directories are empty, or explicitly set `if_exists` to `skip`.
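As a rough illustration of the `0.20` cloning behavior, the Python sketch below models the decision for a repo directory. It is an assumption-laden sketch: `plan_clone` is a hypothetical name, and the default value `"error"` for `if_exists` is assumed here, since this guide only names the `skip` value.

```python
def plan_clone(repo_dir_is_empty: bool, if_exists: str = "error") -> str:
    """Decide what happens to the repo directory under the 0.20 rules."""
    if repo_dir_is_empty:
        return "clone"
    if if_exists == "skip":
        return "skip"  # keep the existing contents, do not clone
    # Assumed default: any other value leads to the runner error
    raise RuntimeError("runner error: repo directory is not empty")


print(plan_clone(repo_dir_is_empty=True))
print(plan_clone(repo_dir_is_empty=False, if_exists="skip"))
```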
### Deprecated feature removal
The following deprecated commands have been removed in **0.20**:
- `dstack config`
- `dstack stats`
- `dstack gateway create`
Use the corresponding replacements:
- `dstack project`
- `dstack metrics`
- `dstack apply`
> For more details on the changes, see the [release notes](https://github.com/dstackai/dstack/releases).
# docs/guides/migration/slurm.md
---
title: Migrate from Slurm
description: This guide compares Slurm and dstack, and shows how to orchestrate equivalent GPU-based workloads using dstack.
---
# Migrate from Slurm
Both Slurm and `dstack` are open-source workload orchestration systems designed to manage compute resources and schedule jobs. This guide compares the two systems and maps Slurm features to their `dstack` equivalents.
!!! tip "Slurm vs dstack"
Slurm is a battle-tested system with decades of production use in HPC environments. `dstack` is designed for modern ML/AI workloads with cloud-native provisioning and container-first architecture. Slurm is better suited for traditional HPC centers with static clusters; `dstack` is better suited for cloud-native ML teams working with cloud GPUs. Both systems can handle distributed training and batch workloads.
| | Slurm | dstack |
|---|-------|--------|
| **Provisioning** | Pre-configured static clusters; cloud requires third-party integrations with potential limitations | Native integration with top GPU clouds; automatically provisions clusters on demand |
| **Containers** | Optional via plugins | Built around containers from the ground up |
| **Use cases** | Batch job scheduling and distributed training | Interactive development, distributed training, and production inference services |
| **Personas** | HPC centers, academic institutions, research labs | ML engineering teams, AI startups, cloud-native organizations |
While `dstack` is designed to be use-case agnostic and supports both development and production-grade inference, this guide focuses specifically on training workloads.
## Architecture
Both Slurm and `dstack` follow a client-server architecture with a control plane and a compute plane running on cluster instances.
| | Slurm | dstack |
|---|---------------|-------------------|
| **Control plane** | `slurmctld` (controller) | `dstack-server` |
| **State persistence** | `slurmdbd` (database) | `dstack-server` (SQLite/PostgreSQL) |
| **API** | `slurmrestd` (REST API) | `dstack-server` (HTTP API) |
| **Compute plane** | `slurmd` (compute agent) | `dstack-shim` (on VMs/hosts) and/or `dstack-runner` (inside containers) |
| **Client** | CLI from login nodes | CLI from anywhere |
| **High availability** | Active-passive failover (typically 2 controller nodes) | Horizontal scaling with multiple server replicas (requires PostgreSQL) |
## Job configuration and submission
Both Slurm and `dstack` allow defining jobs as files and submitting them via CLI.
### Slurm
Slurm uses shell scripts with `#SBATCH` directives embedded in the script:
```bash
#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=2:00:00
#SBATCH --partition=gpu
#SBATCH --output=train-%j.out
#SBATCH --error=train-%j.err
export HF_TOKEN
export LEARNING_RATE="${LEARNING_RATE:-0.001}"  # Default; a value passed via --export takes precedence
module load python/3.9
srun python train.py --batch-size=64
```
Submit the job from a login node (with environment variables that override script defaults):
```shell
$ sbatch --export=ALL,LEARNING_RATE=0.002 train.sh
Submitted batch job 12346
```
### dstack
`dstack` uses declarative YAML configuration files:
```yaml
type: task
name: train-model
python: 3.9
repos:
- .
env:
- HF_TOKEN
- LEARNING_RATE=0.001
commands:
- python train.py --batch-size=64
resources:
  gpu: 1
  memory: 32GB
  cpu: 8
  shm_size: 8GB
max_duration: 2h
```
Submit the job from anywhere (laptop, CI/CD) via the CLI. `dstack apply` allows overriding various options and runs in attached mode by default, streaming job output in real-time:
```shell
$ dstack apply -f .dstack.yml --env LEARNING_RATE=0.002
# BACKEND REGION RESOURCES SPOT PRICE
1 aws us-east-1 4xCPU, 16GB, T4:1 yes $0.10
Submit the run train-model? [y/n]: y
Launching `train-model`...
---> 100%
```
### Configuration comparison
| | Slurm | dstack |
|---|-------|--------|
| **File type** | Shell script with `#SBATCH` directives | YAML configuration file (`.dstack.yml`) |
| **GPU** | `--gres=gpu:N` or `--gres=gpu:type:N` | `gpu: A100:80GB:4` or `gpu: 40GB..80GB:2..8` (supports ranges) |
| **Memory** | `--mem=M` (per node) or `--mem-per-cpu=M` | `memory: 200GB..` (range, per node, minimum requirement) |
| **CPU** | `--cpus-per-task=C` or `--ntasks` | `cpu: 32` (per node) |
| **Shared memory** | Configured on host | `shm_size: 24GB` (explicit) |
| **Duration** | `--time=2:00:00` | `max_duration: 2h` (both enforce walltime) |
| **Cluster** | `--partition=gpu` | `fleets: [gpu]` (see Partitions and fleets below) |
| **Output** | `--output=train-%j.out` (writes files) | `dstack logs` or UI (streams via API) |
| **Working directory** | `--chdir=/path/to/dir` or defaults to submission directory | `working_dir: /path/to/dir` (defaults to image's working directory, typically `/dstack/run`) |
| **Environment variables** | `export VAR` or `--export=ALL,VAR=value` | `env: - VAR` or `--env VAR=value` |
| **Node exclusivity** | `--exclusive` (entire node) | Automatic if `blocks` is not used or job uses all blocks; required for distributed tasks (`nodes` > 1) |
> For multi-node examples, see [Distributed training](#distributed-training) below.
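The mapping in the table above can be applied mechanically to simple scripts. The following Python sketch converts a small subset of `#SBATCH` directives into a dictionary shaped like a `dstack` task configuration. It is a hypothetical migration helper, not a `dstack` feature: the function names are illustrative, only the `H:MM:SS` and `D-HH:MM:SS` time formats and the `G` memory suffix are handled, and the output would still need to be written out as YAML and reviewed.

```python
import re
from typing import Any, Dict


def slurm_time_to_duration(value: str) -> str:
    """Convert 'H:MM:SS' or 'D-HH:MM:SS' to a duration such as '2h' or '90m'."""
    days, _, clock = value.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    # Round leftover seconds up to a full minute
    total_minutes = int(days or 0) * 24 * 60 + hours * 60 + minutes + (1 if seconds else 0)
    if total_minutes % 60 == 0:
        return f"{total_minutes // 60}h"
    return f"{total_minutes}m"


def sbatch_to_dstack(script: str) -> Dict[str, Any]:
    """Map a small subset of #SBATCH directives to a dstack task configuration."""
    config: Dict[str, Any] = {"type": "task"}
    resources: Dict[str, Any] = {}
    directives = re.findall(r"^#SBATCH\s+--([\w-]+)=(\S+)", script, flags=re.MULTILINE)
    for name, value in directives:
        if name == "job-name":
            config["name"] = value
        elif name == "nodes":
            config["nodes"] = int(value)
        elif name == "gres" and value.startswith("gpu:"):
            resources["gpu"] = value[len("gpu:"):]  # 'gpu:1' -> '1'
        elif name == "mem":
            resources["memory"] = value.rstrip("G") + "GB"  # '32G' -> '32GB'
        elif name == "cpus-per-task":
            resources["cpu"] = int(value)
        elif name == "time":
            config["max_duration"] = slurm_time_to_duration(value)
        # Other directives (partition, output, ...) are ignored in this sketch
    if resources:
        config["resources"] = resources
    return config


example = """#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=2:00:00
#SBATCH --partition=gpu
"""
print(sbatch_to_dstack(example))
```

Directives without a direct equivalent, such as `--partition` and `--output`, are deliberately left out because they map to fleets and to `dstack logs` rather than to configuration keys.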
## Containers
### Slurm
By default, Slurm runs jobs on compute nodes using the host OS with cgroups for resource isolation and full access to the host filesystem. Container execution is optional via plugins but requires explicit filesystem mounts.
=== "Singularity/Apptainer"
The container image must exist on a shared filesystem. Mount host directories with `--bind`:
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=2:00:00
# --nv exposes the host NVIDIA GPUs; use `apptainer exec` on Apptainer installations
srun singularity exec --nv \
  --bind /shared/datasets:/datasets,/shared/checkpoints:/checkpoints \
  /shared/images/pytorch-2.0-cuda11.8.sif \
  python train.py --batch-size=64
```
=== "Pyxis with Enroot"
Pyxis plugin pulls images from Docker registry. Mount host directories with `--container-mounts`:
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=2:00:00
srun --container-image=pytorch/pytorch:2.0.0-cuda11.8-cudnn8-runtime \
--container-mounts=/shared/datasets:/datasets,/shared/checkpoints:/checkpoints \
python train.py --batch-size=64
```
=== "Enroot"
Pulls images from registry. Mount host directories with `--container-mounts`:
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=2:00:00
srun --container-image=docker://pytorch/pytorch:2.0.0-cuda11.8-cudnn8-runtime \
--container-mounts=/shared/datasets:/datasets,/shared/checkpoints:/checkpoints \
python train.py --batch-size=64
```
### dstack
`dstack` always uses containers. If `image` is not specified, `dstack` uses a base Docker image with `uv`, `python`, essential CUDA drivers, and other dependencies. You can also specify your own Docker image:
=== "Public registry"
```yaml
type: task
name: train-with-image
image: pytorch/pytorch:2.0.0-cuda11.8-cudnn8-runtime
repos:
- .
commands:
- python train.py --batch-size=64
resources:
  gpu: 1
  memory: 32GB
```
=== "Private registry"
```yaml
type: task
name: train-ngc
image: nvcr.io/nvidia/pytorch:24.01-py3
registry_auth:
  username: $oauthtoken
  password: ${{ secrets.nvidia_ngc_api_key }}
repos:
- .
commands:
- python train.py --batch-size=64
resources:
  gpu: 1
  memory: 32GB
```
`dstack` can automatically upload files via `repos` or `files`, or mount filesystems via `volumes`. See [Filesystems and data access](#filesystems-and-data-access) below.
## Distributed training
Both Slurm and `dstack` schedule distributed workloads over clusters with fast interconnect, automatically propagating environment variables required by distributed frameworks (PyTorch DDP, DeepSpeed, FSDP, etc.).
### Slurm
Slurm explicitly controls both `nodes` and processes/tasks.
=== "PyTorch DDP"
```bash
#!/bin/bash
#SBATCH --job-name=distributed-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1 # One task per node
#SBATCH --gres=gpu:8 # 8 GPUs per node
#SBATCH --mem=200G
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
# Set up distributed training environment
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=12345
export MASTER_ADDR MASTER_PORT
# Launch training with torchrun (torch.distributed.launch is deprecated)
srun torchrun \
--nnodes="$SLURM_JOB_NUM_NODES" \
--nproc_per_node=8 \
--node_rank="$SLURM_NODEID" \
--rdzv_backend=c10d \
--rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
train.py \
--model llama-7b \
--batch-size=32 \
--epochs=10
```
=== "MPI"
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8  # Required so that SLURM_NTASKS_PER_NODE is set
#SBATCH --gres=gpu:8
#SBATCH --mem=200G
#SBATCH --time=24:00:00
export MASTER_ADDR=$(scontrol show hostnames $SLURM_NODELIST | head -n1)
export MASTER_PORT=12345
# Convert SLURM_JOB_NODELIST to hostfile format
HOSTFILE=$(mktemp)
scontrol show hostnames $SLURM_JOB_NODELIST | awk -v slots=$SLURM_NTASKS_PER_NODE '{print $0" slots="slots}' > $HOSTFILE
# MPI with NCCL tests or custom MPI application
mpirun \
--allow-run-as-root \
--hostfile $HOSTFILE \
-n $SLURM_NTASKS \
--bind-to none \
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
rm -f $HOSTFILE
```
### dstack
`dstack` only specifies `nodes`. A run with multiple nodes creates multiple jobs (one per node), each running in a container on a particular instance. Inside the job container, processes are determined by the user's `commands`.
=== "PyTorch DDP"
```yaml
type: task
name: distributed-train-pytorch
nodes: 4
python: 3.12
repos:
- .
env:
- NCCL_DEBUG=INFO
- NCCL_IB_DISABLE=0
- NCCL_SOCKET_IFNAME=eth0
commands:
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      --master-port=12345 \
      train.py \
      --model llama-7b \
      --batch-size=32 \
      --epochs=10
resources:
  gpu: A100:80GB:8
  memory: 200GB..
  shm_size: 24GB
max_duration: 24h
```
=== "MPI"
For MPI workloads that require specific job startup and termination behavior, `dstack` provides `startup_order` and `stop_criteria` properties. The master node (rank 0) runs the MPI command, while worker nodes wait for the master to complete.
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
env:
- NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
resources:
  gpu: nvidia:1..8
  shm_size: 16GB
```
If `startup_order` and `stop_criteria` are not configured (as in the PyTorch DDP example above), the master node starts first and waits until all workers terminate. For MPI workloads, this behavior needs to be changed.
#### Nodes and processes comparison
| | Slurm | dstack |
|---|-------|--------|
| **Nodes** | `--nodes=4` | `nodes: 4` |
| **Processes/tasks** | `--ntasks=8` or `--ntasks-per-node=2` (controls process distribution) | Determined by `commands` (relies on frameworks like `torchrun`, `accelerate`, `mpirun`, etc.) |
#### Environment variables comparison
| Slurm | dstack | Purpose |
|-------|--------|---------|
| `SLURM_NODELIST` | `DSTACK_NODES_IPS` | List of allocated nodes (a compact hostname list in Slurm; newline-delimited IPs in `dstack`) |
| `SLURM_NODEID` | `DSTACK_NODE_RANK` | Node rank (0-based) |
| `SLURM_PROCID` | N/A | Process rank (0-based, across all processes) |
| `SLURM_NTASKS` | `DSTACK_GPUS_NUM` | Total number of processes/GPUs |
| `SLURM_NTASKS_PER_NODE` | `DSTACK_GPUS_PER_NODE` | Number of processes/GPUs per node |
| `SLURM_JOB_NUM_NODES` | `DSTACK_NODES_NUM` | Number of nodes |
| Manual master address | `DSTACK_MASTER_NODE_IP` | Master node IP (automatically set) |
| N/A | `DSTACK_MPI_HOSTFILE` | Pre-populated MPI hostfile |
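
Since `dstack` has no direct counterpart to `SLURM_PROCID`, a training script that does not rely on a launcher such as `torchrun` can derive the global rank from the variables above. The Python sketch below is illustrative only: the helper names are hypothetical, and it assumes one process per GPU and the same number of GPUs on every node.

```python
import os
from typing import List, Mapping


def world_size(env: Mapping[str, str] = os.environ) -> int:
    """Total number of processes, assuming one process per GPU."""
    return int(env["DSTACK_GPUS_NUM"])


def global_rank(local_rank: int, env: Mapping[str, str] = os.environ) -> int:
    """Rank across all nodes, computed from the 0-based node rank."""
    return int(env["DSTACK_NODE_RANK"]) * int(env["DSTACK_GPUS_PER_NODE"]) + local_rank


def node_ips(env: Mapping[str, str] = os.environ) -> List[str]:
    """Split the newline-delimited DSTACK_NODES_IPS value into a list."""
    return [ip for ip in env["DSTACK_NODES_IPS"].splitlines() if ip]


# Example values as they might look on the second of two 8-GPU nodes
sample_env = {
    "DSTACK_NODE_RANK": "1",
    "DSTACK_GPUS_PER_NODE": "8",
    "DSTACK_GPUS_NUM": "16",
    "DSTACK_NODES_IPS": "10.0.0.1\n10.0.0.2",
}
print(global_rank(3, sample_env), world_size(sample_env), node_ips(sample_env))
```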
!!! info "Fleets"
Distributed tasks may run only on a fleet with `placement: cluster` configured. Refer to [Partitions and fleets](#partitions-and-fleets) for configuration details.
## Queueing and scheduling
Both systems support core scheduling features and efficient resource utilization.
| | Slurm | dstack |
|---------|-------|--------|
| **Prioritization** | Multi-factor system (fairshare, age, QOS); influenced via `--qos` or `--partition` flags | Set via `priority` (0-100); plus FIFO within the same priority |
| **Queueing** | Automatic via `sbatch`; managed through partitions | Set `on_events` to `[no-capacity]` under `retry` configuration |
| **Usage quotas** | Set via `sacctmgr` command per user/account/QOS | Not supported |
| **Backfill scheduling** | Enabled via `SchedulerType=sched/backfill` in `slurm.conf` | Not supported |
| **Preemption** | Configured via `PreemptType` in `slurm.conf` (QOS or partition-based) | Not supported |
| **Topology-aware scheduling** | Configured via `topology.conf` (InfiniBand switches, interconnects) | Not supported |
### Slurm
Slurm may use a multi-factor priority system, and limit usage across accounts, users, and runs.
#### QOS
Quality of Service (QOS) provides a static priority boost. Administrators create QOS levels and assign them to users as defaults:
```shell
$ sacctmgr add qos high_priority Priority=1000
$ sacctmgr modify qos high_priority set MaxWall=200:00:00 MaxTRES=gres/gpu=8
```
Users can override the default QOS when submitting jobs via CLI (`sbatch --qos=high_priority`) or in the job script:
```bash
#!/bin/bash
#SBATCH --qos=high_priority
```
#### Accounts and usage quotas
Usage quotas limit resource consumption and can be set per user, account, or QOS:
```shell
$ sacctmgr add account research
$ sacctmgr modify user user1 set account=research
$ sacctmgr modify user user1 set MaxWall=100:00:00 MaxTRES=gres/gpu=4
$ sacctmgr modify account research set MaxWall=1000:00:00 MaxTRES=gres/gpu=16
```
#### Monitoring commands
Slurm provides several CLI commands to check queue status, job details, and quota usage:
=== "Queue status"
Use `squeue` to check queue status. Jobs are listed in scheduling order by priority:
```shell
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES REASON
12345 gpu training user1 PD 0:00 2 Priority
```
=== "Job details"
Use `scontrol show job` to show detailed information about a specific job:
```shell
$ scontrol show job 12345
JobId=12345 JobName=training
UserId=user1(1001) GroupId=users(100)
Priority=4294 Reason=Priority (Resources)
```
=== "Quota usage"
The `sacct` command can show quota consumption per user, account, or QOS depending on the format options:
```shell
$ sacct -S 2024-01-01 -E 2024-01-31 --format=User,Account,TotalCPU,AllocTRES
User Account TotalCPU AllocTRES
user1 research 100:00:00 gres/gpu=50
```
#### Topology-aware scheduling
Slurm uses the network topology (InfiniBand switches, interconnects) to optimize multi-node job placement and minimize latency. The topology is configured in `topology.conf`, which is referenced from `slurm.conf`:
```bash
SwitchName=switch1 Nodes=node[01-10]
SwitchName=switch2 Nodes=node[11-20]
```
When scheduling multi-node jobs, Slurm prioritizes nodes connected to the same switch to minimize network latency.
### dstack
`dstack` has no concept of accounts or QOS, and doesn't support usage quotas yet.
#### Priority and retry policy
`dstack` does support prioritization (a single integer; no multi-factor priority or preemption) and job queueing via the retry policy.
```yaml
type: task
name: train-with-retry
python: 3.12
repos:
- .
commands:
- python train.py --batch-size=64
resources:
  gpu: 1
  memory: 32GB
# Priority: 0-100 (FIFO within same level; default: 0)
priority: 50
retry:
  on_events: [no-capacity]  # Retry until idle instances are available (enables queueing similar to Slurm)
  duration: 48h  # Maximum retry time (run age for no-capacity, time since last event for error/interruption)
max_duration: 2h
```
By default, the `retry` policy is not set, which means a run fails immediately if no capacity is available.
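The ordering implied by `priority` plus FIFO can be pictured as a two-key sort. The Python sketch below is only a model of that ordering, not `dstack`'s scheduler: the `Run` class and `scheduling_order` function are hypothetical, and it assumes that larger `priority` values are scheduled first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    name: str
    priority: int      # 0-100, as in the `priority` property
    submitted_at: int  # submission time, e.g. a Unix timestamp


def scheduling_order(runs: List[Run]) -> List[str]:
    """Higher priority first; earlier submission first within a priority."""
    ordered = sorted(runs, key=lambda run: (-run.priority, run.submitted_at))
    return [run.name for run in ordered]


queue = [
    Run("nightly-eval", priority=0, submitted_at=100),
    Run("train-b", priority=50, submitted_at=300),
    Run("train-a", priority=50, submitted_at=200),
]
print(scheduling_order(queue))
```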
#### Scheduled runs
Unlike Slurm, `dstack` supports scheduled runs using the `schedule` property with cron syntax, allowing tasks to start periodically at specific UTC times.
```yaml
type: task
name: task-with-cron
python: 3.12
repos:
- .
commands:
- python task.py --batch-size=64
resources:
  gpu: 1
  memory: 32GB
schedule:
  cron: "15 23 * * *" # every day at 23:15 UTC
```
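To check when a daily schedule like the one above would next fire, the time can be computed with the standard library. This Python sketch is a simplified illustration, not how `dstack` evaluates cron expressions: `next_daily_run` is a hypothetical helper that handles only expressions of the form `M H * * *` and expects a timezone-aware `now`.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(cron: str, now: datetime) -> datetime:
    """Next start time for an 'M H * * *' expression, evaluated in UTC."""
    minute, hour, day, month, weekday = cron.split()
    if (day, month, weekday) != ("*", "*", "*"):
        raise ValueError("only daily schedules are supported in this sketch")
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=int(hour), minute=int(minute), second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate


noon = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_daily_run("15 23 * * *", noon).isoformat())
```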
#### Monitoring commands
=== "Queue status"
The `dstack ps` command displays runs and jobs sorted by priority, reflecting the order in which they will be scheduled.
```shell
$ dstack ps
NAME BACKEND RESOURCES PRICE STATUS SUBMITTED
training-job aws H100:1 (spot) $4.50 provisioning 2 mins ago
```
#### Topology-aware scheduling
Topology-aware scheduling is not supported in `dstack`. While backend provisioning may respect network topology (e.g., cloud providers may provision instances with optimal inter-node connectivity), `dstack` task scheduling does not leverage topology-aware placement.
## Partitions and fleets
Partitions in Slurm and fleets in `dstack` both organize compute nodes for job scheduling. The key difference is that `dstack` fleets natively support dynamic cloud provisioning, whereas Slurm partitions organize pre-configured static nodes.
| | Slurm | dstack |
|---|-------|--------|
| **Provisioning** | Static nodes only | Supports both static clusters (SSH fleets) and dynamic provisioning via backends (cloud or Kubernetes) |
| **Overlap** | Nodes can belong to multiple partitions | Each instance belongs to exactly one fleet |
| **Accounts and projects** | Multiple accounts can use the same partition; used for quotas and resource accounting | Each fleet belongs to one project |
### Slurm
Slurm partitions are logical groupings of static nodes defined in `slurm.conf`. Nodes can belong to multiple partitions:
```bash
PartitionName=gpu Nodes=gpu-node[01-10] Default=NO MaxTime=24:00:00
PartitionName=cpu Nodes=cpu-node[01-50] Default=YES MaxTime=72:00:00
PartitionName=debug Nodes=gpu-node[01-10] Default=NO MaxTime=1:00:00
```
Submit to a specific partition:
```shell
$ sbatch --partition=gpu train.sh
Submitted batch job 12346
```
### dstack
`dstack` fleets are pools of instances (VMs or containers) that serve as both the organizational unit and the provisioning template.
`dstack` supports two types of fleets:
| Fleet type | Description |
|------------|-------------|
| **Backend fleets** | Dynamically provisioned via configured backends (cloud or Kubernetes). Specify `resources` and `nodes` range; `dstack apply` provisions matching instances/clusters automatically. |
| **SSH fleets** | Use existing on-premises servers/clusters via `ssh_config`. `dstack apply` connects via SSH and installs dependencies. |
=== "Backend fleets"
```yaml
type: fleet
name: gpu-fleet
nodes: 0..8
resources:
  gpu: A100:80GB:8
# Optional: Enables inter-node connectivity; required for distributed tasks
placement: cluster
# Optional: Splits each instance into blocks so that it can be shared across up to 8 workloads
blocks: 8
backends: [aws]
# Spot instances for cost savings
spot_policy: auto
```
=== "SSH fleets"
```yaml
type: fleet
name: on-prem-gpu-fleet
# Optional: Enables inter-node connectivity; required for distributed tasks
placement: cluster
# Optional: Allows sharing the instance across up to 8 workloads
blocks: 8
ssh_config:
  user: dstack
  identity_file: ~/.ssh/id_rsa
  hosts:
    - gpu-node01.example.com
    - gpu-node02.example.com
  # Optional: Only required if hosts are behind a login node (bastion host)
  proxy_jump:
    hostname: login-node.example.com
    user: dstack
    identity_file: ~/.ssh/login_node_key
```
Tasks with multiple nodes require a fleet with `placement: cluster` configured, otherwise they cannot run.
Submit to a specific fleet:
```shell
$ dstack apply -f train.dstack.yml --fleet gpu-fleet
BACKEND REGION RESOURCES SPOT PRICE
1 aws us-east-1 4xCPU, 16GB, T4:1 yes $0.10
Submit the run train-model? [y/n]: y
Launching `train-model`...
---> 100%
```
Create or update a fleet:
```shell
$ dstack apply -f fleet.dstack.yml
Provisioning...
---> 100%
```
List fleets:
```shell
$ dstack fleet
FLEET INSTANCE BACKEND GPU PRICE STATUS CREATED
gpu-fleet 0 aws (us-east-1) A100:80GB (spot) $0.50 idle 3 mins ago
```
## Filesystems and data access
Both Slurm and `dstack` allow workloads to access filesystems (including shared filesystems) and copy files.
| | Slurm | dstack |
|---|-------|--------|
| **Host filesystem access** | Full access by default (native processes); mounting required only for containers | Always uses containers; requires explicit mounting via `volumes` (instance or network) |
| **Shared filesystems** | Assumes global namespace (NFS, Lustre, GPFS); same path exists on all nodes | Supported via SSH fleets with instance volumes (pre-mounted network storage); network volumes for backend fleets (limited support for shared filesystems) |
| **Instance disk size** | Fixed by cluster administrator | Configurable via `disk` property in `resources` (tasks) or fleet configuration; supports ranges (e.g., `disk: 500GB` or `disk: 200GB..1TB`) |
| **Local/temporary storage** | `$SLURM_TMPDIR` (auto-cleaned on job completion) | Container filesystem (auto-cleaned on job completion; except instance volumes or network volumes) |
| **File transfer** | `sbcast` for broadcasting files to allocated nodes | `repos` and `files` properties; `rsync`/`scp` via SSH (when attached) |
### Slurm
Slurm assumes a shared filesystem (NFS, Lustre, GPFS) with a global namespace. The same path exists on all nodes, and `$SLURM_TMPDIR` provides local scratch space that is automatically cleaned.
=== "Native processes"
```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
# Global namespace - same path on all nodes
# Dataset accessible at same path on all nodes
DATASET_PATH=/shared/datasets/imagenet
# Local scratch (faster I/O, auto-cleaned)
# Copy dataset to local SSD for faster access
cp -r $DATASET_PATH $SLURM_TMPDIR/dataset
# Training with local dataset
python train.py \
--data=$SLURM_TMPDIR/dataset \
--checkpoint-dir=/shared/checkpoints \
--epochs=100
# $SLURM_TMPDIR automatically cleaned when job ends
# Checkpoints saved to shared filesystem persist
```
=== "Containers"
When using containers, shared filesystems must be explicitly mounted via bind mounts:
```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
# Shared filesystem mounted at /datasets and /checkpoints
DATASET_PATH=/datasets/imagenet
# Local scratch accessible via $SLURM_TMPDIR (host storage mounted into container)
# Copy dataset to local scratch, then train
srun --container-image=/shared/images/pytorch-2.0-cuda11.8.sif \
--container-mounts=/shared/datasets:/datasets,/shared/checkpoints:/checkpoints \
cp -r $DATASET_PATH $SLURM_TMPDIR/dataset
srun --container-image=/shared/images/pytorch-2.0-cuda11.8.sif \
--container-mounts=/shared/datasets:/datasets,/shared/checkpoints:/checkpoints \
python train.py \
--data=$SLURM_TMPDIR/dataset \
--checkpoint-dir=/checkpoints \
--epochs=100
# $SLURM_TMPDIR automatically cleaned when job ends
# Checkpoints saved to mounted shared filesystem persist
```
#### File broadcasting (sbcast)
Slurm provides `sbcast` to distribute files efficiently using its internal network topology, avoiding filesystem contention:
```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks=32
# Broadcast file to all allocated nodes
srun --ntasks=1 --nodes=1 sbcast /shared/data/input.txt /tmp/input.txt
# Use broadcasted file on all nodes
srun python train.py --input=/tmp/input.txt
```
### dstack
`dstack` supports both accessing filesystems (including shared filesystems) and uploading/downloading code/data from the client.
#### Instance volumes
Instance volumes mount host directories into containers. With distributed tasks, the host can use a shared filesystem (NFS, Lustre, GPFS) to share data across jobs within the same task:
```yaml
type: task
name: distributed-train
nodes: 4
python: 3.12
repos:
  - .
volumes:
  # Host directory (can be on shared filesystem) mounted into container
  - /mnt/shared/datasets:/data
  - /mnt/shared/checkpoints:/checkpoints
commands:
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      --master-port=12345 \
      train.py \
      --data=/data \
      --checkpoint-dir=/checkpoints
resources:
  gpu: A100:80GB:8
  memory: 200GB
```
#### Network volumes
Network volumes are persistent cloud storage (AWS EBS, GCP persistent disks, Runpod volumes).
Single-node task:
```yaml
type: task
name: train-model
python: 3.9
repos:
  - .
volumes:
  - name: imagenet-dataset
    path: /data
commands:
  - python train.py --data=/data --batch-size=64
resources:
  gpu: 1
  memory: 32GB
```
Network volumes cannot be used with distributed tasks (no multi-attach support), except where multi-attach is supported (Runpod) or via volume interpolation.
For distributed tasks, use interpolation to attach different volumes to each node.
```yaml
type: task
name: distributed-train
nodes: 4
python: 3.12
repos:
  - .
volumes:
  # Each node gets its own volume
  - name: dataset-${{ dstack.node_rank }}
    path: /data
commands:
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      --master-port=12345 \
      train.py \
      --data=/data
resources:
  gpu: A100:80GB:8
  memory: 200GB
```
Volume name interpolation is not the same as a shared filesystem—each node has its own separate volume. `dstack` currently has limited support for shared filesystems when using backend fleets.
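Because each node attaches a differently named volume, every one of those volumes has to exist before the run starts. The sketch below only illustrates which names the template expands to; `dstack` performs the real interpolation. It assumes node ranks start at 0, as `DSTACK_NODE_RANK` does, and `volume_names` is a hypothetical helper, not part of `dstack`.

```python
def volume_names(template, nodes):
    """Expand the ${{ dstack.node_rank }} placeholder for each node rank.

    Assumption: ranks are 0-based, matching DSTACK_NODE_RANK.
    """
    placeholder = "${{ dstack.node_rank }}"
    return [template.replace(placeholder, str(rank)) for rank in range(nodes)]
```

For the four-node task above, `volume_names("dataset-${{ dstack.node_rank }}", 4)` lists the four volume names to create in advance.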
#### Repos and files
The `repos` and `files` properties allow uploading code or data into the container.
=== "Repos"
The `repos` property clones Git repositories into the container. `dstack` clones the repo on the instance, applies local changes, and mounts it into the container. This is useful for code that needs to be version-controlled and synced.
```yaml
type: task
name: train-model
python: 3.9
repos:
  - .  # Clone current directory repo
commands:
  - python train.py --batch-size=64
resources:
  gpu: 1
  memory: 32GB
  cpu: 8
```
=== "Files"
The `files` property mounts local files or directories into the container. Each entry maps a local path to a container path.
```yaml
type: task
name: train-model
python: 3.9
files:
  - ../configs:~/configs
  - ~/.ssh/id_rsa:~/ssh/id_rsa
commands:
  - python train.py --config ~/configs/model.yaml --batch-size=64
resources:
  gpu: 1
  memory: 32GB
  cpu: 8
```
Files are uploaded to the instance and mounted into the container, but are not persisted across runs (2MB limit per file, configurable).
#### SSH file transfer
While attached to a run, you can transfer files via `rsync` or `scp` using the run name alias:
=== "rsync"
```shell
$ rsync -avz ./data/ <run name>:/path/inside/container/data/
```
=== "scp"
```shell
$ scp large-dataset.h5 <run name>:/path/inside/container/
```
> Uploading code/data from/to the client is not recommended as transfer speed greatly depends on network bandwidth between the CLI and the instance.
## Interactive development
Both Slurm and `dstack` allow allocating resources for interactive development.
| | Slurm | dstack |
|---|-------|--------|
| **Configuration** | Uses `salloc` command to allocate resources with a time limit; resources are automatically released when time expires | Uses `type: dev-environment` configurations as first-class citizen; provisions compute and runs until explicitly stopped (optional inactivity-based termination) |
| **IDE access** | Requires SSH access to allocated nodes | Native access using desktop IDEs (VS Code, Cursor, Windsurf, etc.) or SSH |
| **SSH access** | SSH to allocated nodes (host OS) using `SLURM_NODELIST` or `srun --pty` | SSH automatically configured; access via run name alias (inside container) |
### Slurm
Slurm uses `salloc` to allocate resources with a time limit. `salloc` returns a shell on the login node with environment variables set; use `srun` or SSH to access compute nodes. After the time limit expires, resources are automatically released:
```shell
$ salloc --nodes=1 --gres=gpu:1 --time=4:00:00
salloc: Granted job allocation 12346
$ srun --pty bash
[user@compute-node-01 ~]$ python train.py --epochs=1
Training epoch 1...
[user@compute-node-01 ~]$ exit
exit
$ exit
exit
salloc: Relinquishing job allocation 12346
```
Alternatively, SSH directly to allocated nodes using hostnames from `SLURM_NODELIST`:
```shell
$ ssh $SLURM_NODELIST
[user@compute-node-01 ~]$
```
### dstack
`dstack` uses the `dev-environment` configuration type, which provisions an instance automatically and runs until explicitly stopped, with optional inactivity-based termination. Access is provided via native desktop IDEs (VS Code, Cursor, Windsurf, etc.) or SSH:
```yaml
type: dev-environment
name: ml-dev
python: 3.12
ide: vscode
resources:
  gpu: A100:80GB:1
  memory: 200GB
# Optional: Maximum runtime duration (stops after this time)
max_duration: 8h
# Optional: Auto-stop after period of inactivity (no SSH/IDE connections)
inactivity_duration: 2h
# Optional: Auto-stop if GPU utilization is below threshold
utilization_policy:
  min_gpu_utilization: 10  # Percentage
  time_window: 1h
```
Start the dev environment:
```shell
$ dstack apply -f dev.dstack.yml
BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 9xCPU, 48GB, A5000:24GB yes $0.11
Submit the run ml-dev? [y/n]: y
Launching `ml-dev`...
---> 100%
To open in VS Code Desktop, use this link:
vscode://vscode-remote/ssh-remote+ml-dev/workflow
```
#### Port forwarding
`dstack` tasks support exposing `ports` for running interactive applications like Jupyter notebooks or Streamlit apps:
=== "Jupyter"
```yaml
type: task
name: jupyter
python: 3.12
commands:
  - pip install jupyterlab
  - jupyter lab --allow-root
ports:
  - 8888
resources:
  gpu: 1
  memory: 32GB
```
=== "Streamlit"
```yaml
type: task
name: streamlit-app
python: 3.12
commands:
  - pip install streamlit
  - streamlit hello
ports:
  - 8501
resources:
  gpu: 1
  memory: 32GB
```
While `dstack apply` is attached, ports are automatically forwarded to `localhost` (e.g., `http://localhost:8888` for Jupyter, `http://localhost:8501` for Streamlit).
## Job arrays
### Slurm job arrays
Slurm provides native job arrays (`--array=1-100`) that create multiple job tasks from a single submission. Job arrays can be specified via CLI argument or in the job script.
```shell
$ sbatch --array=1-100 train.sh
Submitted batch job 1001
```
Each task can use the `$SLURM_ARRAY_TASK_ID` environment variable within the job script to determine its configuration. Output files can use `%A` for the job ID and `%a` for the task ID in `#SBATCH --output` and `--error` directives.
### dstack
`dstack` does not support native job arrays. Submit multiple runs programmatically via CLI or API. Pass a custom environment variable (e.g., `TASK_ID`) to identify each run:
```shell
$ for i in {1..100}; do
    # -y skips the confirmation prompt so the loop runs unattended
    dstack apply -f train.dstack.yml \
      --name "train-array-task-${i}" \
      --env TASK_ID=${i} \
      --detach -y
  done
```
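Inside the training script, the task index still has to be turned into a concrete configuration. The sketch below shows one way to do that, assuming a hypothetical hyperparameter grid (`LEARNING_RATES`, `BATCH_SIZES`) and 1-based indices as in the examples above. It reads `SLURM_ARRAY_TASK_ID` under Slurm and falls back to the custom `TASK_ID` variable under `dstack`.

```python
import itertools
import os

# Hypothetical hyperparameter grid; replace with your own values.
LEARNING_RATES = [1e-4, 3e-4, 1e-3]
BATCH_SIZES = [32, 64]


def task_index(environ=os.environ):
    """Return the 1-based task index from Slurm or from the custom TASK_ID variable."""
    raw = environ.get("SLURM_ARRAY_TASK_ID") or environ.get("TASK_ID")
    if raw is None:
        raise RuntimeError("Neither SLURM_ARRAY_TASK_ID nor TASK_ID is set")
    return int(raw)


def config_for(index):
    """Map a 1-based task index onto one (learning rate, batch size) pair."""
    grid = list(itertools.product(LEARNING_RATES, BATCH_SIZES))
    # Wrap around so indices beyond the grid size reuse earlier combinations.
    lr, batch_size = grid[(index - 1) % len(grid)]
    return {"lr": lr, "batch_size": batch_size}
```

A training script would call `config_for(task_index())` once at startup. Since the same code works under both schedulers, the script itself does not need to change when migrating.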
## Environment variables and secrets
Both Slurm and `dstack` handle sensitive data (API keys, tokens, passwords) for ML workloads. Slurm uses environment variables or files, while `dstack` provides encrypted secrets management in addition to environment variables.
### Slurm
Slurm uses OS-level authentication. Jobs run with the user's UID/GID and inherit the environment from the login node. No built-in secrets management; users manage credentials in their environment or shared files.
Set environment variables in the shell before submitting. `sbatch` propagates the submission environment by default (`--export=ALL`); the flag is shown here explicitly:
```shell
$ export HF_TOKEN=$(cat ~/.hf_token)
$ sbatch --export=ALL train.sh
Submitted batch job 12346
```
### dstack
In addition to environment variables (`env`), `dstack` provides a secrets management system with encryption. Secrets are referenced in configuration using `${{ secrets.name }}` syntax.
Set secrets:
```shell
$ dstack secret set huggingface_token
$ dstack secret set wandb_api_key
```
Use secrets in configuration:
```yaml
type: task
name: train-with-secrets
python: 3.12
repos:
  - .
env:
  - HF_TOKEN=${{ secrets.huggingface_token }}
  - WANDB_API_KEY=${{ secrets.wandb_api_key }}
commands:
  - pip install huggingface_hub
  - huggingface-cli download meta-llama/Llama-2-7b-hf
  - wandb login
  - python train.py
resources:
  gpu: A100:80GB:8
```
## Authentication
### Slurm
Slurm uses OS-level authentication. Users authenticate via SSH to login nodes using their Unix accounts. Jobs run with the user's UID/GID, ensuring user isolation—users cannot access other users' files or processes. Slurm enforces file permissions based on Unix UID/GID and association limits (MaxJobs, MaxSubmitJobs) configured per user or account.
### dstack
`dstack` uses token-based authentication. Users are registered within projects on the server, and each user is issued a token. This token is used for authentication with all CLI and API commands. Access is controlled at the project level with user roles:
| Role | Permissions |
|------|-------------|
| **Admin** | Can manage project settings, including backends, gateways, and members |
| **Manager** | Can manage project members but cannot configure backends and gateways |
| **User** | Can manage project resources including runs, fleets, and volumes |
`dstack` manages SSH keys on the server for secure access to runs and instances. User SSH keys are automatically generated and used when attaching to runs via `dstack attach` or `dstack apply`. Project SSH keys are used by the server to establish SSH connections to provisioned instances.
!!! note "Multi-tenancy isolation"
`dstack` currently does not offer full isolation for multi-tenancy. Users may access global resources within the host.
## Monitoring and observability
Both systems provide tools to monitor job/run status, cluster/node status, resource metrics, and logs:
| | Slurm | dstack |
|---|-------|--------|
| **Job/run status** | `squeue` lists jobs in queue | `dstack ps` lists active runs |
| **Cluster/node status** | `sinfo` shows node availability | `dstack fleet` lists instances |
| **CPU/memory metrics** | `sstat` for running jobs | `dstack metrics` for real-time metrics |
| **GPU metrics** | Requires SSH to nodes, `nvidia-smi` per node | Automatic collection via `nvidia-smi`/`amd-smi`, `dstack metrics` |
| **Job history** | `sacct` for completed jobs | `dstack ps -n NUM` shows run history |
| **Logs** | Written to files (`--output`, `--error`) | Streamed via API, `dstack logs` |
### Slurm
Slurm provides command-line tools for monitoring cluster state, jobs, and history.
Check node status:
```shell
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 1-00:00:00 10 idle gpu-node[01-10]
```
Check job queue:
```shell
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES
12345 gpu training user1 R 2:30 2
```
Check job details:
```shell
$ scontrol show job 12345
JobId=12345 JobName=training
UserId=user1(1001) GroupId=users(100)
NumNodes=2 NumCPUs=64 NumTasks=32
Gres=gpu:8(IDX:0,1,2,3,4,5,6,7)
```
Check resource usage for running jobs (`sstat` only works for running jobs):
```shell
$ sstat --jobs=12345 --format=JobID,MaxRSS,MaxVMSize,AveCPU
JobID MaxRSS MaxVMSize AveCPU
12345.0 2048M 4096M 02:22:48
```
Check GPU usage (run `nvidia-smi` on the allocated node, here via `srun --jobid`; SSH to the node also works where permitted):
```shell
$ srun --jobid=12345 --pty nvidia-smi
GPU 0: 95% utilization, 72GB/80GB memory
```
Check job history for completed jobs:
```shell
$ sacct --job=12345 --format=JobID,Elapsed,MaxRSS,State,ExitCode
JobID Elapsed MaxRSS State ExitCode
12345 2:30:00 2048M COMPLETED 0:0
```
View logs (written to files via `--output` and `--error` flags; typically in the submission directory on a shared filesystem):
```shell
$ cat slurm-12345.out
Training started...
Epoch 1/10: loss=0.5
```
If logs are on compute nodes, find the node from `scontrol show job`, then access via `srun --jobid` (running jobs) or SSH (completed jobs):
```shell
$ srun --jobid=12345 --nodelist=gpu-node01 --pty bash
$ cat slurm-12345.out
```
### dstack
`dstack` automatically collects essential metrics (CPU, memory, GPU utilization) using vendor utilities (`nvidia-smi`, `amd-smi`, etc.) and provides real-time monitoring via CLI.
List runs:
```shell
$ dstack ps
NAME BACKEND GPU PRICE STATUS SUBMITTED
training-job aws H100:1 (spot) $4.50 running 5 mins ago
```
List fleets and instances (shows GPU health status):
```shell
$ dstack fleet
FLEET INSTANCE BACKEND RESOURCES STATUS PRICE CREATED
my-fleet 0 aws (us-east-1) T4:16GB:1 idle $0.526 11 mins ago
1 aws (us-east-1) T4:16GB:1 idle (warning) $0.526 11 mins ago
```
Check real-time metrics:
```shell
$ dstack metrics training-job
NAME STATUS CPU MEMORY GPU
training-job running 45% 16.27GB/200GB gpu=0 mem=72.48GB/80GB util=95%
```
Stream logs (stored centrally using external storage services like CloudWatch Logs or GCP Logging, accessible via CLI and UI):
```shell
$ dstack logs training-job
Training started...
Epoch 1/10: loss=0.5
```
#### Prometheus integration
`dstack` exports additional metrics to Prometheus:
| Metric type | Description |
|-------------|-------------|
| **Fleet metrics** | Instance duration, price, GPU count |
| **Run metrics** | Run counters (total, terminated, failed, done) |
| **Job metrics** | Execution time, cost, CPU/memory/GPU usage |
| **DCGM telemetry** | Temperature, ECC errors, PCIe replay counters, NVLink errors |
| **Server health** | HTTP request metrics |
To enable Prometheus export, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable and configure Prometheus to scrape metrics from `/metrics`.
> GPU health monitoring is covered in the [GPU health monitoring](#gpu-health-monitoring) section below.
## Fault tolerance, checkpointing, and retry
Both systems support fault tolerance for long-running training jobs that may be interrupted by hardware failures, spot instance terminations, or other issues:
| | Slurm | dstack |
|---|-------|--------|
| **Retry** | `--requeue` flag requeues jobs on node failure (hardware crash) or preemption, not application failures (software crashes); all nodes requeued together (all-or-nothing) | `retry` property with `on_events` (`error`, `interruption`) and `duration`; all jobs stopped and run resubmitted if any job fails (all-or-nothing) |
| **Graceful stop** | Grace period with `SIGTERM` before `SIGKILL`; `--signal` sends signal before time limit (e.g., `--signal=B:USR1@300`) | Not supported |
| **Checkpointing** | Application-based; save to shared filesystem | Application-based; save to persistent volumes |
| **Instance health** | `HealthCheckProgram` in `slurm.conf` runs custom scripts (DCGM/RVS) that drain the node on failure (excludes from new scheduling, running jobs continue) | Automatic GPU health monitoring via DCGM; unhealthy instances excluded from scheduling |
### Slurm
Slurm handles three types of failures: system failures (hardware crash), application failures (software crash), and preemption.
Enable automatic requeue on node failure (not application failures). For distributed jobs, if one node fails, the entire job is requeued (all-or-nothing):
```bash
#!/bin/bash
#SBATCH --job-name=train-with-checkpoint
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --time=48:00:00
#SBATCH --requeue # Requeue on node failure only
srun python train.py
```
Preempted jobs receive `SIGTERM` during a grace period before `SIGKILL` and are typically requeued automatically. Use `--signal` to send a custom signal before the time limit expires:
```bash
#!/bin/bash
#SBATCH --job-name=train-with-checkpoint
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --time=48:00:00
#SBATCH --signal=B:USR1@300 # Send USR1 5 minutes before time limit
trap 'python save_checkpoint.py --checkpoint-dir=/shared/checkpoints' USR1
if [ -f /shared/checkpoints/latest.pt ]; then
RESUME_FLAG="--resume /shared/checkpoints/latest.pt"
fi
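# Run srun in the background and wait: bash defers the USR1 trap
# until a foreground command returns, but `wait` is interrupted immediately
srun python train.py \
  --checkpoint-dir=/shared/checkpoints \
  $RESUME_FLAG &
wait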
```
Checkpoints are saved to a shared filesystem. Applications must implement checkpointing logic.
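If the training process itself should react to the warning signal, one option is to drop the `B:` prefix (`--signal=USR1@300`), so that Slurm signals the job steps rather than only the batch shell, and to handle `SIGUSR1` in Python. The sketch below assumes that setup. `train`, `step_fn`, and `save_checkpoint` are hypothetical names standing in for your own training loop.

```python
import signal

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag; the training loop does the actual save."""
    global stop_requested
    stop_requested = True


def train(total_steps, save_checkpoint, step_fn=lambda step: None):
    """Run up to total_steps; save and stop early once SIGUSR1 has arrived."""
    signal.signal(signal.SIGUSR1, request_stop)
    for step in range(total_steps):
        step_fn(step)  # placeholder for one training step
        if stop_requested:
            save_checkpoint(step)
            return step
    return total_steps
```

Keeping the handler minimal and saving from the loop avoids writing a checkpoint in the middle of an optimizer step.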
Custom health checks are configured via `HealthCheckProgram` in `slurm.conf`:
```bash
HealthCheckProgram=/shared/scripts/gpu_health_check.sh
```
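The health check script is expected to drain the node itself when a check fails, for example via `scontrol update`:
```bash
#!/bin/bash
# Drain this node if the quick DCGM diagnostic fails
# (assumes the Slurm node name matches the short hostname)
if ! dcgmi diag -r 1; then
  scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="DCGM diagnostic failed"
  exit 1
fi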
```
Drained nodes are excluded from new scheduling, but running jobs continue until completion.
### dstack
`dstack` handles three types of failures: provisioning failures (`no-capacity`), job failures (`error`), and interruptions (`interruption`). The `error` event is triggered by application failures (non-zero exit code) and instance unreachable issues. The `interruption` event is triggered by spot instance terminations and network/hardware issues.
By default, runs fail immediately. Enable retry via the `retry` property to handle these events:
```yaml
type: task
name: train-with-checkpoint-retry
nodes: 4
python: 3.12
repos:
  - .
volumes:
  # Use instance volumes (host directories) or network volumes (cloud-managed persistent storage)
  - name: checkpoint-volume
    path: /checkpoints
commands:
  - |
    if [ -f /checkpoints/latest.pt ]; then
      RESUME_FLAG="--resume /checkpoints/latest.pt"
    fi
    python train.py \
      --checkpoint-dir=/checkpoints \
      $RESUME_FLAG
resources:
  gpu: A100:80GB:8
  memory: 200GB
spot_policy: auto
retry:
  on_events: [error, interruption]
  duration: 48h
```
For distributed tasks, if any job fails and retry is enabled, all jobs are stopped and the run is resubmitted (all-or-nothing).
Unlike Slurm, `dstack` does not support graceful shutdown signals. Applications must implement proactive checkpointing (periodic saves) and check for existing checkpoints on startup to resume after retries.
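A minimal sketch of that pattern follows, assuming a JSON file stands in for the real model state (a training script would typically use its framework's own save and load calls). The function names and the `latest.json` file name are illustrative only. The state is written to a temporary file and then renamed, so an interruption during a save never leaves a truncated checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(checkpoint_dir, state):
    """Write to a temp file, then rename it so readers never see a partial file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, os.path.join(checkpoint_dir, "latest.json"))


def load_checkpoint(checkpoint_dir):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, "latest.json")
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(checkpoint_dir, total_steps, save_every=100):
    """Resume from the last checkpoint and save every `save_every` steps."""
    state = load_checkpoint(checkpoint_dir)
    for step in range(state["step"], total_steps):
        # ... one training step would go here ...
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(checkpoint_dir, state)
    return state
```

After a retry, the run loses at most `save_every` steps of work, so the interval is a trade-off between checkpoint overhead and recomputation.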
## GPU health monitoring
Both systems monitor GPU health to prevent degraded hardware from affecting workloads:
| | Slurm | dstack |
|---|-------|--------|
| **Health checks** | Custom scripts (DCGM/RVS) via `HealthCheckProgram` in `slurm.conf`; typically active diagnostics (`dcgmi diag`) or passive health watches | Automatic DCGM health watches (passive, continuous monitoring) |
| **Failure handling** | Health check script drains the node on failure (excludes from new scheduling, running jobs continue); status: DRAIN/DRAINED | Unhealthy instances excluded from scheduling; status shown in `dstack fleet`: `idle` (healthy), `idle (warning)`, `idle (failure)` |
### Slurm
Configure custom health check scripts via `HealthCheckProgram` in `slurm.conf`. Scripts typically use DCGM diagnostics (`dcgmi diag`) for NVIDIA GPUs or RVS for AMD GPUs:
```bash
HealthCheckProgram=/shared/scripts/gpu_health_check.sh
```
```bash
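#!/bin/bash
# DCGM diagnostic for NVIDIA GPUs; drain this node if it fails
# (assumes the Slurm node name matches the short hostname)
if ! dcgmi diag -r 1; then
  scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="DCGM diagnostic failed"
  exit 1
fi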
```
Drained nodes are excluded from new scheduling, but running jobs continue until completion.
### dstack
`dstack` automatically monitors GPU health using DCGM background health checks on instances with NVIDIA GPUs. This is supported on cloud backends where DCGM is pre-installed automatically (or comes with users' `os_images`), and on SSH fleets where the DCGM packages (`datacenter-gpu-manager-4-core`, `datacenter-gpu-manager-4-proprietary`, `datacenter-gpu-manager-exporter`) are installed on the hosts.
> AMD GPU health monitoring is not supported yet.
Health status is displayed in `dstack fleet`:
```shell
$ dstack fleet
FLEET INSTANCE BACKEND RESOURCES STATUS PRICE CREATED
my-fleet 0 aws (us-east-1) T4:16GB:1 idle $0.526 11 mins ago
1 aws (us-east-1) T4:16GB:1 idle (warning) $0.526 11 mins ago
2 aws (us-east-1) T4:16GB:1 idle (failure) $0.526 11 mins ago
```
Health status:
| Status | Description |
|--------|-------------|
| `idle` | Healthy, no issues detected |
| `idle (warning)` | Non-fatal issues (e.g., correctable ECC errors); instance still usable |
| `idle (failure)` | Fatal issues (uncorrectable ECC, PCIe failures); instance excluded from scheduling |
GPU health metrics are also exported to Prometheus (see [Prometheus integration](#prometheus-integration)).
## Job dependencies
Job dependencies enable chaining tasks together, ensuring that downstream jobs only run after upstream jobs complete.
### Slurm dependencies
Slurm provides native dependency support via `--dependency` flags. Dependencies are managed by Slurm:
| Dependency type | Description |
|----------------|-------------|
| **`afterok`** | Runs only if the dependency job finishes with Exit Code 0 (success) |
| **`afterany`** | Runs regardless of success or failure (useful for cleanup jobs) |
| **`aftercorr`** | For array jobs, allows corresponding tasks to start as soon as the matching task in the dependency array completes (e.g., Task 1 of Array B starts when Task 1 of Array A finishes, without waiting for the entire Array A) |
| **`singleton`** | Based on job name and user (not job IDs), ensures only one job with the same name runs at a time for that user (useful for serializing access to shared resources) |
Submit a job that depends on another job completing successfully:
```shell
$ JOB_TRAIN=$(sbatch train.sh | awk '{print $4}')
$ sbatch --dependency=afterok:$JOB_TRAIN evaluate.sh
Submitted batch job 1002
```
Submit a job with singleton dependency (only one job with this name runs at a time):
```shell
$ sbatch --job-name=ModelTraining --dependency=singleton train.sh
Submitted batch job 1004
```
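When jobs are chained from a script rather than by hand, the same two steps apply: read the job ID from the `sbatch` output, then pass it to `--dependency`. The sketch below assumes the `Submitted batch job <id>` message shown above. `parse_job_id` and `dependent_command` are hypothetical helper names.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)


def dependent_command(script, job_id, dep_type="afterok"):
    """Build the sbatch argv for a job that depends on job_id."""
    return ["sbatch", f"--dependency={dep_type}:{job_id}", script]
```

The output of `subprocess.run(["sbatch", "train.sh"], capture_output=True, text=True).stdout` can be passed to `parse_job_id`, and the resulting argv list to a second `subprocess.run` call.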
### dstack { #dstack-workflow-orchestration }
`dstack` does not support native job dependencies. Use external workflow orchestration tools (Airflow, Prefect, etc.) to implement dependencies.
=== "Prefect"
```python
from prefect import flow, task
import subprocess

@task
def train_model():
    """Submit training job and wait for completion"""
    subprocess.run(
        # -y skips the confirmation prompt in non-interactive runs
        ["dstack", "apply", "-y", "-f", "train.dstack.yml", "--name", "train-run"],
        check=True  # Raises exception if training fails
    )
    return "train-run"

@task
def evaluate_model(run_name):
    """Submit evaluation job after training succeeds"""
    subprocess.run(
        ["dstack", "apply", "-y", "-f", "evaluate.dstack.yml", "--name", f"eval-{run_name}"],
        check=True
    )

@flow
def ml_pipeline():
    train_run = train_model()
    evaluate_model(train_run)
```
=== "Airflow"
```python
from airflow.decorators import dag, task
from datetime import datetime
import subprocess

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def ml_training_pipeline():
    @task
    def train(**context):
        """Submit training job and wait for completion"""
        # Airflow passes context variables (such as `ds`) as keyword arguments
        run_name = f"train-{context['ds']}"
        subprocess.run(
            # -y skips the confirmation prompt in non-interactive runs
            ["dstack", "apply", "-y", "-f", "train.dstack.yml", "--name", run_name],
            check=True  # Raises exception if training fails
        )
        return run_name

    @task
    def evaluate(run_name, **context):
        """Submit evaluation job after training succeeds"""
        eval_name = f"eval-{run_name}"
        subprocess.run(
            ["dstack", "apply", "-y", "-f", "evaluate.dstack.yml", "--name", eval_name],
            check=True
        )

    # Define task dependencies - train() completes before evaluate() starts
    train_run = train()
    evaluate(train_run)

ml_training_pipeline()
```
## Heterogeneous jobs
Heterogeneous jobs (het jobs) allow a single job to request different resource configurations for different components (e.g., GPU nodes for training, high-memory CPU nodes for preprocessing). This is an edge case used for coordinated multi-component workflows.
### Slurm
Slurm supports heterogeneous jobs by separating components with `#SBATCH hetjob` lines in the batch script; `srun --het-group` then selects the component for each job step. Each component can specify different resources:
```bash
#!/bin/bash
#SBATCH --job-name=ml-pipeline
# Component 0: GPU nodes for training
#SBATCH --nodes=2 --gres=gpu:8 --mem=200G
#SBATCH hetjob
# Component 1: high-memory CPU node for preprocessing
#SBATCH --nodes=1 --mem=500G --partition=highmem
# Launch one job step per component and wait for both
srun --het-group=0 python train.py &
srun --het-group=1 python preprocess.py &
wait
```
### dstack
`dstack` does not support heterogeneous jobs natively. Use separate runs with [workflow orchestration tools (Prefect, Airflow)](#dstack-workflow-orchestration) or submit multiple runs programmatically to coordinate components with different resource requirements.
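Submitting the components programmatically could look like the sketch below. It assumes one configuration file per component (the file and run names are hypothetical) and relies on an attached `dstack apply` blocking until its run finishes, as in the orchestration examples above. The `runner` parameter exists only so the logic can be exercised without the CLI.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def apply_command(config_file, run_name):
    """Build the dstack CLI call for one component (-y skips the confirmation prompt)."""
    return ["dstack", "apply", "-f", config_file, "--name", run_name, "-y"]


def run_components(components, runner=subprocess.run):
    """Submit all components at once and wait for every one to finish.

    `components` is a list of (config_file, run_name) pairs.
    """
    commands = [apply_command(cfg, name) for cfg, name in components]
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [pool.submit(runner, cmd, check=True) for cmd in commands]
        # result() re-raises if any component exited with a non-zero status
        return [future.result() for future in futures]
```

For example, `run_components([("train.dstack.yml", "train"), ("preprocess.dstack.yml", "preprocess")])` starts a GPU run and a CPU run side by side, each with its own `resources` block.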
## What's next?
1. Check out [Quickstart](../../quickstart.md)
2. Read about [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), and [services](../../concepts/services.md)
3. Browse the [examples](../../examples.md)
# docs/examples/training/trl.md
---
title: TRL
description: Fine-tuning Llama with TRL — single-node SFT with QLoRA, or distributed across multiple nodes with FSDP and DeepSpeed
---
# TRL
This example walks you through how to use [TRL](https://github.com/huggingface/trl) with `dstack` to fine-tune `Llama-3.1-8B` — on a single node with SFT and QLoRA, or distributed across multiple nodes with [Accelerate](https://github.com/huggingface/accelerate) and [DeepSpeed](https://github.com/deepspeedai/DeepSpeed).
## Single-node training
### Define a configuration
Below is a task configuration that does fine-tuning.
```yaml
type: task
name: trl-train
python: 3.12
# Ensure nvcc is installed (req. for Flash Attention)
nvcc: true
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - HUB_MODEL_ID
commands:
  # Pin torch==2.6.0 to avoid building Flash Attention from source.
  # Prebuilt Flash Attention wheels are not available for the latest torch==2.7.0.
  - uv pip install torch==2.6.0
  - uv pip install transformers bitsandbytes peft wandb
  - uv pip install flash_attn --no-build-isolation
  - git clone https://github.com/huggingface/trl
  - cd trl
  - uv pip install .
  - |
    accelerate launch \
      --config_file=examples/accelerate_configs/multi_gpu.yaml \
      --num_processes $DSTACK_GPUS_PER_NODE \
      trl/scripts/sft.py \
      --model_name meta-llama/Meta-Llama-3.1-8B \
      --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
      --dataset_text_field="text" \
      --per_device_train_batch_size 1 \
      --per_device_eval_batch_size 1 \
      --gradient_accumulation_steps 4 \
      --learning_rate 2e-4 \
      --report_to wandb \
      --bf16 \
      --max_seq_length 1024 \
      --lora_r 16 \
      --lora_alpha 32 \
      --lora_target_modules q_proj k_proj v_proj o_proj \
      --load_in_4bit \
      --use_peft \
      --attn_implementation "flash_attention_2" \
      --logging_steps=10 \
      --output_dir models/llama31 \
      --hub_model_id $HUB_MODEL_ID
resources:
  gpu:
    # 24GB or more VRAM
    memory: 24GB..
    # One or more GPU
    count: 1..
  # Shared memory (for multi-gpu)
  shm_size: 24GB
```
Change the `resources` property to specify more GPUs.
!!! info "AMD"
The example above uses NVIDIA accelerators. To use it with AMD, check out [AMD](../accelerators/amd.md#trl).
??? info "DeepSpeed"
For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3.
To do this, use the `examples/accelerate_configs/deepspeed_zero3.yaml` configuration file instead of
`examples/accelerate_configs/multi_gpu.yaml`.
### Run the configuration
Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the
cloud resources and run the configuration.
```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ HUB_MODEL_ID=...
$ dstack apply -f train.dstack.yml
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 vastai (cz-czechia) cpu=64 mem=128GB H100:80GB:2 18794506 $3.8907
2 vastai (us-texas) cpu=52 mem=64GB H100:80GB:2 20442365 $3.6926
3 vastai (fr-france) cpu=64 mem=96GB H100:80GB:2 20379984 $3.7389
Submit the run trl-train? [y/n]:
Provisioning...
---> 100%
```
## Distributed training
!!! info "Prerequisites"
Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](../../concepts/fleets.md#cluster-placement) or an [SSH fleet](../../concepts/fleets.md#ssh-placement)).
### Define a configuration
Once the fleet is created, define a distributed task configuration. Here's an example using either FSDP or DeepSpeed ZeRO-3.
=== "FSDP"
```yaml
type: task
name: trl-train-fsdp-distrib
nodes: 2
image: nvcr.io/nvidia/pytorch:25.01-py3
env:
  - HF_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY
  - MODEL_ID=meta-llama/Llama-3.1-8B
  - HUB_MODEL_ID
commands:
  - pip install transformers bitsandbytes peft wandb
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - |
    accelerate launch \
      --config_file=examples/accelerate_configs/fsdp1.yaml \
      --main_process_ip=$DSTACK_MASTER_NODE_IP \
      --main_process_port=8008 \
      --machine_rank=$DSTACK_NODE_RANK \
      --num_processes=$DSTACK_GPUS_NUM \
      --num_machines=$DSTACK_NODES_NUM \
      trl/scripts/sft.py \
      --model_name $MODEL_ID \
      --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
      --dataset_text_field="text" \
      --per_device_train_batch_size 1 \
      --per_device_eval_batch_size 1 \
      --gradient_accumulation_steps 4 \
      --learning_rate 2e-4 \
      --report_to wandb \
      --bf16 \
      --max_seq_length 1024 \
      --attn_implementation flash_attention_2 \
      --logging_steps=10 \
      --output_dir /checkpoints/llama31-ft \
      --hub_model_id $HUB_MODEL_ID \
      --torch_dtype bfloat16
resources:
  gpu: 80GB:8
  shm_size: 128GB
volumes:
  - /checkpoints:/checkpoints
```
=== "DeepSpeed ZeRO-3"
```yaml
type: task
name: trl-train-deepspeed-distrib
nodes: 2
image: nvcr.io/nvidia/pytorch:25.01-py3
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - HUB_MODEL_ID
  - MODEL_ID=meta-llama/Llama-3.1-8B
  - ACCELERATE_LOG_LEVEL=info
commands:
  - pip install transformers bitsandbytes peft wandb deepspeed
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - |
    accelerate launch \
      --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
      --main_process_ip=$DSTACK_MASTER_NODE_IP \
      --main_process_port=8008 \
      --machine_rank=$DSTACK_NODE_RANK \
      --num_processes=$DSTACK_GPUS_NUM \
      --num_machines=$DSTACK_NODES_NUM \
      trl/scripts/sft.py \
      --model_name $MODEL_ID \
      --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
      --dataset_text_field="text" \
      --per_device_train_batch_size 1 \
      --per_device_eval_batch_size 1 \
      --gradient_accumulation_steps 4 \
      --learning_rate 2e-4 \
      --report_to wandb \
      --bf16 \
      --max_seq_length 1024 \
      --attn_implementation flash_attention_2 \
      --logging_steps=10 \
      --output_dir /checkpoints/llama31-ft \
      --hub_model_id $HUB_MODEL_ID \
      --torch_dtype bfloat16
resources:
  gpu: 80GB:8
  shm_size: 128GB
volumes:
  - /checkpoints:/checkpoints
```
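Both configurations launch one process per GPU across all nodes (`--num_processes=$DSTACK_GPUS_NUM`), so the number of samples per optimizer step grows with the cluster size. The arithmetic below is a sketch assuming standard data-parallel behavior, where every process consumes its own micro-batches. `global_batch_size` is a hypothetical helper, not part of TRL or Accelerate.

```python
def global_batch_size(per_device_batch, grad_accum_steps, nodes, gpus_per_node):
    """Samples contributing to one optimizer step across all processes."""
    world_size = nodes * gpus_per_node  # one process per GPU
    return per_device_batch * grad_accum_steps * world_size
```

With the values above (`--per_device_train_batch_size 1`, `--gradient_accumulation_steps 4`, 2 nodes with 8 GPUs each), `global_batch_size(1, 4, 2, 8)` gives 64 samples per step. When changing `nodes`, the learning rate or accumulation steps may need adjusting to keep training behavior comparable.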
!!! info "Docker image"
We are using `nvcr.io/nvidia/pytorch:25.01-py3` from NGC because it includes the necessary libraries and packages for RDMA and InfiniBand support.
### Run the configuration
To run a configuration, use the [`dstack apply`](../../reference/cli/dstack/apply.md) command.
```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ HUB_MODEL_ID=...
$ dstack apply -f train-distrib.dstack.yml
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
Submit the run trl-train-fsdp-distrib? [y/n]: y
Provisioning...
---> 100%
```
## What's next?
1. Check [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md),
[services](../../concepts/services.md), and [fleets](../../concepts/fleets.md)
2. Read about [cluster placement](../../concepts/fleets.md#cluster-placement)
3. See the [AMD](../accelerators/amd.md#trl) example
# docs/examples/training/axolotl.md
---
title: Axolotl
description: Fine-tuning Llama models with Axolotl — single-node SFT with FSDP and QLoRA, or distributed across multiple nodes
---
# Axolotl
This example shows how to use [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) with `dstack` to fine-tune Llama models — on a single node with SFT, FSDP, and QLoRA, or distributed across multiple nodes.
## Single-node training
This section walks through fine-tuning 4-bit quantized `Llama-4-Scout-17B-16E` using SFT with FSDP and QLoRA.
### Define a configuration
Axolotl reads the model, QLoRA, and dataset arguments, as well as the trainer configuration, from the [`scout-qlora-flexattn-fsdp2.yaml`](https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-4/scout-qlora-flexattn-fsdp2.yaml) file. The configuration uses Axolotl's 4-bit quantized version of `meta-llama/Llama-4-Scout-17B-16E`, which requires only ~43GB of VRAM per GPU at a 4K context length.
Below is a task configuration that does fine-tuning.
```yaml
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-nvidia-llama-scout-train
# Using Axolotl's official Docker image
image: axolotlai/axolotl:main-latest
# Required environment variables
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - WANDB_PROJECT
  - HUB_MODEL_ID
# Commands of the task
commands:
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-flexattn-fsdp2.yaml
  - |
    axolotl train scout-qlora-flexattn-fsdp2.yaml \
      --wandb-project $WANDB_PROJECT \
      --wandb-name $DSTACK_RUN_NAME \
      --hub-model-id $HUB_MODEL_ID
resources:
  # Four GPUs (required by FSDP)
  gpu: H100:4
  # Shared memory size for inter-process communication
  shm_size: 64GB
  disk: 500GB..
```
The task uses Axolotl's Docker image, where Axolotl is already pre-installed.
!!! info "AMD"
The example above uses NVIDIA accelerators. To use it with AMD, check out [AMD](../accelerators/amd.md#axolotl).
### Run the configuration
Once the configuration is ready, pass it to `dstack apply -f`, and `dstack` will automatically provision the
cloud resources and run the configuration.
```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ HUB_MODEL_ID=...
$ dstack apply -f train.dstack.yml
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 vastai (cz-czechia) cpu=64 mem=128GB H100:80GB:2 18794506 $3.8907
2 vastai (us-texas) cpu=52 mem=64GB H100:80GB:2 20442365 $3.6926
3 vastai (fr-france) cpu=64 mem=96GB H100:80GB:2 20379984 $3.7389
Submit the run axolotl-nvidia-llama-scout-train? [y/n]:
Provisioning...
---> 100%
```
## Distributed training
!!! info "Prerequisites"
Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](../../concepts/fleets.md#cluster-placement) or an [SSH fleet](../../concepts/fleets.md#ssh-placement)).
This section walks through running distributed fine-tuning of `Llama-3.1-70B` with QLoRA and FSDP across multiple nodes.
### Define a configuration
Once the fleet is created, define a distributed task configuration. Here's an example of a distributed `QLoRA` task using `FSDP`.
```yaml
type: task
name: axolotl-multi-node-qlora-llama3-70b
nodes: 2
image: nvcr.io/nvidia/pytorch:25.01-py3
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - WANDB_PROJECT
  - HUB_MODEL_ID
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  - NCCL_DEBUG=INFO
  - ACCELERATE_LOG_LEVEL=info
commands:
  # Replacing the default Torch and FlashAttention in the NGC container with Axolotl-compatible versions.
  # The preinstalled versions are incompatible with Axolotl.
  - pip uninstall -y torch flash-attn
  - pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
  - pip install --no-build-isolation axolotl[flash-attn,deepspeed]
  - wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
  # Axolotl includes hf-xet version 1.1.0, which fails during downloads. Replacing it with the latest version (1.1.2).
  - pip uninstall -y hf-xet
  - pip install hf-xet --no-cache-dir
  - |
    accelerate launch \
      --config_file=fsdp1.yaml \
      -m axolotl.cli.train qlora-fsdp-70b.yaml \
      --hub-model-id $HUB_MODEL_ID \
      --output-dir /checkpoints/qlora-llama3-70b \
      --wandb-project $WANDB_PROJECT \
      --wandb-name $DSTACK_RUN_NAME \
      --main_process_ip=$DSTACK_MASTER_NODE_IP \
      --main_process_port=8008 \
      --machine_rank=$DSTACK_NODE_RANK \
      --num_processes=$DSTACK_GPUS_NUM \
      --num_machines=$DSTACK_NODES_NUM
resources:
  gpu: 80GB:8
  shm_size: 128GB
volumes:
  - /checkpoints:/checkpoints
```
!!! info "Docker image"
We are using `nvcr.io/nvidia/pytorch:25.01-py3` from NGC because it includes the necessary libraries and packages for RDMA and InfiniBand support.
### Run the configuration
To run a configuration, use the [`dstack apply`](../../reference/cli/dstack/apply.md) command.
```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ HUB_MODEL_ID=...
$ dstack apply -f train-distrib.dstack.yml
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
Submit the run axolotl-multi-node-qlora-llama3-70b? [y/n]: y
Provisioning...
---> 100%
```
## What's next?
1. Check [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md),
[services](../../concepts/services.md), and [fleets](../../concepts/fleets.md)
2. Read about [cluster placement](../../concepts/fleets.md#cluster-placement)
3. See the [AMD](../accelerators/amd.md#axolotl) example
# docs/examples/training/ray-ragen.md
---
title: Ray + RAGEN
description: Multi-node agent fine-tuning using RAGEN with Ray and verl for reinforcement learning
---
# Ray + RAGEN
This example shows how to use `dstack` and [RAGEN](https://github.com/RAGEN-AI/RAGEN)
to fine-tune an agent on multiple nodes.
Under the hood, `RAGEN` uses [verl](https://github.com/volcengine/verl) for reinforcement learning and [Ray](https://docs.ray.io/en/latest/) for distributed training.
!!! info "Prerequisites"
Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](../../concepts/fleets.md#cluster-placement) or an [SSH fleet](../../concepts/fleets.md#ssh-placement)).
## Run a Ray cluster
If you want to use Ray with `dstack`, you have to first run a Ray cluster.
The task below runs a Ray cluster on an existing fleet:
```yaml
type: task
name: ray-cluster
nodes: 2
env:
  - WANDB_API_KEY
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
commands:
  - wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - bash miniconda.sh -b -p /workflow/miniconda
  - eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
  - git clone https://github.com/RAGEN-AI/RAGEN.git
  - cd RAGEN
  - bash scripts/setup_ragen.sh
  - conda activate ragen
  - cd verl
  - pip install --no-deps -e .
  - pip install hf_transfer hf_xet
  - pip uninstall -y ray
  - pip install -U "ray[default]"
  - |
    if [ $DSTACK_NODE_RANK = 0 ]; then
      ray start --head --port=6379;
    else
      ray start --address=$DSTACK_MASTER_NODE_IP:6379
    fi
# Expose Ray dashboard port
ports:
  - 8265
resources:
  gpu: 80GB:8
  shm_size: 128GB
# Save checkpoints on the instance
volumes:
  - /checkpoints:/checkpoints
```
We are using verl's Docker image for vLLM with FSDP. See [Installation](https://verl.readthedocs.io/en/latest/start/install.html) for more.
The `RAGEN` setup script `scripts/setup_ragen.sh` isolates dependencies within a Conda environment.
Note that the Ray setup in the RAGEN environment is missing the dashboard, so we reinstall it using `ray[default]`.
Now, if you run this task via `dstack apply`, it will automatically forward Ray's dashboard port to `localhost:8265`.
```shell
$ dstack apply -f ray-cluster.dstack.yml
```
As long as `dstack apply` is attached, you can use `localhost:8265` to submit Ray jobs for execution.
If `dstack apply` is detached, you can use `dstack attach` to re-attach.
## Submit Ray jobs
Before you can submit Ray jobs, make sure `ray` is installed locally:
```shell
$ pip install ray
```
Now you can submit the training job to the Ray cluster which is available at `localhost:8265`:
```shell
$ export RAY_ADDRESS=http://localhost:8265
$ ray job submit \
-- bash -c "\
export PYTHONPATH=/workflow/RAGEN; \
cd /workflow/RAGEN; \
/workflow/miniconda/envs/ragen/bin/python train.py \
--config-name base \
system.CUDA_VISIBLE_DEVICES=[0,1,2,3,4,5,6,7] \
model_path=Qwen/Qwen2.5-7B-Instruct \
trainer.experiment_name=agent-fine-tuning-Qwen2.5-7B \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
micro_batch_size_per_gpu=2 \
trainer.default_local_dir=/checkpoints \
trainer.save_freq=50 \
actor_rollout_ref.rollout.tp_size_check=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=4"
```
!!! info "Training parameters"
1. `actor_rollout_ref.rollout.tensor_model_parallel_size=4`, because `Qwen/Qwen2.5-7B-Instruct` has 28 attention heads and the number of attention heads must be divisible by `tensor_model_parallel_size`.
2. `actor_rollout_ref.rollout.tp_size_check=False`; if `True`, `tensor_model_parallel_size` must be equal to `trainer.n_gpus_per_node`.
3. `micro_batch_size_per_gpu=2`, to keep the RAGEN paper's `rollout_filter_ratio` and `es_manager` settings unchanged for a world size of `16`.
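
As a rough illustration of the first two parameters, the sketch below lists the tensor-parallel sizes that divide 28 attention heads. The helper name `valid_tp_sizes` is hypothetical, and the extra requirement that the size also divide the number of GPUs per node is an assumption made for this sketch, not something RAGEN or verl is confirmed to enforce in this form.

```shell
#!/bin/sh
# Hypothetical helper: print tensor-parallel sizes that divide both the
# number of attention heads ($1) and the number of GPUs per node ($2).
valid_tp_sizes() {
  heads=$1
  gpus=$2
  tp=1
  out=""
  while [ "$tp" -le "$gpus" ]; do
    if [ $((heads % tp)) -eq 0 ] && [ $((gpus % tp)) -eq 0 ]; then
      out="${out:+$out }$tp"
    fi
    tp=$((tp + 1))
  done
  echo "$out"
}

# Qwen2.5-7B-Instruct: 28 attention heads, 8 GPUs per node
valid_tp_sizes 28 8   # prints: 1 2 4
```

Because 8 does not divide 28, a size equal to `trainer.n_gpus_per_node` is not an option here, which is consistent with disabling the check and using `4`.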
Using Ray via `dstack` is a powerful way to get access to the rich Ray ecosystem while benefiting from `dstack`'s provisioning capabilities.
!!! info "What's next"
1. Read about [distributed tasks](../../concepts/tasks.md#distributed-tasks), [fleets](../../concepts/fleets.md), and [cluster placement](../../concepts/fleets.md#cluster-placement)
2. Browse Ray's [docs](https://docs.ray.io/en/latest/train/examples.html) for other examples.
# docs/examples/clusters/aws.md
---
title: AWS
description: High-performance distributed training on AWS using Elastic Fabric Adapter (EFA)
---
# AWS
In this guide, we'll walk through how to run high-performance distributed training on AWS using [Amazon Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/) with `dstack`.
## Overview
EFA is a network interface for Amazon EC2 that enables low-latency, high-bandwidth inter-node communication — essential for scaling distributed deep learning. With `dstack`, EFA is automatically enabled when you create fleets with supported instance types.
## Prerequisite
Before you start, make sure the `aws` backend is properly configured.
```yaml
projects:
- name: main
  backends:
  - type: aws
    creds:
      type: default
    regions: ["us-west-2"]
```
!!! info "VPC"
If you use a custom VPC, verify that it permits all internal traffic between nodes for EFA to function properly.
## Create a fleet
Once your backend is ready, define a fleet configuration.
```yaml
type: fleet
name: efa-fleet
nodes: 2
placement: cluster
resources:
  gpu: H100:8
```
Provision the fleet with `dstack apply`:
```shell
$ dstack apply -f efa-fleet.dstack.yml
Provisioning...
---> 100%
FLEET INSTANCE BACKEND INSTANCE TYPE GPU PRICE STATUS CREATED
efa-fleet 0 aws (us-west-2) p5.48xlarge H100:80GB:8 $98.32 idle 3 mins ago
1 aws (us-west-2) p5.48xlarge H100:80GB:8 $98.32 idle 3 mins ago
```
??? info "Instance types"
`dstack` selects suitable instances automatically, but not
[all types support EFA](https://aws.amazon.com/hpc/efa/).
To enforce EFA, you can specify `instance_types` explicitly:
```yaml
type: fleet
name: efa-fleet
nodes: 2
placement: cluster
resources:
  gpu: L4
instance_types: ["g6.8xlarge"] # If not specified, g6.xlarge is used (won't have EFA)
```
## Run NCCL tests
To confirm that EFA is working, run NCCL tests:
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
resources:
  gpu: 1..8
  shm_size: 16GB
```
Run it with `dstack apply`:
```shell
$ dstack apply -f nccl-tests.dstack.yml
Provisioning...
---> 100%
```
!!! info "Docker image"
You can use your own container by setting `image`. If omitted, `dstack` uses its default image with drivers, NCCL tests, and tools pre-installed.
## Run distributed training
Here’s an example using `torchrun` for a simple multi-node PyTorch job:
```yaml
type: task
name: train-distrib
nodes: 2
python: 3.12
env:
  - NCCL_DEBUG=INFO
commands:
  - git clone https://github.com/pytorch/examples.git pytorch-examples
  - cd pytorch-examples/distributed/ddp-tutorial-series
  - uv pip install -r requirements.txt
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      --master-port=12345 \
      multinode.py 50 10
resources:
  gpu: 1..8
  shm_size: 16GB
```
Provision and launch it via `dstack apply`.
```shell
$ dstack apply -f train-distrib.dstack.yml
Provisioning...
---> 100%
```
Instead of setting `python`, you can specify your own Docker image using `image`. Make sure that the image is properly configured for EFA.
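
To make the relationship between the `torchrun` flags in the task above concrete, here is a small sketch of the rank arithmetic. The helper names are hypothetical. It assumes one process per GPU and the usual `torchrun` layout, where a process's global rank is `node_rank * nproc_per_node + local_rank`.

```shell
#!/bin/sh
# Hypothetical helpers illustrating torchrun's rank arithmetic.

# world_size <nodes> <gpus_per_node>: total number of processes in the job
world_size() {
  echo $(( $1 * $2 ))
}

# global_rank <node_rank> <gpus_per_node> <local_rank>
global_rank() {
  echo $(( $1 * $2 + $3 ))
}

# Two nodes with eight GPUs each, as a fleet of 2 x H100:8 would provide
world_size 2 8      # prints: 16
global_rank 1 8 0   # prints: 8 (first process on the second node)
```

Inside a run, the same product is what `dstack` exposes as `DSTACK_GPUS_NUM`, given `DSTACK_NODES_NUM` and `DSTACK_GPUS_PER_NODE`.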
!!! info "What's next"
1. Learn more about [distributed tasks](../../concepts/tasks.md#distributed-tasks) and [cluster placement](../../concepts/fleets.md#cluster-placement)
2. Check [dev environments](../../concepts/dev-environments.md),
[services](../../concepts/services.md), and [fleets](../../concepts/fleets.md)
# docs/examples/clusters/gcp.md
---
title: GCP
description: Creating and using GPU clusters on GCP with GPUDirect-TCPX and RoCE support
---
# GCP
This example shows how to create and use clusters on GCP.
`dstack` supports the following instance types:
| Instance type | GPU | Maximum bandwidth | Fabric |
| ------------- | ------ | ----------------- | ---------------------------------------------------------------------------------------------------------------- |
| **A3 Edge** | H100:8 | 0.8 Tbps | [GPUDirect-TCPX](https://cloud.google.com/compute/docs/gpus/gpudirect) |
| **A3 High** | H100:8 | 1 Tbps | [GPUDirect-TCPX](https://cloud.google.com/compute/docs/gpus/gpudirect) |
| **A3 Mega** | H100:8 | 1.8 Tbps | [GPUDirect-TCPXO](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot) |
| **A4** | B200:8 | 3.2 Tbps | RoCE |
## Configure the backend
Despite hiding most of the complexity, `dstack` still requires instance-specific backend configuration:
=== "A4"
A4 requires one `extra_vpcs` for inter-node traffic (regular VPC, one subnet) and one `roce_vpcs` for GPU-to-GPU communication (RoCE profile, eight subnets).
```yaml
projects:
- name: main
  backends:
  - type: gcp
    # Specify your GCP project ID
    project_id:
    extra_vpcs: [dstack-gvnic-net-1]
    roce_vpcs: [dstack-mrdma]
    # Specify the regions you intend to use
    regions: [us-west2]
    creds:
      type: default
```
Create extra and RoCE VPCs
See GCP's [RoCE network setup guide](https://cloud.google.com/ai-hypercomputer/docs/create/create-vm#setup-network) for the commands to create
VPCs and firewall rules.
Ensure VPCs allow internal traffic between nodes for MPI/NCCL to function.
=== "A3 Mega"
A3 Mega requires 8 `extra_vpcs` for data NICs.
```yaml
projects:
- name: main
  backends:
  - type: gcp
    # Specify your GCP project ID
    project_id:
    extra_vpcs:
      - dstack-gpu-data-net-1
      - dstack-gpu-data-net-2
      - dstack-gpu-data-net-3
      - dstack-gpu-data-net-4
      - dstack-gpu-data-net-5
      - dstack-gpu-data-net-6
      - dstack-gpu-data-net-7
      - dstack-gpu-data-net-8
    # Specify the regions you intend to use
    regions: [europe-west4]
    creds:
      type: default
```
Create extra VPCs
Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule:
```shell
# Specify the region where you intend to deploy the cluster
REGION="europe-west4"
for N in $(seq 1 8); do
gcloud compute networks create dstack-gpu-data-net-$N \
--subnet-mode=custom \
--mtu=8244
gcloud compute networks subnets create dstack-gpu-data-sub-$N \
--network=dstack-gpu-data-net-$N \
--region=$REGION \
--range=192.168.$N.0/24
gcloud compute firewall-rules create dstack-gpu-data-internal-$N \
--network=dstack-gpu-data-net-$N \
--action=ALLOW \
--rules=tcp:0-65535,udp:0-65535,icmp \
--source-ranges=192.168.0.0/16
done
```
=== "A3 High/Edge"
A3 Edge/High require at least 4 `extra_vpcs` for data NICs and a `vm_service_account` authorized to pull GPUDirect Docker images.
```yaml
projects:
- name: main
  backends:
  - type: gcp
    # Specify your GCP project ID
    project_id:
    extra_vpcs:
      - dstack-gpu-data-net-1
      - dstack-gpu-data-net-2
      - dstack-gpu-data-net-3
      - dstack-gpu-data-net-4
    # Specify the regions you intend to use
    regions: [europe-west4]
    # Specify your GCP project ID
    vm_service_account: a3cluster-sa@$.iam.gserviceaccount.com
    creds:
      type: default
```
Create extra VPCs
Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule:
```shell
# Specify the region where you intend to deploy the cluster
REGION="europe-west4"
for N in $(seq 1 4); do
gcloud compute networks create dstack-gpu-data-net-$N \
--subnet-mode=custom \
--mtu=8244
gcloud compute networks subnets create dstack-gpu-data-sub-$N \
--network=dstack-gpu-data-net-$N \
--region=$REGION \
--range=192.168.$N.0/24
gcloud compute firewall-rules create dstack-gpu-data-internal-$N \
--network=dstack-gpu-data-net-$N \
--action=ALLOW \
--rules=tcp:0-65535,udp:0-65535,icmp \
--source-ranges=192.168.0.0/16
done
```
Create a service account
Create a VM service account that allows VMs to access the `pkg.dev` registry:
```shell
PROJECT_ID=$(gcloud config get-value project)
gcloud iam service-accounts create a3cluster-sa \
--display-name "Service Account for pulling GCR images"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:a3cluster-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
--role="roles/artifactregistry.reader"
```
!!! info "Default VPC"
If you set a non-default `vpc_name` in the backend configuration, ensure it allows all inter-node traffic. This is required for MPI and NCCL. The default VPC already allows this.
## Create a fleet
Once you've configured the `gcp` backend, create the fleet configuration:
=== "A4"
```yaml
type: fleet
name: a4-fleet
placement: cluster
# Can be a range or a fixed number
nodes: 2
# Specify the zone where you have configured the RoCE VPC
availability_zones: [us-west2-c]
backends: [gcp]
# Uncomment to allow spot instances
#spot_policy: auto
resources:
  gpu: B200:8
```
Then apply it with `dstack apply`:
```shell
$ dstack apply -f a4-fleet.dstack.yml
Provisioning...
---> 100%
FLEET INSTANCE BACKEND GPU PRICE STATUS CREATED
a4-fleet 0 gcp (us-west2) B200:180GB:8 (spot) $51.552 idle 9 mins ago
1 gcp (us-west2) B200:180GB:8 (spot) $51.552 idle 9 mins ago
```
=== "A3 Mega"
```yaml
type: fleet
name: a3mega-fleet
placement: cluster
# Can be a range or a fixed number
nodes: 2
instance_types:
- a3-megagpu-8g
# Uncomment to allow spot instances
#spot_policy: auto
```
Pass the configuration to `dstack apply`:
```shell
$ dstack apply -f a3mega-fleet.dstack.yml
FLEET INSTANCE BACKEND GPU PRICE STATUS CREATED
a3mega-fleet 1 gcp (europe-west4) H100:80GB:8 $22.1525 (spot) idle 9 mins ago
a3mega-fleet 2 gcp (europe-west4) H100:80GB:8 $64.2718 idle 9 mins ago
Create the fleet? [y/n]: y
Provisioning...
---> 100%
```
=== "A3 High/Edge"
```yaml
type: fleet
name: a3high-fleet
placement: cluster
nodes: 2
instance_types:
- a3-highgpu-8g
# Uncomment to allow spot instances
#spot_policy: auto
```
Pass the configuration to `dstack apply`:
```shell
$ dstack apply -f a3high-fleet.dstack.yml
FLEET INSTANCE BACKEND GPU PRICE STATUS CREATED
a3high-fleet 1 gcp (europe-west4) H100:80GB:8 $20.5688 (spot) idle 9 mins ago
a3high-fleet 2 gcp (europe-west4) H100:80GB:8 $58.5419 idle 9 mins ago
Create the fleet? [y/n]: y
Provisioning...
---> 100%
```
Once the fleet is created, you can run distributed tasks, in addition to dev environments, services, and regular tasks.
## Run tasks
### NCCL tests
Use a distributed task that runs NCCL tests to validate cluster network bandwidth.
=== "A4"
Pass the configuration to `dstack apply`:
```shell
$ dstack apply -f nccl-tests.dstack.yml
Provisioning...
---> 100%
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
size count type redop root time algbw busbw wrong time algbw busbw wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8388608 2097152 float sum -1 156.9 53.47 100.25 0 167.6 50.06 93.86 0
16777216 4194304 float sum -1 196.3 85.49 160.29 0 206.2 81.37 152.57 0
33554432 8388608 float sum -1 258.5 129.82 243.42 0 261.8 128.18 240.33 0
67108864 16777216 float sum -1 369.4 181.69 340.67 0 371.2 180.79 338.98 0
134217728 33554432 float sum -1 638.5 210.22 394.17 0 587.2 228.57 428.56 0
268435456 67108864 float sum -1 940.3 285.49 535.29 0 950.7 282.36 529.43 0
536870912 134217728 float sum -1 1695.2 316.70 593.81 0 1666.9 322.08 603.89 0
1073741824 268435456 float sum -1 3229.9 332.44 623.33 0 3201.8 335.35 628.78 0
2147483648 536870912 float sum -1 6107.7 351.61 659.26 0 6157.1 348.78 653.97 0
4294967296 1073741824 float sum -1 11952 359.36 673.79 0 11942 359.65 674.34 0
8589934592 2147483648 float sum -1 23563 364.55 683.52 0 23702 362.42 679.54 0
Out of bounds values : 0 OK
Avg bus bandwidth : 165.789
```
=== "A3 Mega"
```yaml
type: task
name: nccl-tests
nodes: 2
image: nvcr.io/nvidia/pytorch:24.04-py3
entrypoint: "bash -c" # Need to use bash instead of default dash for nccl-env-profile.sh
commands:
  - |
    # Setup TCPXO NCCL env variables
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile-ll128.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    # Build NCCL Tests
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    MPI=1 CC=mpicc CXX=mpicxx make -j
    cd build
    # We use FIFO for inter-node communication
    FIFO=/tmp/dstack_job
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      sleep 10
      echo "${DSTACK_NODES_IPS}" > hostfile
      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
      # Wait for other nodes
      while true; do
        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
          break
        fi
        echo 'Waiting for nodes...'
        sleep 5
      done
      # Run NCCL Tests
      ${MPIRUN} \
        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
        $(env | awk -F= '{print "-x", $1}' | xargs) \
        ./all_gather_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200 -c 0;
      # Notify nodes the job is done
      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
    else
      mkfifo ${FIFO}
      # Wait for a message from the first node
      cat ${FIFO}
    fi
spot_policy: auto
resources:
  shm_size: 16GB
```
Pass the configuration to `dstack apply`:
```shell
$ dstack apply -f nccl-tests.dstack.yml
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8388608 131072 float none -1 166.6 50.34 47.19 N/A 164.1 51.11 47.92 N/A
16777216 262144 float none -1 204.6 82.01 76.89 N/A 203.8 82.30 77.16 N/A
33554432 524288 float none -1 284.0 118.17 110.78 N/A 281.7 119.12 111.67 N/A
67108864 1048576 float none -1 447.4 150.00 140.62 N/A 443.5 151.31 141.86 N/A
134217728 2097152 float none -1 808.3 166.05 155.67 N/A 801.9 167.38 156.92 N/A
268435456 4194304 float none -1 1522.1 176.36 165.34 N/A 1518.7 176.76 165.71 N/A
536870912 8388608 float none -1 2892.3 185.62 174.02 N/A 2894.4 185.49 173.89 N/A
1073741824 16777216 float none -1 5532.7 194.07 181.94 N/A 5530.7 194.14 182.01 N/A
2147483648 33554432 float none -1 10863 197.69 185.34 N/A 10837 198.17 185.78 N/A
4294967296 67108864 float none -1 21481 199.94 187.45 N/A 21466 200.08 187.58 N/A
8589934592 134217728 float none -1 42713 201.11 188.54 N/A 42701 201.16 188.59 N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 146.948
```
=== "A3 High/Edge"
```yaml
type: task
name: nccl-tests
nodes: 2
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx
commands:
  - |
    export NCCL_DEBUG=INFO
    export LD_LIBRARY_PATH=/usr/local/tcpx/lib64:$LD_LIBRARY_PATH
    # We use FIFO for inter-node communication
    FIFO=/tmp/dstack_job
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      mkdir -p /scripts/hostfiles2
      : > /scripts/hostfiles2/hostfile8
      for ip in ${DSTACK_NODES_IPS}; do
        echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> /scripts/hostfiles2/hostfile8
      done
      MPIRUN='mpirun --allow-run-as-root --hostfile /scripts/hostfiles2/hostfile8'
      # Wait for other nodes
      while true; do
        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
          break
        fi
        echo 'Waiting for nodes...'
        sleep 5
      done
      # Run NCCL Tests
      NCCL_GPUDIRECTTCPX_FORCE_ACK=0 /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 8M 8GB 2
      # Notify nodes the job is done
      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
    else
      mkfifo ${FIFO}
      # Wait for a message from the first node
      cat ${FIFO}
    fi
spot_policy: auto
resources:
  shm_size: 16GB
```
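
The hostfile step in the task above can be tried in isolation. The sketch below is a minimal stand-alone version of that loop, with a hypothetical `make_hostfile` helper and made-up IP addresses. It assumes `DSTACK_NODES_IPS` holds one IP address per line, which is what the `for ip in ${DSTACK_NODES_IPS}` loop in the task relies on.

```shell
#!/bin/sh
# Hypothetical helper: turn a newline-delimited IP list ($1) into
# hostfile lines with a fixed number of slots per host ($2).
make_hostfile() {
  ips=$1
  slots=$2
  # Unquoted on purpose: word splitting yields one IP per iteration
  for ip in $ips; do
    echo "${ip} slots=${slots}"
  done
}

# Example values only; inside a run, dstack provides the real ones
make_hostfile "$(printf '10.0.0.1\n10.0.0.2')" 8
# prints:
# 10.0.0.1 slots=8
# 10.0.0.2 slots=8
```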
Pass the configuration to `dstack apply`:
```shell
$ dstack apply -f nccl-tests.dstack.yml
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8388608 131072 float none -1 784.9 10.69 10.02 0 775.9 10.81 10.14 0
16777216 262144 float none -1 1010.3 16.61 15.57 0 999.3 16.79 15.74 0
33554432 524288 float none -1 1161.6 28.89 27.08 0 1152.9 29.10 27.28 0
67108864 1048576 float none -1 1432.6 46.84 43.92 0 1437.8 46.67 43.76 0
134217728 2097152 float none -1 2516.9 53.33 49.99 0 2491.7 53.87 50.50 0
268435456 4194304 float none -1 5066.8 52.98 49.67 0 5131.4 52.31 49.04 0
536870912 8388608 float none -1 10028 53.54 50.19 0 10149 52.90 49.60 0
1073741824 16777216 float none -1 20431 52.55 49.27 0 20214 53.12 49.80 0
2147483648 33554432 float none -1 40254 53.35 50.01 0 39923 53.79 50.43 0
4294967296 67108864 float none -1 80896 53.09 49.77 0 78875 54.45 51.05 0
8589934592 134217728 float none -1 160505 53.52 50.17 0 160117 53.65 50.29 0
Out of bounds values : 0 OK
Avg bus bandwidth : 40.6043
```
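
When reading the outputs above, note that `busbw` is derived from `algbw` by a factor that depends on the collective and the number of ranks `n`. According to the performance notes in the nccl-tests repository, the factor is `2(n-1)/n` for all-reduce and `(n-1)/n` for all-gather. The sketch below uses a hypothetical `busbw` helper and assumes 16 ranks (2 nodes with 8 GPUs each) to recompute one row from the A4 all-reduce output and one from the A3 Mega all-gather output.

```shell
#!/bin/sh
# Hypothetical helper: busbw <collective> <algbw in GB/s> <ranks>
busbw() {
  awk -v c="$1" -v a="$2" -v n="$3" 'BEGIN {
    if (c == "all_reduce")      f = 2 * (n - 1) / n
    else if (c == "all_gather") f = (n - 1) / n
    else { print "unsupported collective: " c > "/dev/stderr"; exit 1 }
    printf "%.2f\n", a * f
  }'
}

# A4, all_reduce, 67108864-byte row: algbw 181.69 GB/s
busbw all_reduce 181.69 16   # prints: 340.67
# A3 Mega, all_gather, 8388608-byte row: algbw 50.34 GB/s
busbw all_gather 50.34 16    # prints: 47.19
```

Other rows can differ from the table in the last digit, because the printed `algbw` values are already rounded.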
### Distributed training
=== "A4"
You can use the standard [distributed task](../../concepts/tasks.md#distributed-tasks) example to run distributed training on A4 instances.
=== "A3 Mega"
You can use the standard [distributed task](../../concepts/tasks.md#distributed-tasks) example to run distributed training on A3 Mega instances. To enable GPUDirect-TCPXO, make sure the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl) are properly set, for example by adding the following commands at the beginning:
```yaml
# ...
commands:
  - |
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile-ll128.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    # ...
```
=== "A3 High/Edge"
You can use the standard [distributed task](../../concepts/tasks.md#distributed-tasks) example to run distributed training on A3 High/Edge instances. To enable GPUDirect-TCPX, make sure the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl) are properly set, for example by adding the following commands at the beginning:
```yaml
# ...
commands:
  - |
    export NCCL_DEBUG=INFO
    NCCL_LIB_DIR="/usr/local/tcpx/lib64"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_CROSS_NIC=0
    export NCCL_ALGO=Ring
    export NCCL_PROTO=Simple
    export NCCL_NSOCKS_PERTHREAD=4
    export NCCL_SOCKET_NTHREADS=1
    export NCCL_NET_GDR_LEVEL=PIX
    export NCCL_P2P_PXN_LEVEL=0
    export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
    export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
    export NCCL_DYNAMIC_CHUNK_SIZE=524288
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_P2P_PCI_CHUNKSIZE=524288
    export NCCL_P2P_NVL_CHUNKSIZE=1048576
    export NCCL_BUFFSIZE=4194304
    export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
    export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191"
    export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=50000
    export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX="/run/tcpx"
    # ...
```
In addition to distributed training, you can of course run regular tasks, dev environments, and services.
## What's next?
1. Learn about [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), and [services](../../concepts/services.md)
2. Read about [cluster placement](../../concepts/fleets.md#cluster-placement)
3. Check GCP's docs on using [A4](https://docs.cloud.google.com/compute/docs/gpus/create-gpu-vm-a3u-a4) and [A3 Mega/High/Edge](https://docs.cloud.google.com/compute/docs/gpus/gpudirect) instances
# docs/examples/clusters/lambda.md
---
title: Lambda
description: Setting up Lambda clusters using Kubernetes or 1-Click Clusters with fast interconnect
---
# Lambda
`dstack` supports Lambda clusters with fast interconnect in two ways:
* [Kubernetes](#kubernetes) – Create a Kubernetes cluster on Lambda, configure a `kubernetes` backend, and create a backend fleet in `dstack`; the cluster is then fully usable through `dstack`.
* [1-Click Clusters](#1-click-clusters) – Create a 1CC cluster on Lambda and an SSH fleet in `dstack`; the cluster is then fully usable through `dstack`.
## Kubernetes
### Prerequisites
1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
2. Go to `Firewall` → `Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
### Configure the backend
Follow the standard instructions for setting up a [Kubernetes](../../concepts/backends.md#kubernetes) backend:
```yaml
projects:
- name: main
  backends:
  - type: kubernetes
    kubeconfig:
      filename:
    proxy_jump:
      port: 30022
```
### Create a fleet
Once the Kubernetes cluster and the `dstack` server are running, you can create a fleet:
```yaml
type: fleet
name: lambda-fleet
placement: cluster
nodes: 0..
backends: [kubernetes]
resources:
  # Specify requirements to filter nodes
  gpu: 1..8
```
Pass the fleet configuration to `dstack apply`:
```shell
$ dstack apply -f lambda-fleet.dstack.yml
```
Once the fleet is created, you can run [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), and [services](../../concepts/services.md).
## 1-Click Clusters
Another way to work with Lambda clusters is through [1CC](https://lambda.ai/1-click-clusters). While `dstack` supports automated cluster provisioning via [VM-based backends](../../concepts/backends.md#vm-based), there is currently no programmatic way to provision Lambda 1CCs. As a result, to use a 1CC cluster with `dstack`, you must use [SSH fleets](../../concepts/fleets.md).
### Prerequisites
1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters.
### Create a fleet
Follow the standard instructions for setting up an [SSH fleet](../../concepts/fleets.md#ssh-fleets):
```yaml
type: fleet
name: lambda-fleet
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - worker-gpu-8x-b200-rplfm-ll9nr
    - worker-gpu-8x-b200-rplfm-qrcs9
  proxy_jump:
    hostname: 192.222.55.54
    user: ubuntu
    identity_file: ~/.ssh/id_rsa
placement: cluster
```
> Under `proxy_jump`, we specify the hostname of the head node along with the private SSH key.
Pass the fleet configuration to `dstack apply`:
```shell
$ dstack apply -f lambda-fleet.dstack.yml
```
Once the fleet is created, you can run [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), and [services](../../concepts/services.md).
## Run tasks
To run tasks on a cluster, you must use [distributed tasks](../../concepts/tasks.md#distributed-tasks).
### Run NCCL tests
To validate cluster network bandwidth, use the following task:
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -x NCCL_IB_HCA=^mlx5_0 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi
# Uncomment if the `kubernetes` backend requires it for `/dev/infiniband` access
#privileged: true
resources:
  gpu: nvidia:B200:8
  shm_size: 16GB
```
Pass the configuration to `dstack apply`:
```shell
$ dstack apply -f lambda-nccl-tests.dstack.yml
Provisioning...
---> 100%
nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
Collective test starting: all_reduce_perf
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 36.50 0.00 0.00 0 36.16 0.00 0.00 0
16 4 float sum -1 35.55 0.00 0.00 0 35.49 0.00 0.00 0
32 8 float sum -1 35.49 0.00 0.00 0 36.28 0.00 0.00 0
64 16 float sum -1 35.85 0.00 0.00 0 35.54 0.00 0.00 0
128 32 float sum -1 37.36 0.00 0.01 0 36.82 0.00 0.01 0
256 64 float sum -1 37.38 0.01 0.01 0 37.80 0.01 0.01 0
512 128 float sum -1 51.05 0.01 0.02 0 37.17 0.01 0.03 0
1024 256 float sum -1 45.33 0.02 0.04 0 37.98 0.03 0.05 0
2048 512 float sum -1 38.67 0.05 0.10 0 38.30 0.05 0.10 0
4096 1024 float sum -1 40.08 0.10 0.19 0 39.18 0.10 0.20 0
8192 2048 float sum -1 42.13 0.19 0.36 0 41.47 0.20 0.37 0
16384 4096 float sum -1 43.66 0.38 0.70 0 41.94 0.39 0.73 0
32768 8192 float sum -1 45.42 0.72 1.35 0 43.29 0.76 1.42 0
65536 16384 float sum -1 44.59 1.47 2.76 0 43.90 1.49 2.80 0
131072 32768 float sum -1 47.44 2.76 5.18 0 46.79 2.80 5.25 0
262144 65536 float sum -1 66.68 3.93 7.37 0 65.36 4.01 7.52 0
524288 131072 float sum -1 240.71 2.18 4.08 0 125.73 4.17 7.82 0
1048576 262144 float sum -1 115.58 9.07 17.01 0 115.48 9.08 17.03 0
2097152 524288 float sum -1 114.44 18.33 34.36 0 114.27 18.35 34.41 0
4194304 1048576 float sum -1 118.25 35.47 66.50 0 117.11 35.82 67.15 0
8388608 2097152 float sum -1 141.39 59.33 111.24 0 134.95 62.16 116.55 0
16777216 4194304 float sum -1 186.86 89.78 168.34 0 184.39 90.99 170.60 0
33554432 8388608 float sum -1 255.79 131.18 245.96 0 253.88 132.16 247.81 0
67108864 16777216 float sum -1 350.41 191.52 359.09 0 350.71 191.35 358.79 0
134217728 33554432 float sum -1 596.75 224.92 421.72 0 595.37 225.44 422.69 0
268435456 67108864 float sum -1 934.67 287.20 538.50 0 931.37 288.22 540.41 0
536870912 134217728 float sum -1 1625.63 330.25 619.23 0 1687.31 318.18 596.59 0
1073741824 268435456 float sum -1 2972.25 361.26 677.35 0 2971.33 361.37 677.56 0
2147483648 536870912 float sum -1 5784.75 371.23 696.06 0 5728.40 374.88 702.91 0
Out of bounds values : 0 OK
Avg bus bandwidth : 137.179
```
## What's next
1. Learn about [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), [services](../../concepts/services.md)
2. Read about the [Kubernetes backend](../../concepts/backends.md#kubernetes) and [cluster placement](../../concepts/fleets.md#cluster-placement)
3. Check Lambda's docs on [Kubernetes](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) and [1CC](https://docs.lambda.ai/public-cloud/1-click-clusters/)
# docs/examples/clusters/crusoe.md
---
title: Crusoe
description: Using Crusoe clusters with InfiniBand support via VMs or Kubernetes
---
# Crusoe
`dstack` allows you to use Crusoe clusters with fast interconnects in two ways:
* [VMs](#vms) – If you configure a `crusoe` backend in `dstack` by providing your Crusoe credentials, `dstack` lets you fully provision and use clusters through `dstack`.
* [Kubernetes](#kubernetes) – If you create a Kubernetes cluster on Crusoe and configure a `kubernetes` backend and create a backend fleet in `dstack`, `dstack` lets you fully use this cluster through `dstack`.
## VMs
Since `dstack` offers a VM-based backend that natively integrates with Crusoe, you only need to provide your Crusoe credentials to `dstack`, and it will allow you to fully provision and use clusters on Crusoe through `dstack`.
### Configure a backend
Log into your [Crusoe](https://console.crusoecloud.com/) console, create an API key under your account settings, and note your project ID.
```yaml
projects:
- name: main
  backends:
    - type: crusoe
      project_id: your-project-id
      creds:
        type: access_key
        access_key: your-access-key
        secret_key: your-secret-key
```
### Create a fleet
Once the backend is configured, you can create a fleet:
```yaml
type: fleet
name: crusoe-fleet
nodes: 2
placement: cluster
backends: [crusoe]
resources:
  gpu: A100:80GB:8
```
Pass the fleet configuration to `dstack apply`:
```shell
$ dstack apply -f crusoe-fleet.dstack.yml
```
This will automatically create an IB partition and provision instances with InfiniBand networking.
Once the fleet is created, you can run [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), and [services](../../concepts/services.md).
> If you want instances to be provisioned on demand, you can set `nodes` to `0..2`. In this case, `dstack` will create instances only when you run workloads.
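
As a rough illustration of the on-demand setup described above, the fleet configuration from this section could be adapted by changing only `nodes`. Everything else is copied from the example above, so treat this as a sketch rather than a tuned configuration:

```yaml
type: fleet
name: crusoe-fleet
# Start with no instances; dstack creates up to two when workloads run
nodes: 0..2
placement: cluster
backends: [crusoe]
resources:
  gpu: A100:80GB:8
```

With this range, the fleet can stay empty between runs, while the upper bound of `2` still leaves room for the two-node tasks used later on this page.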
## Kubernetes
### Create a cluster
1. Go to `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
3. Go to the cluster and click `Create Node Pool`. Select the appropriate instance type and set `Desired Number of Nodes`.
4. Wait until nodes are provisioned.
> Even if you enable `autoscaling`, `dstack` can use only the nodes that are already provisioned.
### Configure the backend
Follow the standard instructions for setting up a [`kubernetes`](../../concepts/backends.md#kubernetes) backend:
```yaml
projects:
- name: main
  backends:
    - type: kubernetes
      kubeconfig:
        filename:
      proxy_jump:
        port: 30022
```
### Create a fleet
Once the Crusoe Managed Kubernetes cluster and the `dstack` server are running, you can create a fleet:
```yaml
type: fleet
name: crusoe-fleet
placement: cluster
nodes: 0..
backends: [kubernetes]
resources:
  # Specify requirements to filter nodes
  gpu: 8
```
Pass the fleet configuration to `dstack apply`:
```shell
$ dstack apply -f crusoe-fleet.dstack.yml
```
Once the fleet is created, you can run [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), and [services](../../concepts/services.md).
## NCCL tests
Use a [distributed task](../../concepts/tasks.md#distributed-tasks) that runs NCCL tests to validate cluster network bandwidth.
=== "VMs"
With the Crusoe backend, HPC-X and NCCL topology files are pre-installed on the host VM image. Mount them into the container via [instance volumes](../../concepts/volumes.md#instance-volumes).
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
volumes:
  - /opt/hpcx:/opt/hpcx
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo
commands:
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_HCA=^mlx5_0:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi
backends: [crusoe]
resources:
  gpu: A100:80GB:8
  shm_size: 16GB
```
> Update `NCCL_TOPO_FILE` to match your instance type. Topology files for all supported types are available at `/etc/crusoe/nccl_topo/` on the host.
=== "Kubernetes"
If you're running on Crusoe Managed Kubernetes, make sure to install HPC-X and provide an up-to-date topology file.
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
commands:
  # Install NCCL topology files
  - curl -sSL https://gist.github.com/un-def/48df8eea222fa9547ad4441986eb15af/archive/df51d56285c5396a0e82bb42f4f970e7bb0a9b65.tar.gz -o nccl_topo.tar.gz
  - mkdir -p /etc/crusoe/nccl_topo
  - tar -C /etc/crusoe/nccl_topo -xf nccl_topo.tar.gz --strip-components=1
  # Install and initialize HPC-X
  - curl -sSL https://content.mellanox.com/hpc/hpc-x/v2.21.3/hpcx-v2.21.3-gcc-doca_ofed-ubuntu22.04-cuda12-x86_64.tbz -o hpcx.tar.bz
  - mkdir -p /opt/hpcx
  - tar -C /opt/hpcx -xf hpcx.tar.bz --strip-components=1 --checkpoint=10000
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  # Run NCCL Tests
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_AR_THRESHOLD=0 \
        -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
        -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
        -x NCCL_IB_QPS_PER_CONNECTION=2 \
        -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
        -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi
# Required for IB
privileged: true
resources:
  gpu: A100:8
  shm_size: 16GB
```
> The task above downloads an A100 topology file from a Gist. The most reliable way to obtain the latest topology is to copy it from a Crusoe-provisioned VM (see [VMs](#vms)).
??? info "Privileged"
When running on Crusoe Managed Kubernetes, set `privileged` to `true` to ensure access to InfiniBand.
Pass the configuration to `dstack apply`:
```shell
$ dstack apply -f crusoe-nccl-tests.dstack.yml
```
## What's next
1. Learn about [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), [services](../../concepts/services.md)
2. Check out [backends](../../concepts/backends.md#crusoe-cloud) and [fleets](../../concepts/fleets.md#cloud-fleets)
3. Check the docs on [Crusoe's networking](https://docs.crusoecloud.com/networking/infiniband/) and ["Crusoe Managed" Kubernetes](https://docs.crusoecloud.com/orchestration/cmk/index.html)
# docs/examples/clusters/nebius.md
---
title: Nebius
description: Using Nebius clusters with InfiniBand support via VMs or Kubernetes
---
# Nebius
`dstack` allows you to use Nebius clusters with fast interconnects in two ways:
* [VMs](#vms) – If you configure a `nebius` backend in `dstack` by providing your Nebius credentials, `dstack` lets you fully provision and use clusters through `dstack`.
* [Kubernetes](#kubernetes) – If you create a Kubernetes cluster on Nebius and configure a `kubernetes` backend and create a backend fleet in `dstack`, `dstack` lets you fully use this cluster through `dstack`.
## VMs
Since `dstack` offers a VM-based backend that natively integrates with Nebius, you only need to provide your Nebius credentials to `dstack`, and it will allow you to fully provision and use clusters on Nebius through `dstack`.
### Configure a backend
You can configure the `nebius` backend using a credentials file [generated](https://docs.nebius.com/iam/service-accounts/authorized-keys#create) by the `nebius` CLI:
```shell
$ nebius iam auth-public-key generate \
--service-account-id <service account ID> \
--output ~/.nebius/sa-credentials.json
```
```yaml
projects:
- name: main
  backends:
    - type: nebius
      creds:
        type: service_account
        filename: ~/.nebius/sa-credentials.json
```
### Create a fleet
Once the backend is configured, you can create a fleet:
```yaml
type: fleet
name: nebius-fleet
nodes: 2
placement: cluster
backends: [nebius]
resources:
  gpu: H100:8
```
Pass the fleet configuration to `dstack apply`:
```shell
$ dstack apply -f nebius-fleet.dstack.yml
```
This will automatically create a Nebius cluster and provision instances.
Once the fleet is created, you can run [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), and [services](../../concepts/services.md).
> If you want instances to be provisioned on demand, you can set `nodes` to `0..2`. In this case, `dstack` will create instances only when you run workloads.
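
To make the on-demand option above concrete, here is a sketch of the same fleet with only `nodes` changed. All other values are copied from the example above and are not tuned for any particular workload:

```yaml
type: fleet
name: nebius-fleet
# Start with no instances; dstack creates up to two when workloads run
nodes: 0..2
placement: cluster
backends: [nebius]
resources:
  gpu: H100:8
```

The upper bound of `2` matches the two-node NCCL test used later on this page.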
## Kubernetes
If you’d like to use `dstack` with Nebius’s managed Kubernetes service, point `dstack` to the cluster’s kubeconfig file, and `dstack` will let you fully use this cluster.
### Create a cluster
1. Go to `Compute` → `Kubernetes` and click `Create cluster`. Make sure to enable `Public endpoint`.
2. Go to `Node groups` and click `Create node group`. Make sure to enable `Assign public IPv4 addresses` and `Install NVIDIA GPU drivers and other components`. Select the appropriate instance type, specify the `Number of nodes`, and set `Node storage` to at least `120 GiB`. Make sure to click `Create` under `GPU cluster` if you plan to use a fast interconnect.
3. Go to `Applications`, find `NVIDIA Device Plugin`, and click `Deploy`.
4. Wait until the nodes are provisioned.
> Even if you enable `autoscaling`, `dstack` can use only the nodes that are already provisioned. To provision instances on demand, use [VMs](#vms) (see above).
#### Configure the kubeconfig file
1. Click `How to connect` and copy the `nebius` CLI command that configures the `kubeconfig` file.
2. Install the `nebius` CLI and run the command:
```shell
$ nebius mk8s cluster get-credentials --id <cluster id> --external
```
### Configure a backend
Follow the standard instructions for setting up a [`kubernetes`](../../concepts/backends.md#kubernetes) backend:
```yaml
projects:
- name: main
  backends:
    - type: kubernetes
      kubeconfig:
        filename:
```
### Create a fleet
Once the cluster and the `dstack` server are running, you can create a fleet:
```yaml
type: fleet
name: nebius-fleet
placement: cluster
nodes: 0..
backends: [kubernetes]
resources:
  # Specify requirements to filter nodes
  gpu: 8
```
Pass the fleet configuration to `dstack apply`:
```shell
$ dstack apply -f nebius-fleet.dstack.yml
```
Once the fleet is created, you can run [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), and [services](../../concepts/services.md).
## NCCL tests
Use a [distributed task](../../concepts/tasks.md#distributed-tasks) to run NCCL tests and validate the cluster’s network bandwidth.
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
# Required for `/dev/infiniband` access
privileged: true
resources:
  gpu: 8
  shm_size: 16GB
```
Pass the configuration to `dstack apply`:
```shell
$ dstack apply -f nebius-nccl-tests.dstack.yml
Provisioning...
---> 100%
nccl-tests provisioning completed (running)
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 45.72 0.00 0.00 0 29.78 0.00 0.00 0
16 4 float sum -1 29.92 0.00 0.00 0 29.42 0.00 0.00 0
32 8 float sum -1 30.10 0.00 0.00 0 29.75 0.00 0.00 0
64 16 float sum -1 34.48 0.00 0.00 0 29.36 0.00 0.00 0
128 32 float sum -1 30.38 0.00 0.01 0 29.67 0.00 0.01 0
256 64 float sum -1 30.48 0.01 0.02 0 29.97 0.01 0.02 0
512 128 float sum -1 30.45 0.02 0.03 0 30.85 0.02 0.03 0
1024 256 float sum -1 31.36 0.03 0.06 0 31.29 0.03 0.06 0
2048 512 float sum -1 32.27 0.06 0.12 0 32.26 0.06 0.12 0
4096 1024 float sum -1 36.04 0.11 0.21 0 43.17 0.09 0.18 0
8192 2048 float sum -1 37.24 0.22 0.41 0 35.54 0.23 0.43 0
16384 4096 float sum -1 37.22 0.44 0.83 0 34.55 0.47 0.89 0
32768 8192 float sum -1 43.82 0.75 1.40 0 35.64 0.92 1.72 0
65536 16384 float sum -1 37.85 1.73 3.25 0 37.55 1.75 3.27 0
131072 32768 float sum -1 43.10 3.04 5.70 0 53.08 2.47 4.63 0
262144 65536 float sum -1 58.59 4.47 8.39 0 63.33 4.14 7.76 0
524288 131072 float sum -1 97.88 5.36 10.04 0 83.91 6.25 11.72 0
1048576 262144 float sum -1 87.08 12.04 22.58 0 77.82 13.47 25.26 0
2097152 524288 float sum -1 99.06 21.17 39.69 0 97.67 21.47 40.26 0
4194304 1048576 float sum -1 110.14 38.08 71.40 0 114.66 36.58 68.59 0
8388608 2097152 float sum -1 154.48 54.30 101.82 0 156.03 53.76 100.80 0
16777216 4194304 float sum -1 210.33 79.77 149.56 0 200.98 83.48 156.52 0
33554432 8388608 float sum -1 274.23 122.36 229.43 0 276.45 121.38 227.58 0
67108864 16777216 float sum -1 472.43 142.05 266.35 0 480.00 139.81 262.14 0
134217728 33554432 float sum -1 759.58 176.70 331.31 0 756.21 177.49 332.79 0
268435456 67108864 float sum -1 1305.66 205.59 385.49 0 1303.37 205.95 386.16 0
536870912 134217728 float sum -1 2379.38 225.63 423.06 0 2373.42 226.20 424.13 0
1073741824 268435456 float sum -1 4511.97 237.98 446.21 0 4513.82 237.88 446.02 0
2147483648 536870912 float sum -1 8776.26 244.69 458.80 0 8760.42 245.13 459.63 0
4294967296 1073741824 float sum -1 17407.8 246.73 462.61 0 17302.2 248.23 465.44 0
8589934592 2147483648 float sum -1 34448.4 249.36 467.54 0 34381.0 249.85 468.46 0
Out of bounds values : 0 OK
Avg bus bandwidth : 125.499
Collective test concluded: all_reduce_perf
```
## What's next
1. Learn about [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), [services](../../concepts/services.md)
2. Check out [backends](../../concepts/backends.md) and [fleets](../../concepts/fleets.md)
3. Read Nebius' docs on [networking for VMs](https://docs.nebius.com/compute/clusters/gpu) and the [managed Kubernetes service](https://docs.nebius.com/kubernetes).
# docs/examples/clusters/nccl-rccl-tests.md
---
title: NCCL/RCCL tests
description: Running NCCL and RCCL tests to validate cluster network bandwidth
---
# NCCL/RCCL tests
This example shows how to run [NCCL](https://github.com/NVIDIA/nccl-tests) or [RCCL](https://github.com/ROCm/rccl-tests) tests on a cluster using [distributed tasks](../../concepts/tasks.md#distributed-tasks).
!!! info "Prerequisites"
Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](../../concepts/fleets.md#cluster-placement) or an [SSH fleet](../../concepts/fleets.md#ssh-placement)).
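
As a minimal sketch of the prerequisite above, a managed fleet with cluster placement might look like the following. The fleet name is hypothetical, and the node count and GPU range are assumptions chosen to match the task below; adjust them to your backend and hardware:

```yaml
type: fleet
# Hypothetical name, used only for illustration
name: my-cluster-fleet
# Two nodes, matching `nodes: 2` in the task below
nodes: 2
# Required for distributed tasks
placement: cluster
resources:
  gpu: nvidia:1..8
```

The configuration is applied the same way as any other, by passing the file to `dstack apply -f`.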
## Running as a task
Here's an example of a task that runs an AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total).
=== "NCCL tests"
```yaml
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
# Uncomment if the `kubernetes` backend requires it for `/dev/infiniband` access
#privileged: true
resources:
  gpu: nvidia:1..8
  shm_size: 16GB
```
!!! info "Default image"
If you don't specify `image`, `dstack` uses its [base](https://github.com/dstackai/dstack/tree/master/docker/base) Docker image pre-configured with
`uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`).
=== "RCCL tests"
```yaml
type: task
name: rccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
# Mount the system libraries folder from the host
volumes:
  - /usr/local/lib:/mnt/lib
image: rocm/dev-ubuntu-22.04:6.4-complete
env:
  - NCCL_DEBUG=INFO
  - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
commands:
  # Setup MPI and build RCCL tests
  - apt-get install -y git libopenmpi-dev openmpi-bin
  - git clone https://github.com/ROCm/rccl-tests.git
  - cd rccl-tests
  - make MPI=1 MPI_HOME=$OPEN_MPI_HOME
  # Preload the RoCE driver library from the host (for Broadcom driver compatibility)
  - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
  # Run RCCL tests via MPI
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --mca btl_tcp_if_include ens41np0 \
        -x LD_PRELOAD \
        -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
        -x NCCL_IB_GID_INDEX=3 \
        -x NCCL_IB_DISABLE=0 \
        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
    else
      sleep infinity
    fi
resources:
  gpu: MI300X:8
```
!!! info "RoCE library"
Broadcom RoCE drivers require the `libbnxt_re` userspace library inside the container to be compatible with the host’s Broadcom
kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it
using `LD_PRELOAD` when running MPI.
!!! info "Privileged"
In some cases, the backend (e.g., `kubernetes`) may require `privileged: true` to access the high-speed interconnect (e.g., InfiniBand).
### Apply a configuration
To run a configuration, use the [`dstack apply`](../../reference/cli/dstack/apply.md) command.
```shell
$ dstack apply -f nccl-tests.dstack.yml
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 aws us-east-1 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912
2 aws us-west-2 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912
3 aws us-east-2 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912
Submit the run nccl-tests? [y/n]: y
```
## What's next?
1. Check [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md),
[services](../../concepts/services.md), and [fleets](../../concepts/fleets.md).
# docs/examples/inference/sglang.md
---
title: SGLang
description: Deploying Qwen3.6-27B using SGLang on NVIDIA and AMD GPUs
---
# SGLang
This example shows how to deploy `Qwen/Qwen3.6-27B` using
[SGLang](https://github.com/sgl-project/sglang) and `dstack`.
> For a `DeepSeek-V4-Pro` deployment on `B200:8`, see the
[DeepSeek V4](../models/deepseek-v4.md) model page.
## Apply a configuration
Here's an example of a service that deploys
`Qwen/Qwen3.6-27B` using SGLang.
=== "NVIDIA"
```yaml
type: service
name: qwen36
image: lmsysorg/sglang:v0.5.10.post1
commands:
  - |
    sglang serve \
      --model-path Qwen/Qwen3.6-27B \
      --host 0.0.0.0 \
      --port 30000 \
      --tp $DSTACK_GPUS_NUM \
      --reasoning-parser qwen3 \
      --mem-fraction-static 0.8 \
      --context-length 262144
port: 30000
model: Qwen/Qwen3.6-27B
volumes:
  - instance_path: /root/.cache
    path: /root/.cache
    optional: true
resources:
  shm_size: 16GB
  gpu: H100:4
```
=== "AMD"
```yaml
type: service
name: qwen36
image: lmsysorg/sglang:v0.5.10-rocm720-mi30x
commands:
  - |
    sglang serve \
      --model-path Qwen/Qwen3.6-27B \
      --host 0.0.0.0 \
      --port 30000 \
      --tp $DSTACK_GPUS_NUM \
      --reasoning-parser qwen3 \
      --mem-fraction-static 0.8 \
      --context-length 262144
port: 30000
model: Qwen/Qwen3.6-27B
volumes:
  - instance_path: /root/.cache
    path: /root/.cache
    optional: true
resources:
  cpu: 52..
  memory: 896GB..
  shm_size: 16GB
  disk: 450GB..
  gpu: MI300X:4
```
The AMD example keeps the deployment close to the upstream Qwen and SGLang
guidance: a pinned ROCm image, tensor parallelism across all four GPUs, and the
standard `qwen3` reasoning parser without extra ROCm-specific tuning flags.
Save one of the configurations above as `service.dstack.yml`, then use the
[`dstack apply`](../../reference/cli/dstack/apply.md) command.
```shell
$ dstack apply -f service.dstack.yml
```
If no gateway is created, the service endpoint will be available at `/proxy/services///`.
```shell
curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3.6-27B",
"messages": [
{
"role": "user",
"content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer with just the dollar amount."
}
],
"separate_reasoning": true,
"max_tokens": 1024
}'
```
Qwen3.6 uses thinking mode by default. To disable thinking, pass
`"chat_template_kwargs": {"enable_thinking": false}` in the request body. To
enable tool calling, add `--tool-call-parser qwen3_coder` to the serve command.
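
As an illustration of the tool-calling option just mentioned, only the `commands` section of the service configuration needs to change. The fragment below is a sketch: the added flag is the one named above, and the remaining arguments are copied unchanged from the NVIDIA example:

```yaml
commands:
  # Same command as in the example above, with `--tool-call-parser qwen3_coder` added
  - |
    sglang serve \
      --model-path Qwen/Qwen3.6-27B \
      --host 0.0.0.0 \
      --port 30000 \
      --tp $DSTACK_GPUS_NUM \
      --reasoning-parser qwen3 \
      --tool-call-parser qwen3_coder \
      --mem-fraction-static 0.8 \
      --context-length 262144
```

The rest of the configuration (`image`, `port`, `model`, `volumes`, `resources`) stays as shown in the tab you started from.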
> If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen36./`.
## Configuration options
### PD disaggregation
To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), use replica groups: one for [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html), one for prefill workers, and one for decode workers.
=== "NVIDIA"
```yaml
type: service
name: prefill-decode
image: lmsysorg/sglang:v0.5.10.post1
env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
replicas:
  - count: 1
    # For now replica group with router must have count: 1
    commands:
      - pip install smg
      - |
        smg launch \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    resources:
      cpu: 4
    router:
      type: sglang
  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000 \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200
  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000
    resources:
      gpu: H200
port: 8000
model: zai-org/GLM-4.5-Air-FP8
# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s
```
> With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
!!! info "Cluster"
PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
## What's next?
1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)
2. Browse the [Qwen 3.6 SGLang cookbook](https://docs.sglang.io/cookbook/autoregressive/Qwen/Qwen3.6) and the [SGLang server arguments reference](https://docs.sglang.ai/advanced_features/server_arguments.html)
# docs/examples/inference/dynamo.md
---
title: NVIDIA Dynamo
description: Deploying zai-org/GLM-4.5-Air-FP8 using NVIDIA Dynamo with Prefill-Decode disaggregation.
---
# Dynamo
This example shows how to deploy `zai-org/GLM-4.5-Air-FP8` using
[NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) and `dstack`.
## Apply a configuration
Here's an example of a service that deploys `zai-org/GLM-4.5-Air-FP8` using
Dynamo with PD disaggregation.
```yaml
type: service
name: dynamo-pd
env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
replicas:
  - count: 1
    docker: true
    commands:
      - apt-get update
      - apt-get install -y python3-dev python3-venv
      - python3 -m venv ~/dyn-venv
      - source ~/dyn-venv/bin/activate
      - pip install -U pip
      - pip install "ai-dynamo[sglang]==1.1.1"
      - git clone https://github.com/ai-dynamo/dynamo.git
      # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend.
      - docker compose -f dynamo/deploy/docker-compose.yml up -d
      - |
        python3 -m dynamo.frontend \
          --http-host 0.0.0.0 --http-port 8000 \
          --discovery-backend etcd --router-mode kv \
          --kv-cache-block-size 64
    resources:
      cpu: 4
    router:
      type: dynamo
  - count: 1..4
    scaling:
      metric: rps
      target: 3
    python: "3.12"
    nvcc: true
    commands:
      # dstack injects DSTACK_ROUTER_INTERNAL_IP after the router replica
      # is provisioned. Compose the etcd/NATS endpoints from it.
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      # Set to enable /health endpoint required by dstack probes.
      - export DYN_SYSTEM_PORT="8000"
      # Wait until the router's etcd and NATS ports are actually accepting connections.
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
          && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode prefill --disaggregation-transfer-backend nixl
    resources:
      gpu: H200
  - count: 1..8
    scaling:
      metric: rps
      target: 2
    python: "3.12"
    nvcc: true
    commands:
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      - export DYN_SYSTEM_PORT="8000"
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
          && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode decode --disaggregation-transfer-backend nixl
    resources:
      gpu: H200
port: 8000
model: zai-org/GLM-4.5-Air-FP8
# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s
```
> With the `dynamo` router, you can use SGLang, vLLM, and TensorRT-LLM prefill and decode workers.
Save the configuration as `service.dstack.yml`, then use the
[`dstack apply`](../../reference/cli/dstack/apply.md) command.
```shell
$ dstack apply -f service.dstack.yml
```
If no gateway is created, the service endpoint will be available at `/proxy/services///`.
```shell
curl http://127.0.0.1:3000/proxy/services/main/dynamo-pd/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "zai-org/GLM-4.5-Air-FP8",
"messages": [
{
"role": "user",
"content": "What is prefill-decode disaggregation?"
}
],
"max_tokens": 1024
}'
```
> If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://dynamo-pd./`.
## Configuration options
Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
!!! info "Cluster"
PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
## What's next?
1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)
2. Browse the [NVIDIA Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo) and the [SGLang](./sglang.md) example
# docs/examples/inference/vllm.md
---
title: vLLM
description: Deploying Qwen3.6-27B using vLLM on NVIDIA and AMD GPUs
---
# vLLM
This example shows how to deploy `Qwen/Qwen3.6-27B` using
[vLLM](https://docs.vllm.ai/en/latest/) and `dstack`.
## Apply a configuration
Here's an example of a service that deploys
`Qwen/Qwen3.6-27B` using vLLM.
=== "NVIDIA"
```yaml
type: service
name: qwen36
image: vllm/vllm-openai:v0.19.1
commands:
- |
vllm serve Qwen/Qwen3.6-27B \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size $DSTACK_GPUS_NUM \
--max-model-len 262144 \
--reasoning-parser qwen3
port: 8000
model: Qwen/Qwen3.6-27B
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
shm_size: 16GB
gpu: H100:4
```
=== "AMD"
```yaml
type: service
name: qwen36
image: vllm/vllm-openai-rocm:v0.19.1
commands:
- |
vllm serve Qwen/Qwen3.6-27B \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size $DSTACK_GPUS_NUM \
--max-model-len 262144 \
--reasoning-parser qwen3
port: 8000
model: Qwen/Qwen3.6-27B
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
cpu: 52..
memory: 896GB..
shm_size: 16GB
disk: 450GB..
gpu: MI300X:4
```
Qwen3.6-27B is a multimodal model. For text-only workloads, add
`--language-model-only` to free more memory for the KV cache. To enable tool
calling, add `--enable-auto-tool-choice --tool-call-parser qwen3_coder`.
Save one of the configurations above as `service.dstack.yml`, then use the
[`dstack apply`](../../reference/cli/dstack/apply.md) command.
```shell
$ dstack apply -f service.dstack.yml
```
If no gateway is created, the service endpoint will be available at `/proxy/services///`.
```shell
curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3.6-27B",
"messages": [
{
"role": "user",
"content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?"
}
],
"max_tokens": 1024
}'
```
> If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen36./`.
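The same request can be sent from Python. The sketch below uses only the standard library and mirrors the `curl` call above. The helper name `build_chat_request` and its parameters are illustrative, not part of `dstack`; the server address, the project name `main`, and the token placeholder are taken from the example and will differ in your setup.

```python
import json
import urllib.request


def build_chat_request(server_url, project, run_name, token, model, prompt, max_tokens=1024):
    """Build (but do not send) a chat completion request for a service behind the server proxy."""
    url = f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "http://127.0.0.1:3000", "main", "qwen36", "<user token>",
    "Qwen/Qwen3.6-27B", "How much does the ball cost?",
)
# To send it against a running service:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes it easy to inspect the URL and body before the service is up.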
## What's next?
1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)
2. Browse the [Qwen 3.5 & 3.6 vLLM recipe](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html) and the [SGLang](../inference/sglang.md) example
# docs/examples/inference/nim.md
---
title: NVIDIA NIM
description: Deploying Nemotron-3-Super-120B-A12B using NVIDIA NIM
---
# NVIDIA NIM
This example shows how to deploy Nemotron-3-Super-120B-A12B using [NVIDIA NIM](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) and `dstack`.
??? info "Prerequisites"
Once `dstack` is [installed](../../installation.md), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
```
## Deployment
Here's an example of a service that deploys Nemotron-3-Super-120B-A12B using NIM.
```yaml
type: service
name: nemotron120
image: nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:1.8.0
env:
- NGC_API_KEY
registry_auth:
username: $oauthtoken
password: ${{ env.NGC_API_KEY }}
port: 8000
model: nvidia/nemotron-3-super-120b-a12b
volumes:
- instance_path: /root/.cache/nim
path: /opt/nim/.cache
optional: true
resources:
cpu: x86:96..
memory: 512GB..
shm_size: 16GB
disk: 500GB..
gpu: H100:80GB:8
```
### Running a configuration
Save the configuration above as `nemotron120.dstack.yml`, then use the
[`dstack apply`](../../reference/cli/dstack/apply.md) command.
```shell
$ export NGC_API_KEY=...
$ dstack apply -f nemotron120.dstack.yml
```
If no gateway is created, the service endpoint will be available at `/proxy/services///`.
```shell
$ curl http://127.0.0.1:3000/proxy/services/main/nemotron120/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/nemotron-3-super-120b-a12b",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
```
When a [gateway](../../concepts/gateways.md) is configured, the service endpoint will be available at `https://nemotron120./`.
## What's next?
1. Check [services](../../concepts/services.md)
2. Browse the [Nemotron-3-Super-120B-A12B model page](https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b)
# docs/examples/inference/trtllm.md
---
title: TensorRT-LLM
description: Deploying Qwen3-235B-A22B-FP8 using NVIDIA TensorRT-LLM on NVIDIA GPUs
---
# TensorRT-LLM
This example shows how to deploy `nvidia/Qwen3-235B-A22B-FP8` using
[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and `dstack`.
## Apply a configuration
Here's an example of a service that deploys
`nvidia/Qwen3-235B-A22B-FP8` using TensorRT-LLM.
```yaml
type: service
name: qwen235
image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc11
env:
- HF_HUB_ENABLE_HF_TRANSFER=1
commands:
- pip install hf_transfer
- |
trtllm-serve serve nvidia/Qwen3-235B-A22B-FP8 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--tp_size $DSTACK_GPUS_NUM \
--max_batch_size 32 \
--max_num_tokens 4096 \
--kv_cache_free_gpu_memory_fraction 0.75
port: 8000
model: nvidia/Qwen3-235B-A22B-FP8
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
cpu: 96..
memory: 512GB..
shm_size: 32GB
disk: 1000GB..
gpu: H100:8
```
Apply it with [`dstack apply`](../../reference/cli/dstack/apply.md):
```shell
$ dstack apply -f service.dstack.yml
```
## Access the endpoint
If no gateway is created, the service endpoint will be available at `/proxy/services///`.
```shell
$ curl http://127.0.0.1:3000/proxy/services/main/qwen235/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/Qwen3-235B-A22B-FP8",
"messages": [
{
"role": "user",
"content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?"
}
],
"chat_template_kwargs": {"enable_thinking": true},
"max_tokens": 1024,
"temperature": 0.0
}'
```
When a [gateway](../../concepts/gateways.md) is configured, the service endpoint will be available at `https://qwen235./`.
## What's next?
1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)
2. Browse the [TensorRT-LLM deployment guides](https://nvidia.github.io/TensorRT-LLM/deployment-guide/index.html) and the [Qwen3 deployment guide](https://nvidia.github.io/TensorRT-LLM/deployment-guide/deployment-guide-for-qwen3-on-trtllm.html)
3. See the [`trtllm-serve` reference](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve/trtllm-serve.html)
# docs/examples/models/deepseek-v4.md
---
title: DeepSeek V4
description: Deploying DeepSeek-V4-Pro using SGLang on NVIDIA B200:8
---
# DeepSeek V4
This example shows how to deploy `deepseek-ai/DeepSeek-V4-Pro` as a
[service](../../concepts/services.md) using
[SGLang](https://github.com/sgl-project/sglang) and `dstack`.
## Apply a configuration
Save the following configuration as `deepseek-v4.dstack.yml`.
```yaml
type: service
name: deepseek-v4
image: lmsysorg/sglang:deepseek-v4-blackwell
env:
- HF_TOKEN
- SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
- SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
commands:
- |
sglang serve \
--trust-remote-code \
--model-path deepseek-ai/DeepSeek-V4-Pro \
--tp 8 \
--dp 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.82 \
--cuda-graph-max-bs 64 \
--max-running-requests 256 \
--deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' \
--tool-call-parser deepseekv4 \
--reasoning-parser deepseek-v4 \
--host 0.0.0.0 \
--port 30000
port: 30000
model: deepseek-ai/DeepSeek-V4-Pro
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
gpu: B200:8
shm_size: 32GB
disk: 2TB..
```
This configuration uses the single-node Blackwell `DeepSeek-V4-Pro` recipe
shape for `8 x NVIDIA B200`.
Export your Hugging Face token and apply the configuration with
[`dstack apply`](../../reference/cli/dstack/apply.md).
```shell
$ export HF_TOKEN=
$ dstack apply -f deepseek-v4.dstack.yml
```
If no gateway is created, the service endpoint will be available at
`/proxy/services///`.
```shell
curl http://127.0.0.1:3000/proxy/services/main/deepseek-v4/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [
{
"role": "user",
"content": "What is 15% of 240? Reply with just the number."
}
],
"temperature": 0,
"max_tokens": 32
}'
```
## Reasoning mode
To separate the model's reasoning into `reasoning_content`, keep
`--reasoning-parser deepseek-v4` in the server command and send
`chat_template_kwargs` in the request body.
For raw HTTP requests, `chat_template_kwargs` and `separate_reasoning` must be
top-level JSON fields.
```shell
curl http://127.0.0.1:3000/proxy/services/main/deepseek-v4/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [
{
"role": "user",
"content": "Solve step by step: If 3x + 5 = 20, what is x?"
}
],
"temperature": 0,
"max_tokens": 256,
"chat_template_kwargs": {
"thinking": true
},
"separate_reasoning": true
}'
```
This returns both:
- `reasoning_content`: a separate reasoning trace
- `content`: the final user-visible answer
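On the client side, the two fields can be read from the parsed JSON response. This is a minimal sketch that assumes the response follows the usual OpenAI-style `choices[0].message` layout; the helper name `split_reasoning` and the sample response are hypothetical and only illustrate the shape.

```python
def split_reasoning(response):
    """Return (reasoning, answer) from an OpenAI-style chat completion response dict."""
    message = response["choices"][0]["message"]
    # `reasoning_content` may be absent or null when reasoning is not separated.
    return message.get("reasoning_content") or "", message.get("content") or ""


# Hypothetical response, trimmed to the fields used here.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "3x = 15, so x = 5.",
                "content": "x = 5",
            }
        }
    ]
}
reasoning, answer = split_reasoning(sample)
print(answer)
```

Falling back to an empty string keeps the caller's code the same whether or not the request asked for separated reasoning.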
## Deployment notes
- The first startup can take several minutes while the model loads and SGLang
finishes initialization.
- The optional `/root/.cache` instance volume helps reuse the model cache on
backends that support instance volumes.
## What's next?
1. Read the [DeepSeek-V4-Pro model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)
2. Read the [DeepSeek-V4 SGLang cookbook](https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4)
3. Browse the dedicated [SGLang](../inference/sglang.md) and [vLLM](../inference/vllm.md) examples
# docs/examples/models/qwen36.md
---
title: Qwen 3.6
description: Deploying Qwen3.6-27B using SGLang on NVIDIA and AMD GPUs
---
# Qwen 3.6
This example shows how to deploy `Qwen/Qwen3.6-27B` as a
[service](../../concepts/services.md) using
[SGLang](https://github.com/sgl-project/sglang) and `dstack`.
## Apply a configuration
Save one of the following configurations as `qwen36.dstack.yml`.
=== "NVIDIA"
```yaml
type: service
name: qwen36
image: lmsysorg/sglang:v0.5.10.post1
commands:
- |
sglang serve \
--model-path Qwen/Qwen3.6-27B \
--host 0.0.0.0 \
--port 30000 \
--tp $DSTACK_GPUS_NUM \
--reasoning-parser qwen3 \
--mem-fraction-static 0.8 \
--context-length 262144
port: 30000
model: Qwen/Qwen3.6-27B
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
shm_size: 16GB
gpu: H100:4
```
=== "AMD"
```yaml
type: service
name: qwen36
image: lmsysorg/sglang:v0.5.10-rocm720-mi30x
commands:
- |
sglang serve \
--model-path Qwen/Qwen3.6-27B \
--host 0.0.0.0 \
--port 30000 \
--tp $DSTACK_GPUS_NUM \
--reasoning-parser qwen3 \
--mem-fraction-static 0.8 \
--context-length 262144
port: 30000
model: Qwen/Qwen3.6-27B
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
cpu: 52..
memory: 896GB..
shm_size: 16GB
disk: 450GB..
gpu: MI300X:4
```
The NVIDIA and AMD configurations above use pinned SGLang images and the same
straightforward 4-GPU layout used across the Qwen 3.6 docs and examples.
Apply the configuration with
[`dstack apply`](../../reference/cli/dstack/apply.md).
```shell
$ dstack apply -f qwen36.dstack.yml
```
If no gateway is created, the service endpoint will be available at
`/proxy/services///`.
```shell
curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3.6-27B",
"messages": [
{
"role": "user",
"content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer with just the dollar amount."
}
],
"max_tokens": 1024
}'
```
## Thinking mode
Qwen3.6 uses thinking mode by default. With SGLang, the reasoning stream is
returned separately as `reasoning_content`.
To disable thinking, pass `chat_template_kwargs` in the request body.
```shell
curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3.6-27B",
"messages": [
{
"role": "user",
"content": "Summarize the benefits of container images in one sentence."
}
],
"max_tokens": 256,
"chat_template_kwargs": {
"enable_thinking": false
}
}'
```
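When the request body is built in code, the thinking switch can be made an explicit argument. The sketch below is an illustration only: `chat_payload` is a hypothetical helper, and it places `chat_template_kwargs` at the top level of the body, as in the `curl` example above.

```python
import json


def chat_payload(model, prompt, enable_thinking=None, max_tokens=256):
    """Build a request body; `enable_thinking=None` leaves the server default untouched."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if enable_thinking is not None:
        # Top-level field, matching the curl example.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload


body = chat_payload(
    "Qwen/Qwen3.6-27B",
    "Summarize the benefits of container images in one sentence.",
    enable_thinking=False,
)
print(json.dumps(body, indent=2))
```

Omitting the field entirely, rather than sending `true`, keeps the default behavior under the control of the server.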
## What's next?
1. Read the [Qwen/Qwen3.6-27B model card](https://huggingface.co/Qwen/Qwen3.6-27B)
2. Read the [Qwen 3.6 SGLang cookbook](https://docs.sglang.io/cookbook/autoregressive/Qwen/Qwen3.6)
3. Read the [Qwen 3.5 & 3.6 vLLM recipe](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html)
4. Browse the dedicated [SGLang](../inference/sglang.md)
and [vLLM](../inference/vllm.md) examples
5. Check the [AMD](../accelerators/amd.md) example for
more AMD deployment and training configurations
# docs/examples/accelerators/amd.md
---
title: AMD
description: Deploying and fine-tuning models on AMD MI300X GPUs using SGLang, vLLM, TRL, and Axolotl
---
# AMD
`dstack` supports running dev environments, tasks, and services on AMD GPUs.
You can do that by setting up an [SSH fleet](../../concepts/fleets.md#ssh-fleets)
with on-prem AMD GPUs or configuring a backend that offers AMD GPUs such as the `runpod` backend.
## Deployment
Here are examples of [services](../../concepts/services.md) that deploy
`Qwen/Qwen3.6-27B` on AMD MI300X GPUs using
[SGLang](https://github.com/sgl-project/sglang) and
[vLLM](https://docs.vllm.ai/en/latest/).
=== "SGLang"
```yaml
type: service
name: qwen36-service-sglang-amd
image: lmsysorg/sglang:v0.5.10-rocm720-mi30x
commands:
- |
sglang serve \
--model-path Qwen/Qwen3.6-27B \
--host 0.0.0.0 \
--port 30000 \
--tp $DSTACK_GPUS_NUM \
--reasoning-parser qwen3 \
--mem-fraction-static 0.8 \
--context-length 262144
port: 30000
model: Qwen/Qwen3.6-27B
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
cpu: 52..
memory: 896GB..
shm_size: 16GB
disk: 450GB..
gpu: MI300X:4
```
=== "vLLM"
```yaml
type: service
name: qwen36-service-vllm-amd
image: vllm/vllm-openai-rocm:v0.19.1
commands:
- |
vllm serve Qwen/Qwen3.6-27B \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size $DSTACK_GPUS_NUM \
--max-model-len 262144 \
--reasoning-parser qwen3
port: 8000
model: Qwen/Qwen3.6-27B
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
cpu: 52..
memory: 896GB..
shm_size: 16GB
disk: 450GB..
gpu: MI300X:4
```
!!! info "Docker image"
AMD workloads require specifying an image with ROCm-compatible userspace and
framework packages. The SGLang and vLLM examples above use pinned ROCm
images.
If you already have a ROCm-compatible image, use it. Otherwise, choose an
image for the framework you use from
[ROCm Docker images](https://hub.docker.com/u/rocm), e.g. `rocm/sgl-dev`
for SGLang, `rocm/vllm` for vLLM, or `rocm/pytorch` for PyTorch. For
generic AMD dev environments or tasks, use `rocm/dev-ubuntu-24.04`.
To request multiple GPUs, specify the quantity after the GPU name, separated by a colon, e.g., `MI300X:4`.
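To make the colon-separated shape concrete, here is a small Python sketch that splits a spec into its parts. It is not `dstack`'s parser: the function `parse_gpu_spec` is hypothetical, it handles only the name, memory, and count shapes used in these examples, and it does not treat vendor values such as `amd` or `tpu` specially.

```python
import re


def parse_gpu_spec(spec):
    """Split a spec such as `MI300X:4` or `H100:80GB:8` into name, memory, and count."""
    result = {"name": None, "memory": None, "count": None}
    for token in spec.split(":"):
        if re.fullmatch(r"\d+(\.\.\d*)?|\.\.\d+", token):
            result["count"] = token  # e.g. `4` or `1..8`
        elif token != ".." and re.fullmatch(r"(\d+GB)?\.\.(\d+GB)?|\d+GB", token):
            result["memory"] = token  # e.g. `80GB` or `24GB..`
        else:
            result["name"] = token  # e.g. `MI300X` or `A10G,A100`
    return result


print(parse_gpu_spec("MI300X:4"))
print(parse_gpu_spec("H100:80GB:8"))
```

Counts and memory values are kept as strings so that ranges such as `1..8` or `24GB..` pass through unchanged.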
## Fine-tuning
> If you're planning multi-node AMD training, validate cluster networking first
with the [NCCL/RCCL tests](../clusters/nccl-rccl-tests.md)
example.
=== "TRL"
Below is an example of LoRA fine-tuning Llama 3.1 8B using [TRL](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html)
and the [`mlabonne/guanaco-llama2-1k`](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k)
dataset.
```yaml
type: task
name: trl-amd-llama31-train
# Using Runpod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04
# Required environment variables
env:
- HF_TOKEN
# Mount files
files:
- train.py
# Commands of the task
commands:
- export PATH=/opt/conda/envs/py_3.10/bin:$PATH
- git clone https://github.com/ROCm/bitsandbytes
- cd bitsandbytes
- git checkout rocm_enabled
- pip install -r requirements-dev.txt
- cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
- make
- pip install .
- pip install trl
- pip install peft
- pip install transformers datasets huggingface-hub scipy
- cd ..
- python train.py
# Uncomment to leverage spot instances
#spot_policy: auto
resources:
gpu: MI300X
disk: 150GB
```
=== "Axolotl"
Below is an example of fine-tuning Llama 3.1 8B using [Axolotl](https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html)
and the [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
dataset.
```yaml
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-amd-llama31-train
# Using Runpod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.0.2-ubuntu22.04
# Required environment variables
env:
- HF_TOKEN
- WANDB_API_KEY
- WANDB_PROJECT
- WANDB_NAME=axolotl-amd-llama31-train
- HUB_MODEL_ID
# Commands of the task
commands:
- export PATH=/opt/conda/envs/py_3.10/bin:$PATH
- pip uninstall torch torchvision torchaudio -y
- python3 -m pip install --pre torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0/
- git clone https://github.com/OpenAccess-AI-Collective/axolotl
- cd axolotl
- git checkout d4f6c65
- pip install -e .
# Latest pynvml is not compatible with axolotl commit d4f6c65, so we need to fall back to version 11.5.3
- pip uninstall pynvml -y
- pip install pynvml==11.5.3
- cd ..
- wget https://dstack-binaries.s3.amazonaws.com/flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
- pip install flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
- wget https://dstack-binaries.s3.amazonaws.com/xformers-0.0.26-cp310-cp310-linux_x86_64.whl
- pip install xformers-0.0.26-cp310-cp310-linux_x86_64.whl
- git clone --recurse https://github.com/ROCm/bitsandbytes
- cd bitsandbytes
- git checkout rocm_enabled
- pip install -r requirements-dev.txt
- cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
- make
- pip install .
- cd ..
- accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml
--wandb-project "$WANDB_PROJECT"
--wandb-name "$WANDB_NAME"
--hub-model-id "$HUB_MODEL_ID"
resources:
gpu: MI300X
disk: 150GB
```
Note that ROCm support requires checking out commit `d4f6c65`. This commit eliminates the need to manually modify the Axolotl source code to make xformers compatible with ROCm, as described in the [xformers workaround](https://docs.axolotl.ai/docs/amd_hpc.html#apply-xformers-workaround). The same installation approach is used to build the Axolotl ROCm Docker image (see the [Dockerfile](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm)).
> To speed up installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3.
## Running a configuration
Once a configuration is ready, save it to a `.dstack.yml` file. If your
configuration references environment variables such as `HF_TOKEN` or
`WANDB_API_KEY`, export them first. Then run
`dstack apply -f `, and `dstack` will automatically
provision the cloud resources and run the configuration.
```shell
$ dstack apply -f
```
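A quick pre-flight check can catch variables that were never exported before a run is submitted. The sketch below is one possible approach, not something `dstack` provides: `missing_env` is a hypothetical helper, and the list of names is copied from the `env` section of the Axolotl example above.

```python
import os


def missing_env(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Variables listed without a value under `env` in the Axolotl example.
required = ["HF_TOKEN", "WANDB_API_KEY", "WANDB_PROJECT", "HUB_MODEL_ID"]

# A made-up environment is passed here for illustration; omit the second
# argument to check the real process environment.
missing = missing_env(required, {"HF_TOKEN": "hf_example"})
print("Missing:", ", ".join(missing))
```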
## What's next?
1. Browse the dedicated [SGLang](../inference/sglang.md)
and [vLLM](../inference/vllm.md) examples, plus
[Axolotl](https://github.com/ROCm/rocm-blogs/tree/release/blogs/artificial-intelligence/axolotl),
[TRL](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/fine-tuning-and-inference.html),
and [ROCm Bitsandbytes](https://github.com/ROCm/bitsandbytes)
2. For multi-node training, run
[NCCL/RCCL tests](../clusters/nccl-rccl-tests.md)
to validate AMD cluster networking.
3. Check [dev environments](../../concepts/dev-environments.md),
[tasks](../../concepts/tasks.md), and
[services](../../concepts/services.md).
# docs/examples/accelerators/tpu.md
---
title: TPU
description: Deploying and fine-tuning models on Google Cloud TPUs using Optimum TPU and vLLM
---
# TPU
If you've configured the `gcp` backend in `dstack`, you can run dev environments, tasks, and services on [TPUs](https://cloud.google.com/tpu/docs/intro-to-tpu).
Choose a TPU instance by specifying the TPU version and the number of cores (e.g. `v5litepod-8`) in the `gpu` property under `resources`,
or request TPUs by specifying `tpu` as `vendor` ([see examples](../../guides/protips.md#gpu)).
Below are a few examples of using TPUs for deployment and fine-tuning.
!!! info "Multi-host TPUs"
Currently, `dstack` supports only single-host TPUs, which means that
the maximum supported number of cores is `8` (e.g. `v2-8`, `v3-8`, `v5litepod-8`, `v5p-8`, `v6e-8`).
Multi-host TPU support is on the roadmap.
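The core count is the number after the last hyphen in the TPU name, so the single-host limit can be checked before submitting a run. This is a sketch under that naming assumption; `tpu_cores` and `is_single_host` are hypothetical helpers, not `dstack` functions.

```python
MAX_SINGLE_HOST_CORES = 8  # limit stated above


def tpu_cores(name):
    """Extract the core count from a name such as `v5litepod-8`."""
    version, _, cores = name.rpartition("-")
    if not version or not cores.isdigit():
        raise ValueError(f"unexpected TPU name: {name!r}")
    return int(cores)


def is_single_host(name):
    """True when the core count is within the single-host limit."""
    return tpu_cores(name) <= MAX_SINGLE_HOST_CORES


for name in ("v2-8", "v5litepod-4", "v5litepod-16"):
    print(name, is_single_host(name))
```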
!!! info "TPU storage"
By default, each TPU VM contains a 100GB boot disk and its size cannot be changed.
If you need more storage, attach additional disks using [Volumes](../../concepts/volumes.md).
## Deployment
Many serving frameworks including vLLM and TGI have TPU support.
Here's an example of a [service](../../concepts/services.md) that deploys Llama 3.1 8B using
[Optimum TPU](https://github.com/huggingface/optimum-tpu)
and [vLLM](https://github.com/vllm-project/vllm).
=== "Optimum TPU"
```yaml
type: service
name: llama31-service-optimum-tpu
image: dstackai/optimum-tpu:llama31
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
- MAX_TOTAL_TOKENS=4096
- MAX_BATCH_PREFILL_TOKENS=4095
commands:
- text-generation-launcher --port 8000
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct
resources:
gpu: v5litepod-4
```
Note that Optimum TPU sets `MAX_INPUT_TOKEN` to 4095 by default, so `MAX_BATCH_PREFILL_TOKENS` must also be set to 4095.
??? info "Docker image"
The official Docker image `huggingface/optimum-tpu:latest` doesn’t support Llama 3.1-8B.
We’ve created a custom image with the fix: `dstackai/optimum-tpu:llama31`.
Once the [pull request](https://github.com/huggingface/optimum-tpu/pull/92) is merged,
the official Docker image can be used.
=== "vLLM"
```yaml
type: service
name: llama31-service-vllm-tpu
env:
- MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
- HF_TOKEN
- DATE=20240828
- TORCH_VERSION=2.5.0
- VLLM_TARGET_DEVICE=tpu
- MAX_MODEL_LEN=4096
commands:
- pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-${TORCH_VERSION}.dev${DATE}-cp311-cp311-linux_x86_64.whl
- pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-${TORCH_VERSION}.dev${DATE}-cp311-cp311-linux_x86_64.whl
- pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
- pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
- git clone https://github.com/vllm-project/vllm.git
- cd vllm
- pip install -r requirements-tpu.txt
- apt-get install -y libopenblas-base libopenmpi-dev libomp-dev
- python setup.py develop
- vllm serve $MODEL_ID
--tensor-parallel-size 4
--max-model-len $MAX_MODEL_LEN
--port 8000
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct
# Uncomment to leverage spot instances
#spot_policy: auto
resources:
gpu: v5litepod-4
```
Note that when using Llama 3.1 8B with a `v5litepod`, which has 16GB of memory per core, the context size must be limited to 4096 tokens to fit in memory.
### Memory requirements
Below are the approximate memory requirements for serving LLMs with the minimal required TPU configuration:
| Model size | bfloat16 | TPU (bfloat16) | int8  | TPU (int8)   |
|------------|----------|----------------|-------|--------------|
| **8B**     | 16GB     | v5litepod-4    | 8GB   | v5litepod-4  |
| **70B**    | 140GB    | v5litepod-16   | 70GB  | v5litepod-16 |
| **405B**   | 810GB    | v5litepod-64   | 405GB | v5litepod-64 |
Note, `v5litepod` is optimized for serving transformer-based models. Each core is equipped with 16GB of memory.
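The memory figures in the table follow from the parameter count times the bytes per parameter (2 for bfloat16, 1 for int8). The sketch below reproduces that arithmetic and derives a lower bound on the number of cores from the 16GB per core noted above. The function names are illustrative, and the bound covers the weights only: the configurations in the table are larger, which leaves room for the KV cache and activations.

```python
import math

BYTES_PER_PARAM = {"bfloat16": 2, "int8": 1}
HBM_PER_CORE_GB = 16  # v5litepod, per the note above


def weights_gb(params_billion, dtype):
    """Memory for the weights alone, in GB (1B parameters at 1 byte each is about 1GB)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_cores_for_weights(params_billion, dtype):
    """Lower bound on cores: ignores the KV cache, activations, and available slice sizes."""
    return math.ceil(weights_gb(params_billion, dtype) / HBM_PER_CORE_GB)


for size in (8, 70, 405):
    print(size, weights_gb(size, "bfloat16"), min_cores_for_weights(size, "bfloat16"))
```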
### Supported frameworks
| Framework | Quantization | Note |
|-----------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **TGI** | bfloat16 | To deploy with TGI, Optimum TPU must be used. |
| **vLLM** | int8, bfloat16 | int8 quantization still requires the same memory because the weights are first moved to the TPU in bfloat16, and then converted to int8. See the [pull request](https://github.com/vllm-project/vllm/pull/7005) for more details. |
### Running a configuration
Once the configuration is ready, run `dstack apply -f `, and `dstack` will automatically provision the
cloud resources and run the configuration.
## Fine-tuning with Optimum TPU
Below is an example of fine-tuning Llama 3.1 8B using [Optimum TPU](https://github.com/huggingface/optimum-tpu)
and the [`Abirate/english_quotes`](https://huggingface.co/datasets/Abirate/english_quotes)
dataset.
```yaml
type: task
name: optimum-tpu-llama-train
python: "3.11"
env:
- HF_TOKEN
files:
- train.py
- config.yaml
commands:
- git clone -b add_llama_31_support https://github.com/dstackai/optimum-tpu.git
- mkdir -p optimum-tpu/examples/custom/
- cp train.py optimum-tpu/examples/custom/train.py
- cp config.yaml optimum-tpu/examples/custom/config.yaml
- cd optimum-tpu
- pip install -e . -f https://storage.googleapis.com/libtpu-releases/index.html
- pip install datasets evaluate
- pip install accelerate -U
- pip install peft
- python examples/custom/train.py examples/custom/config.yaml
resources:
gpu: v5litepod-8
```
[//]: # (### Fine-Tuning with TRL)
[//]: # (Use the example `examples/single-node-training/optimum-tpu/gemma/train.dstack.yml` to Finetune `Gemma-2B` model using `trl` with `dstack` and `optimum-tpu`. )
### Memory requirements
Below are the approximate memory requirements for fine-tuning LLMs with the minimal required TPU configuration:
| Model size | LoRA | TPU |
|------------|-------|--------------|
| **8B** | 16GB | v5litepod-8 |
| **70B** | 160GB | v5litepod-16 |
| **405B** | 950GB | v5litepod-64 |
Note, `v5litepod` is optimized for fine-tuning transformer-based models. Each core is equipped with 16GB of memory.
### Supported frameworks
| Framework | Quantization | Note |
|-----------------|--------------|---------------------------------------------------------------------------------------------------|
| **TRL** | bfloat16 | To fine-tune using TRL, Optimum TPU is recommended. TRL doesn't support Llama 3.1 out of the box. |
| **PyTorch XLA** | bfloat16     |                                                                                                   |
## What's next?
1. Browse [Optimum TPU](https://github.com/huggingface/optimum-tpu),
[Optimum TPU TGI](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference) and
[vLLM](https://docs.vllm.ai/en/latest/getting_started/tpu-installation.html).
2. Check [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md),
[services](../../concepts/services.md), and [fleets](../../concepts/fleets.md).
# docs/examples/accelerators/tenstorrent.md
---
title: Tenstorrent
description: Running dev environments, tasks, and services on Tenstorrent Wormhole accelerators
---
# Tenstorrent
`dstack` supports running dev environments, tasks, and services on Tenstorrent
[Wormhole](https://tenstorrent.com/en/hardware/wormhole) accelerators via SSH fleets.
??? info "SSH fleets"
```yaml
type: fleet
name: tt-fleet
ssh_config:
user: root
identity_file: ~/.ssh/id_rsa
# Configure any number of hosts with n150 or n300 PCIe boards
hosts:
- 192.168.2.108
```
> Hosts should be pre-installed with [Tenstorrent software](https://docs.tenstorrent.com/getting-started/README.html#software-installation).
This should include the drivers, `tt-smi`, and HugePages.
To apply the fleet configuration, run:
```bash
$ dstack apply -f tt-fleet.dstack.yml
FLEET RESOURCES PRICE STATUS CREATED
tt-fleet cpu=12 mem=32GB disk=243GB n150:12GB $0 idle 18 sec ago
```
For more details on fleet configuration, refer to [SSH fleets](../../concepts/fleets.md#ssh-fleets).
## Services
Here's an example of a service that deploys
[`Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B)
using [Tenstorrent Inference Server](https://github.com/tenstorrent/tt-inference-server).
```yaml
type: service
name: tt-inference-server
env:
- HF_TOKEN
- HF_MODEL_REPO_ID=meta-llama/Llama-3.2-1B-Instruct
image: ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-release-ubuntu-20.04-amd64:0.0.4-v0.56.0-rc47-e2e0002ac7dc
commands:
- |
. ${PYTHON_ENV_DIR}/bin/activate
pip install "huggingface_hub[cli]"
export LLAMA_DIR="/data/models--$(echo "$HF_MODEL_REPO_ID" | sed 's/\//--/g')/"
huggingface-cli download $HF_MODEL_REPO_ID --local-dir $LLAMA_DIR
python /home/container_app_user/app/src/run_vllm_api_server.py
port: 7000
model: meta-llama/Llama-3.2-1B-Instruct
# Cache downloaded model
volumes:
- /mnt/data/tt-inference-server/data:/data
resources:
gpu: n150:1
```
Run the configuration using `dstack apply`:
```bash
$ dstack apply -f service.dstack.yml
```
Once the service is up, it will be available via the service endpoint
at `/proxy/services///`.
```shell
$ curl http://127.0.0.1:3000/proxy/services/main/tt-inference-server/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"stream": true,
"max_tokens": 512
}'
```
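Because the request sets `"stream": true`, the reply arrives as a sequence of events rather than a single JSON document. The sketch below shows one way to reassemble the text, assuming the server emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`). The helper `iter_stream_text` and the sample lines are hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from `data: ...` lines of an OpenAI-style streaming response."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]


# Hypothetical stream, trimmed to the fields used here.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Deep "}}]}',
    'data: {"choices": [{"delta": {"content": "learning"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))
```

The function accepts any iterable of text lines, so it can be fed from a decoded HTTP response body as well as from a list.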
Additionally, the model is available via `dstack`'s control plane UI.
When a [gateway](../../concepts/gateways.md) is configured, the service endpoint
is available at `https://./`.
> Services support many options, including authentication, auto-scaling policies, etc. To learn more, refer to [Services](../../concepts/services.md).
## Tasks
Below is a task that simply runs `tt-smi -s`. Tasks can be used for training, fine-tuning, batch inference, or anything else.
```yaml
type: task
# The name is optional, if not specified, generated randomly
name: tt-smi
env:
- HF_TOKEN
# (Required) Use any image with TT drivers
image: dstackai/tt-smi:latest
# Use any commands
commands:
- tt-smi -s
# Specify the number of accelerators, model, etc
resources:
gpu: n150:1
# Uncomment if you want to run on a cluster of nodes
#nodes: 2
```
> Tasks support many options, including multi-node configuration, max duration, etc. To learn more, refer to [Tasks](../../concepts/tasks.md).
## Dev environments
Below is an example of a dev environment configuration. It can be used to provision a dev environment that can be accessed via your desktop IDE.
```yaml
type: dev-environment
# The name is optional, if not specified, generated randomly
name: cursor
# (Optional) List required env variables
env:
- HF_TOKEN
image: dstackai/tt-smi:latest
# Can be `vscode` or `cursor`
ide: cursor
resources:
gpu: n150:1
```
If you run it via `dstack apply`, it will output the URL to access it via your desktop IDE.
> Dev environments support many options, including inactivity and max duration, IDE configuration, etc. To learn more, refer to [Dev environments](../../concepts/dev-environments.md).
??? info "Feedback"
Found a bug, or want to request a feature? File it in the [issue tracker](https://github.com/dstackai/dstack/issues),
or share via [Discord](https://discord.gg/u8SmfwPpMd).