AMD

dstack natively supports AMD GPUs. This page covers the basics of setting up fleets, running inference, training, and dev environments on AMD GPUs.

Fleets

dstack supports native cloud provisioning, and can also work with existing Kubernetes clusters or vanilla bare-metal hosts.

dstack can natively provision VMs with AMD GPUs in several clouds, including AMD Developer Cloud and Hot Aisle. Support for more clouds is coming soon.

To provision compute in these clouds, configure the corresponding backend and create a backend fleet.
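As a minimal sketch, a backend fleet that requests MI300X instances might look like the following. The fleet name and the node and GPU counts are illustrative assumptions rather than values prescribed by this page; adjust them to what your configured backend actually offers.

```yaml
type: fleet
# Hypothetical fleet name, chosen for illustration
name: amd-fleet

# Number of instances to provision (assumed value)
nodes: 1

resources:
  # At least four MI300X GPUs per instance, matching the
  # service examples on this page (assumed requirement)
  gpu: MI300X:4..
```

Saved as, say, fleet.dstack.yml, the file is applied with dstack apply like any other configuration.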

To use dstack with existing Kubernetes cluster(s), configure the kubernetes backend and point it to your kubeconfig file. Then create a backend fleet.
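A kubernetes backend entry in the dstack server configuration could look roughly like the sketch below. The project name and kubeconfig path are assumptions, and some dstack versions may require additional fields, so treat it as a starting point and check the backend reference for your version.

```yaml
# Sketch of a server config, typically ~/.dstack/server/config.yml
projects:
  - name: main            # assumed project name
    backends:
      - type: kubernetes
        kubeconfig:
          # Assumed path to the kubeconfig of the existing cluster
          filename: ~/.kube/config
```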

If you'd like dstack to use a cluster or machine that is already provisioned and that you have access to, create an SSH fleet.
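For illustration, an SSH fleet over two already provisioned hosts might be declared as follows. The fleet name, user, key path, and host addresses are placeholders (the addresses come from the documentation-only 192.0.2.0/24 range), not values from this page.

```yaml
type: fleet
# Hypothetical fleet name
name: amd-ssh-fleet

ssh_config:
  # Placeholder SSH user and private key used to reach the hosts
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  # Placeholder addresses; replace with your own machines
  hosts:
    - 192.0.2.10
    - 192.0.2.11
```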

Cluster placement

For multi-node workloads, the fleet must set placement to cluster. For Kubernetes and SSH fleets, the network must be properly configured.
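The placement requirement above can be sketched as a fleet configuration like the one below. The name, node count, and GPU spec are assumptions for illustration; the relevant part is the placement property.

```yaml
type: fleet
# Hypothetical fleet name
name: amd-cluster-fleet

# Two nodes that should be interconnected (assumed count)
nodes: 2
# Required for multi-node workloads
placement: cluster

resources:
  # Assumed GPU requirement per node
  gpu: MI300X:4..
```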

To test whether the cluster is properly configured, run the RCCL tests via a distributed task.

Once a fleet is created, you can run dev environments, tasks, and services.

Inference

Below are two example services that deploy Qwen/Qwen3.6-27B on AMD MI300X GPUs. The first configuration uses SGLang; the second, shown after the PD disaggregation note, uses vLLM.

type: service
name: qwen36-sglang-amd

image: lmsysorg/sglang:v0.5.10-rocm720-mi30x

commands:
  - |
    sglang serve \
      --model-path Qwen/Qwen3.6-27B \
      --host 0.0.0.0 \
      --port 30000 \
      --tp $DSTACK_GPUS_NUM \
      --reasoning-parser qwen3 \
      --mem-fraction-static 0.8 \
      --context-length 262144

port: 30000
model: Qwen/Qwen3.6-27B

volumes:
  - instance_path: /root/.cache
    path: /root/.cache
    optional: true

resources:
  cpu: 52..
  memory: 896GB..
  shm_size: 16GB
  disk: 450GB..
  gpu: MI300X:4..

PD disaggregation

To run SGLang with prefill and decode workers on an interconnected cluster of AMD GPU instances, see the SGLang PD disaggregation example.

For multi-node PD disaggregation, the fleet must use cluster placement.

type: service
name: qwen36-vllm-amd

image: vllm/vllm-openai-rocm:v0.19.1

commands:
  - |
    vllm serve Qwen/Qwen3.6-27B \
      --host 0.0.0.0 \
      --port 8000 \
      --tensor-parallel-size $DSTACK_GPUS_NUM \
      --max-model-len 262144 \
      --reasoning-parser qwen3

port: 8000
model: Qwen/Qwen3.6-27B

volumes:
  - instance_path: /root/.cache
    path: /root/.cache
    optional: true

resources:
  cpu: 52..
  memory: 896GB..
  shm_size: 16GB
  disk: 450GB..
  gpu: MI300X:4..

Use the dstack apply command to apply any configuration, including services, tasks, dev environments, and fleets.

$ dstack apply -f service.dstack.yml

Training

Below is a task that fine-tunes a small language model using the official Transformers causal language modeling example on AMD GPUs.

type: task
name: amd-qwen3-train

image: rocm/pytorch:latest

commands:
  - git clone --depth 1 https://github.com/huggingface/transformers.git
  - pip install -e ./transformers -r transformers/examples/pytorch/language-modeling/requirements.txt
  - |
    torchrun --standalone --nproc-per-node $DSTACK_GPUS_PER_NODE \
      transformers/examples/pytorch/language-modeling/run_clm.py \
      --model_name_or_path Qwen/Qwen3-0.6B-Base \
      --dataset_name Salesforce/wikitext \
      --dataset_config_name wikitext-2-raw-v1 \
      --do_train \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 8 \
      --max_steps 10 \
      --block_size 512 \
      --learning_rate 2e-5 \
      --bf16 \
      --logging_steps 1 \
      --output_dir /tmp/qwen3-clm

resources:
  gpu: MI300X:4..
  disk: 100GB..

Distributed tasks

To run training across multiple nodes, use distributed tasks. When a distributed task runs on a cluster, the fleet must use cluster placement.
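A rough sketch of such a distributed task is shown below, assuming two nodes and a hypothetical train.py script supplied by your project. The DSTACK_* variables are set by dstack on each node; the task name, node count, and rendezvous port are assumptions.

```yaml
type: task
# Hypothetical task name
name: amd-train-distrib

# Run the same commands on two nodes (assumed count)
nodes: 2

image: rocm/pytorch:latest

commands:
  # train.py is a placeholder for your own training script;
  # 29500 is an arbitrary rendezvous port
  - |
    torchrun \
      --nnodes $DSTACK_NODES_NUM \
      --node-rank $DSTACK_NODE_RANK \
      --nproc-per-node $DSTACK_GPUS_PER_NODE \
      --master-addr $DSTACK_MASTER_NODE_IP \
      --master-port 29500 \
      train.py

resources:
  gpu: MI300X:4..
```

Unlike the single-node example, this variant drops --standalone and passes the rendezvous details explicitly so that the workers on every node can find the master node.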

Dev environments

Here's an example of a dev environment that can be accessed via your desktop IDE.

type: dev-environment
name: amd-vscode

image: rocm/dev-ubuntu-24.04

ide: vscode

resources:
  gpu: MI300X:1

Docker image

If you'd like a run to use AMD GPUs, make sure the run configuration sets the image property.

The image's ROCm runtime must be compatible with the AMD GPUs the run will use. The image should also include the packages your workload needs.

Metrics

Run and job metrics include CPU, memory, and GPU usage. They are available in the UI and via the CLI:

$ dstack metrics <run name>

AMD GPU metrics require amd-smi to be available in the run image. If it isn't present, GPU metrics may be unavailable.

What's next?

  1. Browse the dedicated SGLang and vLLM examples, plus the Qwen 3.6 model page.
  2. For multi-node inference, see SGLang PD disaggregation.
  3. For cluster validation, run NCCL/RCCL tests.
  4. Check dev environments, tasks, services, fleets, and backends.