AMD¶
dstack natively supports AMD GPUs. This page covers the basics of setting up
fleets, running inference, training, and dev environments on AMD GPUs.
Fleets¶
dstack can provision compute natively in the cloud, and can also work with
existing Kubernetes clusters or bare-metal hosts.
Native provisioning of VMs with AMD GPUs is available across a number
of clouds, including
AMD Developer Cloud and
Hot Aisle. Support for more clouds is
coming soon.
To provision compute in these clouds, configure the corresponding backend and create a backend fleet.
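As a rough illustration of the second step, a backend fleet can be described in a short YAML file. This is a minimal sketch, not a recommended setup: the fleet name, the node range, and the GPU count are assumptions chosen for illustration, and it presumes a backend with MI300X capacity is already configured.

```yaml
type: fleet
# Hypothetical name, used only for illustration
name: amd-fleet
# Assumed range: let dstack provision between 0 and 2 instances on demand
nodes: 0..2
resources:
  # Assumed requirement: instances with at least four MI300X GPUs
  gpu: MI300X:4..
```

Saved as, say, fleet.dstack.yml, the file would be applied with dstack apply -f fleet.dstack.yml.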
To use dstack with one or more existing Kubernetes clusters, configure the
kubernetes backend and point it
to your kubeconfig file. Then create a
backend fleet.
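A sketch of what pointing the kubernetes backend at a kubeconfig file might look like in the server configuration (~/.dstack/server/config.yml). The project name and the kubeconfig path are placeholders, and a real cluster may need further backend settings that are not shown here, so treat this as a starting point and check the backend reference.

```yaml
projects:
  # Placeholder project name
  - name: main
    backends:
      - type: kubernetes
        kubeconfig:
          # Placeholder path to the kubeconfig of the existing cluster
          filename: ~/.kube/config
```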
To have dstack use a cluster or machine that is already
provisioned and that you can reach over SSH, create an
SSH fleet.
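An SSH fleet is described by the hosts it should connect to and the credentials to use. The sketch below assumes two hosts that already have AMD GPUs and drivers installed. The fleet name, user, key path, and IP addresses are all placeholders.

```yaml
type: fleet
# Hypothetical name, used only for illustration
name: amd-ssh-fleet
ssh_config:
  # Placeholder SSH user and private key
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  # Placeholder addresses (documentation range); replace with your hosts
  hosts:
    - 192.0.2.10
    - 192.0.2.11
```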
Cluster placement
For multi-node workloads, the fleet must
set placement to cluster.
For Kubernetes and SSH fleets, the network must be properly configured.
To test whether the cluster is properly configured, run the RCCL tests via a distributed task.
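In a fleet configuration, the placement requirement above comes down to a single field. The following is a minimal sketch for a backend fleet. The name, node count, and GPU spec are assumptions. For an SSH fleet, the same placement field would sit alongside ssh_config.

```yaml
type: fleet
# Hypothetical name, used only for illustration
name: amd-cluster-fleet
# Assumed size: two interconnected nodes
nodes: 2
# Required for multi-node workloads
placement: cluster
resources:
  # Assumed requirement: instances with at least four MI300X GPUs
  gpu: MI300X:4..
```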
Once a fleet is created, you can run dev environments, tasks, and services.
Inference¶
The two configurations below each define a service that deploys
Qwen/Qwen3.6-27B on AMD MI300X GPUs. The first uses
SGLang and the second uses
vLLM.
type: service
name: qwen36-sglang-amd
image: lmsysorg/sglang:v0.5.10-rocm720-mi30x
commands:
  - |
    sglang serve \
      --model-path Qwen/Qwen3.6-27B \
      --host 0.0.0.0 \
      --port 30000 \
      --tp $DSTACK_GPUS_NUM \
      --reasoning-parser qwen3 \
      --mem-fraction-static 0.8 \
      --context-length 262144
port: 30000
model: Qwen/Qwen3.6-27B
volumes:
  - instance_path: /root/.cache
    path: /root/.cache
    optional: true
resources:
  cpu: 52..
  memory: 896GB..
  shm_size: 16GB
  disk: 450GB..
  gpu: MI300X:4..
PD disaggregation
To run SGLang with prefill and decode workers on an interconnected cluster of AMD GPU instances, see the SGLang PD disaggregation example.
For multi-node PD disaggregation, the fleet must use cluster placement.
type: service
name: qwen36-vllm-amd
image: vllm/vllm-openai-rocm:v0.19.1
commands:
  - |
    vllm serve Qwen/Qwen3.6-27B \
      --host 0.0.0.0 \
      --port 8000 \
      --tensor-parallel-size $DSTACK_GPUS_NUM \
      --max-model-len 262144 \
      --reasoning-parser qwen3
port: 8000
model: Qwen/Qwen3.6-27B
volumes:
  - instance_path: /root/.cache
    path: /root/.cache
    optional: true
resources:
  cpu: 52..
  memory: 896GB..
  shm_size: 16GB
  disk: 450GB..
  gpu: MI300X:4..
Use the dstack apply command to apply
any configuration, including services, tasks, dev environments, and fleets.
$ dstack apply -f service.dstack.yml
Training¶
Below is a task that fine-tunes a small language model using the official Transformers causal language modeling example on AMD GPUs.
type: task
name: amd-qwen3-train
image: rocm/pytorch:latest
commands:
  - git clone --depth 1 https://github.com/huggingface/transformers.git
  - pip install -e ./transformers -r transformers/examples/pytorch/language-modeling/requirements.txt
  - |
    torchrun --standalone --nproc-per-node $DSTACK_GPUS_PER_NODE \
      transformers/examples/pytorch/language-modeling/run_clm.py \
      --model_name_or_path Qwen/Qwen3-0.6B-Base \
      --dataset_name Salesforce/wikitext \
      --dataset_config_name wikitext-2-raw-v1 \
      --do_train \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 8 \
      --max_steps 10 \
      --block_size 512 \
      --learning_rate 2e-5 \
      --bf16 \
      --logging_steps 1 \
      --output_dir /tmp/qwen3-clm
resources:
  gpu: MI300X:4..
  disk: 100GB..
Distributed tasks
To run training across multiple nodes, use distributed tasks. Distributed tasks may run on a cluster; in that case, the fleet must use cluster placement.
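As a sketch of how the single-node task above might be extended to two nodes, the configuration below adds nodes and replaces --standalone with explicit rendezvous flags built from the environment variables dstack sets for distributed tasks. The task name, the node count, and the master port are assumptions, and the setup has not been tuned for any particular cluster.

```yaml
type: task
# Hypothetical name, used only for illustration
name: amd-qwen3-train-distrib
# Assumed size: run the same commands on two nodes of a cluster fleet
nodes: 2
image: rocm/pytorch:latest
commands:
  - git clone --depth 1 https://github.com/huggingface/transformers.git
  - pip install -e ./transformers -r transformers/examples/pytorch/language-modeling/requirements.txt
  - |
    # Each node starts torchrun with its own rank; all nodes rendezvous
    # at the master node. Port 29500 is an arbitrary choice.
    torchrun \
      --nnodes $DSTACK_NODES_NUM \
      --node-rank $DSTACK_NODE_RANK \
      --nproc-per-node $DSTACK_GPUS_PER_NODE \
      --master-addr $DSTACK_MASTER_NODE_IP \
      --master-port 29500 \
      transformers/examples/pytorch/language-modeling/run_clm.py \
      --model_name_or_path Qwen/Qwen3-0.6B-Base \
      --dataset_name Salesforce/wikitext \
      --dataset_config_name wikitext-2-raw-v1 \
      --do_train \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 8 \
      --max_steps 10 \
      --block_size 512 \
      --learning_rate 2e-5 \
      --bf16 \
      --logging_steps 1 \
      --output_dir /tmp/qwen3-clm
resources:
  gpu: MI300X:4..
  disk: 100GB..
```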
Dev environments¶
Here's an example of a dev environment that can be accessed via your desktop IDE.
type: dev-environment
name: amd-vscode
image: rocm/dev-ubuntu-24.04
ide: vscode
resources:
  gpu: MI300X:1
Docker image¶
For a run to use AMD GPUs, its configuration must specify
image.
The image's ROCm runtime must be compatible with the AMD GPUs the run will use. The image should also include the packages your workload needs.
Metrics¶
Run and job metrics include CPU, memory, and GPU usage. They are available in the UI and via the CLI:
$ dstack metrics <run name>
AMD GPU metrics require
amd-smi to be available in the run image. If it isn't present, GPU metrics may be unavailable.
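One way to check an image before relying on GPU metrics is a short task that calls amd-smi directly. This is a sketch under assumptions: the task name is hypothetical, and the image is simply the one used in the dev environment example, which may or may not ship amd-smi depending on its ROCm version. If the tool is missing, the first command fails and so does the run.

```yaml
type: task
# Hypothetical name, used only for illustration
name: amd-smi-check
# Replace with the image you plan to use for your runs
image: rocm/dev-ubuntu-24.04
commands:
  # Fails with "command not found" if the image lacks amd-smi
  - amd-smi version
  # Lists the GPUs visible inside the container
  - amd-smi list
resources:
  gpu: MI300X:1
```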
What's next?¶
- Browse the dedicated SGLang and vLLM examples, plus the Qwen 3.6 model page.
- For multi-node inference, see SGLang PD disaggregation.
- For cluster validation, run NCCL/RCCL tests.
- Check dev environments, tasks, services, fleets, and backends.