Tasks¶
A task allows you to run arbitrary commands on one or more nodes. Tasks are best suited for jobs like training or batch processing.
Apply a configuration¶
First, define a task configuration as a YAML file in your project folder.
The filename must end with `.dstack.yml` (e.g. `.dstack.yml` or `dev.dstack.yml` are both acceptable).
type: task
# The name is optional, if not specified, generated randomly
name: trl-sft
python: 3.12
# Uncomment to use a custom Docker image
#image: huggingface/trl-latest-gpu
env:
  - MODEL=Qwen/Qwen2.5-0.5B
  - DATASET=stanfordnlp/imdb
commands:
  - uv pip install trl
  - |
    trl sft \
      --model_name_or_path $MODEL --dataset_name $DATASET \
      --num_processes $DSTACK_GPUS_PER_NODE
resources:
  # One to two H100 GPUs
  gpu: H100:1..2
  shm_size: 24GB
To run a task, pass the configuration to `dstack apply`:
$ dstack apply -f .dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB:2 yes $0.22
2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB:2 yes $0.22
3 gcp us-west4 27xCPU, 150GB, A5000:24GB:3 yes $0.33
Submit the run trl-sft? [y/n]: y
Launching `trl-sft`...
---> 100%
{'loss': 1.4967, 'grad_norm': 1.2734375, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
0% 1/24680 [00:13<95:34:17, 13.94s/it]
6% 73/1300 [00:48<13:57, 1.47it/s]
`dstack apply` automatically provisions instances, uploads the contents of the repo (incl. your local uncommitted changes), and runs the commands.
Configuration options¶
Ports¶
A task can configure ports. If the task runs an application on a port, `dstack apply` lets you securely access this port from your local machine through port forwarding.
type: task
name: streamlit-hello
python: 3.12
commands:
  - uv pip install streamlit
  - streamlit hello
ports:
  - 8501
When running it, `dstack apply` forwards port 8501 to `localhost:8501`, enabling secure access to the running application.
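For instance, while `dstack apply` stays attached, you can check the forwarded port from another local terminal (a minimal sanity check, assuming the configuration above):

```shell
$ curl http://localhost:8501
```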
Distributed tasks¶
By default, a task runs on a single node.
However, you can run it on a cluster of nodes by specifying `nodes`.
type: task
name: train-distrib
nodes: 2
python: 3.12
env:
  - NCCL_DEBUG=INFO
commands:
  - git clone https://github.com/pytorch/examples.git pytorch-examples
  - cd pytorch-examples/distributed/ddp-tutorial-series
  - uv pip install -r requirements.txt
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      --master-port=12345 \
      multinode.py 50 10
resources:
  gpu: 24GB:1..2
  shm_size: 24GB
Nodes can communicate using their private IP addresses. Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other system environment variables for inter-node communication.
`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed framework.
MPI
If you want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`.
See the NCCL or RCCL examples.
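As a rough sketch of how these pieces fit together (the `mpirun` flags and the NCCL tests binary are illustrative, based on the base image described below):

```yaml
type: task
name: mpi-nccl-test
nodes: 2
# Start worker jobs before the master and finish the run when the master job is done
startup_order: workers-first
stop_criteria: master-done
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      # The master launches MPI ranks on all nodes from the pre-populated hostfile
      mpirun --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      # Workers stay up so the master can reach them
      sleep infinity
    fi
resources:
  gpu: 24GB:2
  shm_size: 24GB
```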
For detailed examples, see distributed training examples.
Network interface
Distributed frameworks usually detect the correct network interface automatically, but sometimes you need to specify it explicitly.
For example, with PyTorch and the NCCL backend, you may need to add these commands to tell NCCL to use the private interface:
commands:
  - apt-get install -y iproute2
  - >
    if [[ $DSTACK_NODE_RANK == 0 ]]; then
      export NCCL_SOCKET_IFNAME=$(ip -4 -o addr show | fgrep $DSTACK_MASTER_NODE_IP | awk '{print $2}')
    else
      export NCCL_SOCKET_IFNAME=$(ip route get $DSTACK_MASTER_NODE_IP | sed -E 's/.*?dev (\S+) .*/\1/;t;d')
    fi
  # ... The rest of the commands
SSH
You can log in to any node from any node via SSH on port 10022 using the `~/.ssh/dstack_job` private key.
For convenience, `~/.ssh/config` is preconfigured with these options, so a simple `ssh <node_ip>` is enough.
For a list of node IPs, check the `DSTACK_NODES_IPS` environment variable.
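For example, the master job could run a quick check on every node over SSH (a minimal sketch; `nvidia-smi` is just a placeholder command):

```yaml
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      # DSTACK_NODES_IPS is newline-delimited; ~/.ssh/config makes a plain `ssh <ip>` work
      for ip in $DSTACK_NODES_IPS; do
        ssh $ip nvidia-smi
      done
    fi
```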
Fleets
Distributed tasks can only run on fleets with cluster placement. While `dstack` can provision such fleets automatically, it is recommended to create them via a fleet configuration to ensure the highest level of inter-node connectivity.
See the Clusters guide for more details on how to use `dstack` on clusters.
Resources¶
When you specify a resource value like `cpu` or `memory`, you can either use an exact value (e.g. `24GB`) or a range (e.g. `24GB..`, or `24GB..80GB`, or `..80GB`).
type: task
name: trl-sft
python: 3.12
env:
  - MODEL=Qwen/Qwen2.5-0.5B
  - DATASET=stanfordnlp/imdb
commands:
  - uv pip install trl
  - |
    trl sft \
      --model_name_or_path $MODEL --dataset_name $DATASET \
      --num_processes $DSTACK_GPUS_PER_NODE
resources:
  # 16 or more x86_64 cores
  cpu: 16..
  # 200GB or more RAM
  memory: 200GB..
  # 4 GPUs from 40GB to 80GB
  gpu: 40GB..80GB:4
  # Shared memory (required by multi-gpu)
  shm_size: 24GB
  # Disk size
  disk: 500GB
The `cpu` property lets you set the architecture (`x86` or `arm`) and core count, e.g. `x86:16` (16 x86 cores) or `arm:8..` (at least 8 ARM cores). If not set, `dstack` infers it from the GPU or defaults to `x86`.
The `gpu` property lets you specify vendor, model, memory, and count, e.g. `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either), `A100:80GB` (one 80GB A100), `A100:2` (two A100s), `24GB..40GB:2` (two GPUs with 24–40GB), or `A100:40GB:2` (two 40GB A100s). If the vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`.
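For instance, a spec can combine several of these elements (a sketch of the syntax described above):

```yaml
resources:
  # Two NVIDIA GPUs, either A10G or A100, each with 40GB to 80GB of memory
  gpu: A10G,A100:40GB..80GB:2
```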
If you're using TPUs, specify the TPU version and the number of cores via the `gpu` property (e.g. `v2-8`):
```yaml
type: task
name: train
python: 3.12
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py
resources:
  gpu: v2-8
```
Currently, only 8 TPU cores can be specified, supporting single TPU device workloads. Multi-TPU support is coming soon.
Shared memory
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure `shm_size`, e.g. set it to `24GB`.
If you're unsure which offers (hardware configurations) are available from the configured backends, use the `dstack offer` command to list them.
Docker¶
Default image¶
If you don't specify `image`, `dstack` uses its base Docker image, pre-configured with `uv`, `python`, `pip`, essential CUDA drivers, and NCCL tests (under `/opt/nccl-tests/build`).
Set the `python` property to pre-install a specific version of Python.
type: task
name: train
python: 3.12
env:
  - MODEL=Qwen/Qwen2.5-0.5B
  - DATASET=stanfordnlp/imdb
commands:
  - uv pip install trl
  - |
    trl sft \
      --model_name_or_path $MODEL --dataset_name $DATASET \
      --num_processes $DSTACK_GPUS_PER_NODE
resources:
  gpu: H100:1..2
  shm_size: 24GB
NVCC¶
By default, the base Docker image doesn't include `nvcc`, which is required for building custom CUDA kernels. If you need `nvcc`, set the `nvcc` property to `true`.
type: task
name: train
python: 3.12
nvcc: true
env:
  - MODEL=Qwen/Qwen2.5-0.5B
  - DATASET=stanfordnlp/imdb
commands:
  - uv pip install trl
  - uv pip install flash_attn --no-build-isolation
  - |
    trl sft \
      --model_name_or_path $MODEL --dataset_name $DATASET \
      --attn_implementation=flash_attention_2 \
      --num_processes $DSTACK_GPUS_PER_NODE
resources:
  gpu: H100:1
Custom image¶
If you want, you can specify your own Docker image via `image`.
type: task
name: trl-sft
image: huggingface/trl-latest-gpu
env:
  - MODEL=Qwen/Qwen2.5-0.5B
  - DATASET=stanfordnlp/imdb
# if shell is not specified, `sh` is used for custom images
shell: bash
commands:
  - source activate trl
  - |
    trl sft --model_name_or_path $MODEL \
      --dataset_name $DATASET \
      --output_dir /output \
      --torch_dtype bfloat16 \
      --use_peft true
resources:
  gpu: H100:1
Docker in Docker¶
Set `docker` to `true` to enable the `docker` CLI in your task, e.g., to run or build Docker images, or use Docker Compose.
type: task
name: docker-nvidia-smi
docker: true
commands:
  - docker run --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
resources:
  gpu: 1
Cannot be used with `python` or `image`. Not supported on `runpod`, `vastai`, or `kubernetes`.
Privileged mode¶
To enable privileged mode, set `privileged` to `true`. Not supported on `runpod`, `vastai`, or `kubernetes`.
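A minimal sketch (the command is only a placeholder to show the container runs with extended privileges):

```yaml
type: task
name: privileged-example
python: 3.12
privileged: true
commands:
  - ls -l /dev/fuse
```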
Private registry¶
Use the `registry_auth` property to provide credentials for a private Docker registry.
type: task
name: train
env:
  - NGC_API_KEY
image: nvcr.io/nvidia/pytorch:25.05-py3
registry_auth:
  username: $oauthtoken
  password: ${{ env.NGC_API_KEY }}
commands:
  - git clone https://github.com/pytorch/examples.git pytorch-examples
  - cd pytorch-examples/distributed/ddp-tutorial-series
  - pip install -r requirements.txt
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --nnodes=$DSTACK_NODES_NUM \
      multinode.py 50 10
resources:
  gpu: H100:1..2
  shm_size: 24GB
Environment variables¶
type: task
name: trl-sft
python: 3.12
env:
  - HF_TOKEN
  - HF_HUB_ENABLE_HF_TRANSFER=1
  - MODEL=Qwen/Qwen2.5-0.5B
  - DATASET=stanfordnlp/imdb
commands:
  - uv pip install trl
  - |
    trl sft \
      --model_name_or_path $MODEL --dataset_name $DATASET \
      --num_processes $DSTACK_GPUS_PER_NODE
resources:
  gpu: H100:1
If you don't assign a value to an environment variable (see `HF_TOKEN` above), `dstack` will require the value to be passed via the CLI or set in the current process.
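For example, you can set it in the current shell when applying the configuration (a sketch; the token value is a placeholder):

```shell
$ HF_TOKEN=<your HF token> dstack apply -f .dstack.yml
```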
System environment variables
The following environment variables are available in any run by default:
| Name | Description |
|---|---|
| DSTACK_RUN_NAME | The name of the run |
| DSTACK_REPO_ID | The ID of the repo |
| DSTACK_GPUS_NUM | The total number of GPUs in the run |
| DSTACK_NODES_NUM | The number of nodes in the run |
| DSTACK_GPUS_PER_NODE | The number of GPUs per node |
| DSTACK_NODE_RANK | The rank of the node |
| DSTACK_MASTER_NODE_IP | The internal IP address of the master node |
| DSTACK_NODES_IPS | The list of internal IP addresses of all nodes, delimited by "\n" |
| DSTACK_MPI_HOSTFILE | The path to a pre-populated MPI hostfile |
Retry policy¶
By default, if `dstack` can't find capacity, or the task exits with an error, or the instance is interrupted, the run will fail.
If you'd like `dstack` to automatically retry, configure the `retry` property accordingly:
type: task
name: train
python: 3.12
commands:
  - uv pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py
retry:
  on_events: [no-capacity, error, interruption]
  # Retry for up to 1 hour
  duration: 1h
If one job of a multi-node task fails with retry enabled, `dstack` will stop all the jobs and resubmit the run.
Priority¶
By default, submitted runs are scheduled in the order they were submitted.
When compute resources are limited, you may want to prioritize some runs over others. This can be done by specifying the `priority` property in the run configuration:
type: task
name: train
python: 3.12
commands:
  - uv pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py
priority: 50
`dstack` tries to provision runs with higher priority first. Note that if a high-priority run cannot be scheduled, it does not block lower-priority runs from being scheduled.
Utilization policy¶
Sometimes it's useful to track whether a task is fully utilizing all GPUs. While you can check this with `dstack metrics`, `dstack` also lets you set a policy to auto-terminate the run if any GPU is underutilized.
Below is an example of a task that auto-terminates if any GPU stays below 10% utilization for 1 hour:
type: task
name: train
python: 3.12
commands:
  - uv pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py
resources:
  gpu: H100:8
utilization_policy:
  min_gpu_utilization: 10
  time_window: 1h
Spot policy¶
By default, `dstack` uses on-demand instances. However, you can change that via the `spot_policy` property. It accepts `spot`, `on-demand`, and `auto`.
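For example, to let `dstack` use spot instances when available and fall back to on-demand otherwise (a minimal sketch reusing the training commands from the examples above):

```yaml
type: task
name: train
python: 3.12
commands:
  - uv pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py
spot_policy: auto
```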
Creation policy¶
By default, when you run `dstack apply` with a dev environment, task, or service, if no idle instances from the available fleets meet the requirements, `dstack` creates a new fleet using configured backends.
To ensure `dstack apply` doesn't create a new fleet but reuses an existing one, pass `-R` (or `--reuse`) to `dstack apply`.
$ dstack apply -R -f examples/.dstack.yml
Or, set `creation_policy` to `reuse` in the run configuration.
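For example (a minimal sketch):

```yaml
type: task
name: train
python: 3.12
commands:
  - python fine-tuning/qlora/train.py
creation_policy: reuse
```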
Idle duration¶
If a fleet is created automatically, it stays idle for 5 minutes by default and can be reused within that time. If the fleet is not reused within this period, it is automatically terminated.
To change the default idle duration, set `idle_duration` in the run configuration (e.g., `0s`, `1m`, or `off` for unlimited).
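For example, to terminate an automatically created fleet as soon as the run finishes (a minimal sketch):

```yaml
type: task
name: train
python: 3.12
commands:
  - python fine-tuning/qlora/train.py
# Don't keep the fleet idle after the run
idle_duration: 0s
```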
Fleets
For greater control over fleet provisioning, it is recommended to create fleets explicitly.
Reference
Tasks support many more configuration options, incl. `backends`, `regions`, `max_price`, and `max_duration`, among others.
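For instance, a task that restricts provisioning and caps cost and duration might look like this (a sketch; the property names come from the reference, the values are illustrative):

```yaml
type: task
name: train
python: 3.12
commands:
  - python fine-tuning/qlora/train.py
# Only consider these backends and regions
backends: [aws]
regions: [us-east-1, us-west-2]
# Don't pay more than $2 per hour, and stop the run after 24 hours
max_price: 2.0
max_duration: 24h
```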
Manage runs¶
`dstack` provides several commands to manage runs:
- `dstack ps` – Lists all running jobs and their statuses. Use `--watch` (or `-w`) to monitor the live status of runs.
- `dstack stop` – Stops a run gracefully. Pass `--abort` or `-x` to stop it immediately without waiting for a graceful shutdown. By default, a run runs until you stop it or its lifetime exceeds the value of `max_duration`.
- `dstack attach` – By default, `dstack apply` runs in attached mode, establishing an SSH tunnel to the run, forwarding ports, and displaying real-time logs. If you detach from a run, use this command to reattach.
- `dstack logs` – Displays run logs. Pass `--diagnose` or `-d` to view diagnostic logs, which can help troubleshoot failed runs.
What's next?
- Read about dev environments, services, and repos
- Learn how to manage fleets
- Check the Axolotl example