task

The task configuration type allows running tasks.

Configuration files must be inside the project repo, and their names must end with .dstack.yml (e.g. .dstack.yml or train.dstack.yml are both acceptable). Any configuration can be run via dstack apply.

Examples

Python version

If you don't specify image, dstack uses its base Docker image pre-configured with python, pip, conda (Miniforge), and essential CUDA drivers. The python property determines which default Docker image is used.

type: task
# The name is optional, if not specified, generated randomly
name: train    

# If `image` is not specified, dstack uses its base image
python: "3.10"

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

nvcc

By default, the base Docker image doesn’t include nvcc, which is required for building custom CUDA kernels. If you need nvcc, set the corresponding property to true.

type: task
# The name is optional, if not specified, generated randomly
name: train    

# If `image` is not specified, dstack uses its base image
python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention) 
nvcc: true

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

Ports

A task can configure ports. If the task runs an application on a port, dstack run securely forwards that port so that you can access it from your local machine.

type: task
# The name is optional, if not specified, generated randomly
name: train    

python: "3.10"

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - tensorboard --logdir results/runs --port 6000 &
  - python fine-tuning/qlora/train.py
# Expose the port to access TensorBoard
ports:
  - 6000

When you run it, dstack run forwards port 6000 to localhost:6000, enabling secure access.

Docker

If you want, you can specify your own Docker image via image.

type: task
# The name is optional, if not specified, generated randomly
name: train    

# Any custom Docker image
image: dstackai/base:py3.13-0.6-cuda-12.1

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

Private registry

Use the registry_auth property to provide credentials for a private Docker registry.

type: task
# The name is optional, if not specified, generated randomly
name: train

# Any private Docker image
image: dstackai/base:py3.13-0.6-cuda-12.1
# Credentials of the private Docker registry
registry_auth:
  username: peterschmidt85
  password: ghp_e49HcZ9oYwBzUbcSk2080gXZOU2hiT9AeSR5

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

Docker and Docker Compose

All backends except runpod, vastai, and kubernetes also allow using Docker and Docker Compose inside dstack runs.

Resources

If you specify memory size, you can either specify an explicit size (e.g. 24GB) or a range (e.g. 24GB.., or 24GB..80GB, or ..80GB).

type: task
# The name is optional, if not specified, generated randomly
name: train    

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

resources:
  # 200GB or more RAM
  memory: 200GB..
  # 4 GPUs from 40GB to 80GB
  gpu: 40GB..80GB:4
  # Shared memory (required by multi-gpu)
  shm_size: 16GB
  # Disk size
  disk: 500GB

The gpu property allows specifying not only memory size but also GPU vendor, names and their quantity. Examples: nvidia (one NVIDIA GPU), A100 (one A100), A10G,A100 (either A10G or A100), A100:80GB (one A100 of 80GB), A100:2 (two A100), 24GB..40GB:2 (two GPUs between 24GB and 40GB), A100:40GB:2 (two A100 GPUs of 40GB).
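
The same kind of requirement can also be written in the object form of gpu, using the fields listed in the reference (vendor, name, count, memory). The following is a minimal sketch, not a canonical configuration; the concrete values (two GPUs, A10G or A100, 24GB or more) are illustrative assumptions.

```yaml
type: task
name: train

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

resources:
  gpu:
    # Restrict the vendor
    vendor: nvidia
    # Either A10G or A100
    name: [A10G, A100]
    # Two GPUs
    count: 2
    # 24GB of memory or more (illustrative value)
    memory: 24GB..
```

The object form is more verbose than a string such as A10G,A100:2, but it keeps each constraint on its own line, which can be easier to review and change.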

Google Cloud TPU

To use TPUs, specify their architecture via the gpu property.

type: task
# The name is optional, if not specified, generated randomly
name: train

python: "3.10"

# Commands of the task
commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

resources:
  gpu: v2-8

Currently, only 8 TPU cores can be specified, supporting single host workloads. Multi-host support is coming soon.

Shared memory

If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure shm_size, e.g. set it to 16GB.

Environment variables

type: task

python: "3.10"

# Environment variables
env:
  - HF_TOKEN
  - HF_HUB_ENABLE_HF_TRANSFER=1

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

If you don't assign a value to an environment variable (see HF_TOKEN above), dstack will require the value to be passed via the CLI or set in the current process.

For instance, you can define environment variables in a .envrc file and utilize tools like direnv.
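
According to the reference, env accepts either a list or a mapping. As a sketch, the variable that has an explicit value above could be written in mapping form as follows. Quoting the value is a precaution so that YAML keeps it as a string; it is not a documented requirement. HF_TOKEN is left out here because the source does not say how a variable without a value is expressed in mapping form.

```yaml
type: task

python: "3.10"

# Environment variables in mapping form
env:
  HF_HUB_ENABLE_HF_TRANSFER: "1"

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py
```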

System environment variables

The following environment variables are available in any run and are passed by dstack by default:

| Name | Description |
|------|-------------|
| DSTACK_RUN_NAME | The name of the run |
| DSTACK_REPO_ID | The ID of the repo |
| DSTACK_GPUS_NUM | The total number of GPUs in the run |
| DSTACK_NODES_NUM | The number of nodes in the run |
| DSTACK_NODE_RANK | The rank of the node |
| DSTACK_MASTER_NODE_IP | The internal IP address of the master node |
| DSTACK_NODES_IPS | The list of internal IP addresses of all nodes, delimited by "\n" |
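
To see what these variables contain, a task can simply print them. The sketch below uses only variables listed above; the run name show-run-info is a hypothetical placeholder.

```yaml
type: task
name: show-run-info

commands:
  # Identify the run and this node's position in it
  - echo "run=$DSTACK_RUN_NAME node=$DSTACK_NODE_RANK/$DSTACK_NODES_NUM gpus=$DSTACK_GPUS_NUM"
  # One internal IP address per line
  - echo "$DSTACK_NODES_IPS"
```

Because the variables are expanded by the shell inside the container, the same references work in any command or script started by the task.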

Distributed tasks

By default, the task runs on a single node. However, you can run it on a cluster of nodes.

type: task
# The name is optional, if not specified, generated randomly
name: train-distrib

# The size of the cluster
nodes: 2

python: "3.10"

# Commands of the task
commands:
  - pip install -r requirements.txt
  - torchrun
    --nproc_per_node=$DSTACK_GPUS_PER_NODE
    --node_rank=$DSTACK_NODE_RANK
    --nnodes=$DSTACK_NODES_NUM
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=8008 resnet_ddp.py
    --num_epochs 20

resources:
  gpu: 24GB

If you run the task, dstack first provisions the master node and then runs the other nodes of the cluster.

Network

To ensure all nodes are provisioned into a cluster placement group and to enable the highest level of inter-node connectivity, it is recommended to manually create a fleet before running a task. This won’t be needed once this issue is fixed.

dstack is easy to use with accelerate, torchrun, and other distributed frameworks. All you need to do is pass the corresponding environment variables such as DSTACK_GPUS_PER_NODE, DSTACK_NODE_RANK, DSTACK_NODES_NUM, DSTACK_MASTER_NODE_IP, and DSTACK_GPUS_NUM (see System environment variables).

Backends

Running on multiple nodes is supported only with the aws, gcp, azure, oci backends, or SSH fleets.

Additionally, the aws backend supports Elastic Fabric Adapter. For a list of instance types with EFA support, see Fleets.

Web applications

Here's an example of using ports to run web apps with tasks.

type: task
# The name is optional, if not specified, generated randomly
name: streamlit-hello

python: "3.10"

# Commands of the task
commands:
  - pip3 install streamlit
  - streamlit hello
# Expose the port to access the web app
ports: 
  - 8501

Spot policy

You can choose whether to use spot instances, on-demand instances, or any available type.

type: task
# The name is optional, if not specified, generated randomly
name: train    

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Uncomment to leverage spot instances
#spot_policy: auto

The spot_policy accepts spot, on-demand, and auto. The default for tasks is on-demand.
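
As a sketch, spot_policy can be combined with the max_price property from the reference to cap the hourly instance price. The 1.5 value is an arbitrary assumption chosen for illustration.

```yaml
type: task
name: train

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Allow either spot or on-demand instances
spot_policy: auto
# Hypothetical cap on the instance price, in dollars per hour
max_price: 1.5
```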

Queueing tasks

By default, if dstack apply cannot find capacity, the task fails.

To queue the task and wait for capacity, specify the retry property:

type: task
# The name is optional, if not specified, generated randomly
name: train

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

retry:
  # Retry on no-capacity errors
  on_events: [no-capacity]
  # Retry within 1 day
  duration: 1d
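
The reference lists interruption and error as further supported events. A minimal sketch of a spot task that is also resubmitted after an interruption might look like this; the 4h duration is an illustrative value.

```yaml
type: task
name: train

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

spot_policy: auto

retry:
  # Retry on no-capacity errors and on interruptions
  on_events: [no-capacity, interruption]
  # Retry within 4 hours
  duration: 4h
```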

Backends

By default, dstack provisions instances in all configured backends. However, you can specify the list of backends:

type: task
# The name is optional, if not specified, generated randomly
name: train

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Use only listed backends
backends: [aws, gcp]

Regions

By default, dstack uses all configured regions. However, you can specify the list of regions:

type: task
# The name is optional, if not specified, generated randomly
name: train

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Use only listed regions
regions: [eu-west-1, eu-west-2]
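
These filters can be combined. The sketch below narrows provisioning to one backend, one region, and one instance type, using the instance_types property from the reference. The specific values are illustrative and assume the aws backend is configured.

```yaml
type: task
name: train

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Use only the listed backend, region, and instance type
backends: [aws]
regions: [eu-west-1]
instance_types: [p3.8xlarge]
```

The narrower the filters, the fewer offers dstack can choose from, so pairing tight filters with retry may be worth considering.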

Volumes

Volumes allow you to persist data between runs. To attach a volume, simply specify its name using the volumes property and specify where to mount its contents:

type: task
# The name is optional, if not specified, generated randomly
name: vscode    

python: "3.10"

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Map the name of the volume to any path
volumes:
  - name: my-new-volume
    path: /volume_data

Once you run this configuration, the volume will be mounted at /volume_data inside the task, and its contents will persist across runs.

Instance volumes

If data persistence is not a strict requirement, you can also use ephemeral instance volumes.
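
A minimal sketch of an instance volume, assuming the instance_path and path fields described in the reference. Both paths are hypothetical placeholders.

```yaml
type: task
name: train

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

volumes:
  # Directory on the instance (host); hypothetical path
  - instance_path: /mnt/dstack-cache
    # Where it appears inside the container; hypothetical path
    path: /cache
```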

Limitations

When you're running a dev environment, task, or service with dstack, it automatically mounts the project folder contents to /workflow (and sets that as the current working directory). Right now, dstack doesn't allow you to attach volumes to /workflow or any of its subdirectories.

The task configuration type supports many other options. See below.

Root reference

nodes - (Optional) Number of nodes. Defaults to 1.

name - (Optional) The run name.

image - (Optional) The name of the Docker image to run.

privileged - (Optional) Run the container in privileged mode.

entrypoint - (Optional) The Docker entrypoint.

working_dir - (Optional) The path to the working directory inside the container. It's specified relative to the repository directory (/workflow) and should be inside it. Defaults to ".".

home_dir - (Optional) The absolute path to the home directory inside the container. Defaults to /root.

registry_auth - (Optional) Credentials for pulling a private Docker image.

python - (Optional) The major version of Python. Mutually exclusive with image.

nvcc - (Optional) Use image with NVIDIA CUDA Compiler (NVCC) included. Mutually exclusive with image.

env - (Optional) The mapping or the list of environment variables.

setup - (Optional) The bash commands to run on boot.

resources - (Optional) The resource requirements to run the configuration.

volumes - (Optional) The volume mount points.

ports - (Optional) Port numbers/mapping to expose.

commands - (Optional) The bash commands to run.

backends - (Optional) The backends to consider for provisioning (e.g., [aws, gcp]).

regions - (Optional) The regions to consider for provisioning (e.g., [eu-west-1, us-west4, westeurope]).

instance_types - (Optional) The cloud-specific instance types to consider for provisioning (e.g., [p3.8xlarge, n1-standard-4]).

spot_policy - (Optional) The policy for provisioning spot or on-demand instances: spot, on-demand, or auto. Defaults to on-demand.

retry - (Optional) The policy for resubmitting the run. Defaults to false.

retry_policy - (Optional) The policy for resubmitting the run. Deprecated in favor of retry.

max_duration - (Optional) The maximum duration of a run (e.g., 2h, 1d, etc). After it elapses, the run is forced to stop. Defaults to off.

max_price - (Optional) The maximum instance price per hour, in dollars.

pool_name - (Optional) The name of the pool. If not set, dstack will use the default name.

instance_name - (Optional) The name of the instance.

creation_policy - (Optional) The policy for using instances from the pool. Defaults to reuse-or-create.

termination_policy - (Optional) The policy for instance termination. Defaults to destroy-after-idle.

termination_idle_time - (Optional) Time to wait before destroying the idle instance. Defaults to 5m for dstack run and to 3d for dstack pool add.
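
As an illustration of how some of these root properties fit together, the sketch below limits the run time and shortens the idle timeout. The values 2h and 10m are assumptions chosen to match the duration formats shown above.

```yaml
type: task
name: train

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Force the run to stop after two hours
max_duration: 2h
# Destroy the instance after it has been idle for ten minutes
termination_idle_time: 10m
```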

retry

on_events - The list of events that should be handled with retry. Supported events are no-capacity, interruption, and error.

duration - (Optional) The maximum period of retrying the run, e.g., 4h or 1d.

resources

cpu - (Optional) The number of CPU cores. Defaults to 2.. (two or more).

memory - (Optional) The RAM size (e.g., 8GB). Defaults to 8GB.. (8GB or more).

shm_size - (Optional) The size of shared memory (e.g., 8GB). If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure this.

gpu - (Optional) The GPU requirements. Can be set to a number, a string (e.g. A100, 80GB:2, etc.), or an object.

disk - (Optional) The disk resources.

resources.gpu

vendor - (Optional) The vendor of the GPU/accelerator, one of: nvidia, amd, google (alias: tpu).

name - (Optional) The GPU name or list of names.

count - (Optional) The number of GPUs. Defaults to 1.

memory - (Optional) The memory size of each GPU (e.g., 16GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).

total_memory - (Optional) The total memory size across all GPUs (e.g., 32GB). Can be set to a range (e.g. 16GB.., or 16GB..80GB).

compute_capability - (Optional) The minimum compute capability of the GPU (e.g., 7.5).

resources.disk

size - The disk size. Can be a string (e.g., 100GB or 100GB..) or an object.
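
Putting the resource properties together, a sketch that uses open-ended ranges and the object form of disk could look like this. All sizes are illustrative assumptions.

```yaml
type: task
name: train

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

resources:
  # 8 CPU cores or more
  cpu: 8..
  # 32GB of RAM or more
  memory: 32GB..
  # Object form of the disk requirement: 200GB or more
  disk:
    size: 200GB..
```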

registry_auth

username - The username.

password - The password or access token.

volumes[n]

For network volumes:

name - The network volume name or the list of network volume names to mount. If a list is specified, one of the volumes in the list will be mounted. Specify volumes from different backends/regions to increase availability.

path - The absolute container path to mount the volume at.

For instance volumes:

instance_path - The absolute path on the instance (host).

path - The absolute path in the container.

Short syntax

The short syntax for volumes is a colon-separated string in the form of source:destination:

  • volume-name:/container/path for network volumes
  • /instance/path:/container/path for instance volumes
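
Both short forms can appear in the same volumes list. In the sketch below, my-new-volume is the volume name used earlier on this page, while /mnt/dstack-cache and /cache are hypothetical paths.

```yaml
type: task
name: train

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

volumes:
  # Network volume: volume-name:/container/path
  - my-new-volume:/volume_data
  # Instance volume: /instance/path:/container/path
  - /mnt/dstack-cache:/cache
```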