# Tasks
Tasks allow for convenient scheduling of various batch jobs, such as training, fine-tuning, or data processing, as well as running web applications.
You can run tasks on a single machine or on a cluster of nodes.
## Configuration
First, create a YAML file in your project folder. Its name must end with `.dstack.yml` (e.g. `.dstack.yml` or `train.dstack.yml` are both acceptable).
```yaml
type: task

# Specify the Python version, or your Docker image
python: "3.11"

# Specify environment variables
env:
  - HF_HUB_ENABLE_HF_TRANSFER=1

# The commands to run on start of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Specify GPU, disk, and other resource requirements
resources:
  gpu: 80GB
```
See the `.dstack.yml` reference for more details.
If you don't specify your Docker image, `dstack` uses the base image (pre-configured with Python, Conda, and essential CUDA drivers).
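If you'd rather bring your own Docker image, you can set `image` instead of `python`. A minimal sketch, assuming a custom image; the image name and command below are arbitrary examples, not from this guide:

```yaml
type: task

# Use a custom Docker image instead of the base image
image: nvcr.io/nvidia/pytorch:23.10-py3

commands:
  - python train.py
```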
## Environment variables
Environment variables can be set either within the configuration file or passed via the CLI.
```yaml
type: task

python: "3.11"

env:
  - HUGGING_FACE_HUB_TOKEN
  - HF_HUB_ENABLE_HF_TRANSFER=1

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

resources:
  gpu: 80GB
```
If you don't assign a value to an environment variable (see `HUGGING_FACE_HUB_TOKEN` above), `dstack` will require the value to be passed via the CLI or set in the current process.
For instance, you can define environment variables in a `.env` file and use tools like `direnv`.
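For example, the token could live in a `.env` file that `direnv` loads into the current process before you run the task. A sketch, assuming a standard `direnv` setup; the token value is a placeholder:

```shell
# .env — picked up via the `dotenv` directive in .envrc
HUGGING_FACE_HUB_TOKEN=hf_xxx   # placeholder, not a real token

# .envrc — processed by direnv (run `direnv allow` once to approve it)
dotenv
```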
## Ports
A task can configure ports. In this case, if the task is running an application on a port, `dstack run` will securely allow you to access this port from your local machine through port forwarding.
```yaml
type: task

python: "3.11"

env:
  - HF_HUB_ENABLE_HF_TRANSFER=1

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - tensorboard --logdir results/runs &
  - python fine-tuning/qlora/train.py

ports:
  - 6000

# (Optional) Configure `gpu`, `memory`, `disk`, etc.
resources:
  gpu: 80GB
```
When running it, `dstack run` forwards port `6000` to `localhost:6000`, enabling secure access.
Port mapping

By default, `dstack` uses the same ports on your local machine for port forwarding. However, you can override local ports using `--port`:
```shell
$ dstack run . -f train.dstack.yml --port 6000:6001
```
This will forward the task's port `6000` to `localhost:6001`.
## Nodes
By default, the task runs on a single node. However, you can run it on a cluster of nodes.
```yaml
type: task

# The size of the cluster
nodes: 2

python: "3.11"

env:
  - HF_HUB_ENABLE_HF_TRANSFER=1

commands:
  - pip install -r requirements.txt
  - torchrun
    --nproc_per_node=$DSTACK_GPUS_PER_NODE
    --node_rank=$DSTACK_NODE_RANK
    --nnodes=$DSTACK_NODES_NUM
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=8008 resnet_ddp.py
    --num_epochs 20

resources:
  gpu: 24GB
```
If you run the task, `dstack` first provisions the master node and then runs the other nodes of the cluster. All nodes are provisioned in the same region.
Backends
Running on multiple nodes is supported only with AWS, GCP, and Azure.
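The `DSTACK_*` variables in the `torchrun` command above are set by `dstack` on each node of the cluster. As a sketch of how a launcher script might consume them, here is a small helper that builds the same arguments; the single-node fallback defaults are our own assumption for local debugging, not part of `dstack`:

```python
import os

def torchrun_args() -> list[str]:
    """Build torchrun arguments from the DSTACK_* variables dstack sets on each node.

    Falls back to single-node defaults when the variables are absent
    (e.g. when debugging the script locally, outside of dstack).
    """
    return [
        f"--nproc_per_node={os.environ.get('DSTACK_GPUS_PER_NODE', '1')}",
        f"--node_rank={os.environ.get('DSTACK_NODE_RANK', '0')}",
        f"--nnodes={os.environ.get('DSTACK_NODES_NUM', '1')}",
        f"--master_addr={os.environ.get('DSTACK_MASTER_NODE_IP', '127.0.0.1')}",
        "--master_port=8008",
    ]
```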
## Args
You can parameterize tasks with user arguments using `${{ run.args }}` in the configuration.
```yaml
type: task

python: "3.11"

env:
  - HF_HUB_ENABLE_HF_TRANSFER=1

commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py ${{ run.args }}

resources:
  gpu: 80GB
```
Now, you can pass your arguments to the `dstack run` command:
```shell
$ dstack run . -f train.dstack.yml --train_batch_size=1 --num_train_epochs=100
```
The `dstack run` command will pass `--train_batch_size=1` and `--num_train_epochs=100` as arguments to `train.py`.
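For this to work, `train.py` has to accept those flags. A hypothetical sketch of the argument handling: the flag names come from the command above, while the defaults and description are illustrative, not from this guide:

```python
import argparse

def parse_args(argv=None):
    """Parse the hyperparameters forwarded by `dstack run` via ${{ run.args }}."""
    parser = argparse.ArgumentParser(description="QLoRA fine-tuning (sketch)")
    # Flag names match those passed on the dstack run command line
    parser.add_argument("--train_batch_size", type=int, default=4)
    parser.add_argument("--num_train_epochs", type=int, default=1)
    return parser.parse_args(argv)
```

Calling `parse_args(["--train_batch_size=1", "--num_train_epochs=100"])` mirrors what the task receives at runtime.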
Profiles

In case you'd like to reuse certain parameters (such as spot policy, retry and max duration, max price, regions, instance types, etc.) across runs, you can define them via `.dstack/profiles.yml`.
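A minimal sketch of what such a profile might look like; the field names and values below are our own assumption about the format, so check the `profiles.yml` reference for the exact set of supported options:

```yaml
profiles:
  - name: large
    spot_policy: auto   # e.g. prefer spot instances when available
    max_duration: 1d
    max_price: 2.0
    default: true
```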
## Running
To run a configuration, use the `dstack run` command followed by the working directory path, configuration file path, and other options.
```shell
$ dstack run . -f train.dstack.yml

 BACKEND     REGION         RESOURCES                     SPOT  PRICE
 tensordock  unitedkingdom  10xCPU, 80GB, 1xA100 (80GB)   no    $1.595
 azure       westus3        24xCPU, 220GB, 1xA100 (80GB)  no    $3.673
 azure       westus2        24xCPU, 220GB, 1xA100 (80GB)  no    $3.673

Continue? [y/n]: y

Provisioning...
---> 100%

Epoch 0: 100% 1719/1719 [00:18<00:00, 92.32it/s, loss=0.0981, acc=0.969]
Epoch 1: 100% 1719/1719 [00:18<00:00, 92.32it/s, loss=0.0981, acc=0.969]
Epoch 2: 100% 1719/1719 [00:18<00:00, 92.32it/s, loss=0.0981, acc=0.969]
```
When `dstack` submits the task, it uses the current folder contents.
.gitignore

If there are large files or folders you'd like to avoid uploading, you can list them in `.gitignore`.
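For instance, a `.gitignore` along these lines keeps typical bulky training artifacts out of the upload; the paths are examples, not from this guide:

```shell
# .gitignore
data/
checkpoints/
wandb/
*.ckpt
```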
The `dstack run` command allows specifying many things, including spot policy, retry and max duration, max price, regions, instance types, and much more.
## Managing runs
Stopping runs

Once the run exceeds the max duration, or when you use `dstack stop`, the task and its cloud resources are deleted.
Listing runs

The `dstack ps` command lists all active runs and their status.
## What's next?
- Check the QLoRA example
- Check the `.dstack.yml` reference for more details and examples
- Browse all examples