Ray + RAGEN¶
This example shows how to use dstack and RAGEN
to fine-tune an agent on multiple nodes.
Under the hood, RAGEN uses verl for reinforcement learning and Ray for distributed training.
Prerequisites
Before running a distributed task, make sure to create a fleet with placement set to cluster (can be a managed fleet or an SSH fleet).
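For reference, a minimal cloud fleet configuration with cluster placement might look roughly like the sketch below; the fleet name, node count, and GPU spec are illustrative assumptions, not requirements of this example.
type: fleet
# Illustrative name; choose your own
name: ragen-fleet
# Required so instances are interconnected for multi-node tasks
placement: cluster
nodes: 2
resources:
  gpu: 80GB:8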
Run a Ray cluster¶
To use Ray with dstack, you first need to run a Ray cluster.
The task below runs a Ray cluster on an existing fleet:
type: task
name: ray-ragen-cluster
nodes: 2

env:
  - WANDB_API_KEY
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
commands:
  - wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - bash miniconda.sh -b -p /workflow/miniconda
  - eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
  - git clone https://github.com/RAGEN-AI/RAGEN.git
  - cd RAGEN
  - bash scripts/setup_ragen.sh
  - conda activate ragen
  - cd verl
  - pip install --no-deps -e .
  - pip install hf_transfer hf_xet
  - pip uninstall -y ray
  - pip install -U "ray[default]"
  - |
    if [ $DSTACK_NODE_RANK = 0 ]; then
      ray start --head --port=6379;
    else
      ray start --address=$DSTACK_MASTER_NODE_IP:6379
    fi

# Expose Ray dashboard port
ports:
  - 8265

resources:
  gpu: 80GB:8
  shm_size: 128GB

# Save checkpoints on the instance
volumes:
  - /checkpoints:/checkpoints
We use verl's Docker image for vLLM with FSDP. See verl's Installation guide for more details.
The RAGEN setup script scripts/setup_ragen.sh isolates dependencies within a Conda environment.
Note that the Ray installation in the RAGEN environment does not include the dashboard, so we reinstall Ray with ray[default].
Now, if you run this task via dstack apply, it will automatically forward Ray's dashboard port to localhost:8265.
$ dstack apply -f examples/distributed-training/ray-ragen/.dstack.yml
As long as dstack apply is attached, you can use localhost:8265 to submit Ray jobs for execution.
If dstack apply is detached, you can use dstack attach to re-attach to the run.
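For example, assuming the run keeps the name defined in the task above, re-attaching (and restoring the dashboard port forward) looks like this:
$ dstack attach ray-ragen-cluster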
Submit Ray jobs¶
Before you can submit Ray jobs, make sure Ray is installed locally:
$ pip install ray
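As an optional sanity check, you can list jobs on the cluster through the forwarded dashboard port before submitting anything; an empty list confirms the cluster is reachable:
$ ray job list --address http://localhost:8265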
Now you can submit the training job to the Ray cluster, which is available at localhost:8265:
$ export RAY_ADDRESS=http://localhost:8265
$ ray job submit \
-- bash -c "\
export PYTHONPATH=/workflow/RAGEN; \
cd /workflow/RAGEN; \
/workflow/miniconda/envs/ragen/bin/python train.py \
--config-name base \
system.CUDA_VISIBLE_DEVICES=[0,1,2,3,4,5,6,7] \
model_path=Qwen/Qwen2.5-7B-Instruct \
trainer.experiment_name=agent-fine-tuning-Qwen2.5-7B \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
micro_batch_size_per_gpu=2 \
trainer.default_local_dir=/checkpoints \
trainer.save_freq=50 \
actor_rollout_ref.rollout.tp_size_check=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=4"
Training parameters
- `actor_rollout_ref.rollout.tensor_model_parallel_size=4`, because `Qwen/Qwen2.5-7B-Instruct` has 28 attention heads and the number of attention heads must be divisible by `tensor_model_parallel_size`.
- `actor_rollout_ref.rollout.tp_size_check=False`, because if it is `True`, `tensor_model_parallel_size` must be equal to `trainer.n_gpus_per_node`.
- `micro_batch_size_per_gpu=2`, to keep the RAGEN paper's `rollout_filter_ratio` and `es_manager` settings unchanged for a world size of 16.
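Once the job is submitted, ray job submit prints a submission ID that you can use to follow logs or stop the run; the ID below is a placeholder, and RAY_ADDRESS is assumed to still point at the forwarded dashboard:
$ ray job logs raysubmit_abc123 --follow
$ ray job stop raysubmit_abc123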
Using Ray via dstack is a powerful way to get access to the rich Ray ecosystem while benefiting from dstack's provisioning capabilities.
What's next
- Check the Clusters guide
- Read about distributed tasks and fleets
- Browse Ray's docs for other examples.