Ray + RAGEN¶
This example shows how to use dstack and RAGEN to fine-tune an agent on multiple nodes.
Under the hood, RAGEN uses verl for reinforcement learning and Ray for distributed training.
Create a fleet¶
Before submitting distributed training runs, make sure to create a fleet with placement set to cluster.
For more details on how to use clusters with dstack, check the Clusters guide.
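A minimal fleet configuration may look like the following sketch; the fleet name, node count, and GPU resources here are illustrative and should match the task below:

type: fleet
name: ray-ragen-fleet

nodes: 2
placement: cluster

resources:
  gpu: 80GB:8

Apply it with dstack apply before submitting any runs.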
Run a Ray cluster¶
If you want to use Ray with dstack, you first have to run a Ray cluster.
The task below runs a Ray cluster on an existing fleet:
type: task
name: ray-ragen-cluster

nodes: 2

env:
  - WANDB_API_KEY
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
commands:
  - wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - bash miniconda.sh -b -p /workflow/miniconda
  - eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
  - git clone https://github.com/RAGEN-AI/RAGEN.git
  - cd RAGEN
  - bash scripts/setup_ragen.sh
  - conda activate ragen
  - cd verl
  - pip install --no-deps -e .
  - pip install hf_transfer hf_xet
  - pip uninstall -y ray
  - pip install -U "ray[default]"
  # Start the Ray head on the master node and a worker on every other node
  - |
    if [ $DSTACK_NODE_RANK = 0 ]; then
      ray start --head --port=6379
    else
      ray start --address=$DSTACK_MASTER_NODE_IP:6379
    fi

# Expose Ray dashboard port
ports:
  - 8265

resources:
  gpu: 80GB:8
  shm_size: 128GB

# Save checkpoints on the instance
volumes:
  - /checkpoints:/checkpoints
We are using verl's Docker image for vLLM with FSDP. See Installation for more details.
The RAGEN setup script scripts/setup_ragen.sh isolates dependencies within a Conda environment.
Note that the Ray setup in the RAGEN environment is missing the dashboard, so we reinstall it using ray[default].
Now, if you run this task via dstack apply, it will automatically forward the Ray dashboard port to localhost:8265.
$ dstack apply -f examples/distributed-training/ray-ragen/.dstack.yml
As long as dstack apply is attached, you can use localhost:8265 to submit Ray jobs for execution.
If dstack apply is detached, you can use dstack attach to re-attach.
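For example, to re-attach to the cluster run and restore the port forwarding (the run name comes from the task configuration above):

$ dstack attach ray-ragen-cluster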
Submit Ray jobs¶
Before you can submit Ray jobs, make sure to install ray locally:
$ pip install ray
Now you can submit the training job to the Ray cluster which is available at localhost:8265
:
$ export RAY_ADDRESS=http://localhost:8265
$ ray job submit \
-- bash -c "\
export PYTHONPATH=/workflow/RAGEN; \
cd /workflow/RAGEN; \
/workflow/miniconda/envs/ragen/bin/python train.py \
--config-name base \
system.CUDA_VISIBLE_DEVICES=[0,1,2,3,4,5,6,7] \
model_path=Qwen/Qwen2.5-7B-Instruct \
trainer.experiment_name=agent-fine-tuning-Qwen2.5-7B \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
micro_batch_size_per_gpu=2 \
trainer.default_local_dir=/checkpoints \
trainer.save_freq=50 \
actor_rollout_ref.rollout.tp_size_check=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=4"
Training parameters

- actor_rollout_ref.rollout.tensor_model_parallel_size=4, because Qwen/Qwen2.5-7B-Instruct has 28 attention heads, and the number of attention heads should be divisible by tensor_model_parallel_size
- actor_rollout_ref.rollout.tp_size_check=False; if True, tensor_model_parallel_size should be equal to trainer.n_gpus_per_node
- micro_batch_size_per_gpu=2, to keep the RAGEN paper's rollout_filter_ratio and es_manager settings as they are for a world size of 16
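Once the job is submitted, you can check on it with Ray's job CLI, which talks to the same localhost:8265 address. The job ID below is a placeholder; use the one printed by ray job submit:

$ ray job status raysubmit_abc123
$ ray job logs raysubmit_abc123 --follow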
Using Ray via dstack is a powerful way to get access to the rich Ray ecosystem while benefiting from dstack's provisioning capabilities.
What's next
- Check the Clusters guide.
- Read about distributed tasks and fleets.
- Browse Ray's docs for other examples.