# TRL
This example walks you through how to use TRL with dstack to fine-tune Llama-3.1-8B, using SFT with QLoRA.
## Define a configuration
Below is a task configuration that performs the fine-tuning.
```yaml
type: task
name: trl-train
python: "3.10"
# Ensure nvcc is installed (required for Flash Attention)
nvcc: true
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - HUB_MODEL_ID
commands:
  - pip install "transformers>=4.43.2"
  - pip install bitsandbytes
  - pip install flash-attn --no-build-isolation
  - pip install peft
  - pip install wandb
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - |
    accelerate launch \
      --config_file=examples/accelerate_configs/multi_gpu.yaml \
      --num_processes $DSTACK_GPUS_PER_NODE \
      examples/scripts/sft.py \
      --model_name meta-llama/Meta-Llama-3.1-8B \
      --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
      --dataset_text_field="text" \
      --per_device_train_batch_size 1 \
      --per_device_eval_batch_size 1 \
      --gradient_accumulation_steps 4 \
      --learning_rate 2e-4 \
      --report_to wandb \
      --bf16 \
      --max_seq_length 1024 \
      --lora_r 16 --lora_alpha 32 \
      --lora_target_modules q_proj k_proj v_proj o_proj \
      --load_in_4bit \
      --use_peft \
      --attn_implementation "flash_attention_2" \
      --logging_steps=10 \
      --output_dir models/llama31 \
      --hub_model_id $HUB_MODEL_ID
resources:
  gpu:
    # 24GB or more VRAM
    memory: 24GB..
    # One or more GPUs
    count: 1..
  # Shared memory (required for multi-GPU)
  shm_size: 24GB
```
Change the `resources` property to request more GPUs, as shown in the sketch below.
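For example, here is a minimal sketch of a `resources` block that pins the run to four GPUs (the memory and count values below are illustrative assumptions, not requirements):

```yaml
resources:
  gpu:
    # 24GB or more VRAM per GPU
    memory: 24GB..
    # Exactly four GPUs
    count: 4
  # Shared memory (required for multi-GPU)
  shm_size: 24GB
```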
**AMD**

The example above uses NVIDIA accelerators. To run it on AMD accelerators, check out the AMD example.
**DeepSpeed**

For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3. To do this, use the `examples/accelerate_configs/deepspeed_zero3.yaml` configuration file instead of `examples/accelerate_configs/multi_gpu.yaml`.
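For illustration, here is a sketch of how the launch command in the task above would change; only the `--config_file` value differs, and the training flags stay the same:

```shell
accelerate launch \
  --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
  --num_processes $DSTACK_GPUS_PER_NODE \
  examples/scripts/sft.py
  # ...followed by the same training flags as in the task above
```

Note that DeepSpeed itself must be installed in the environment (e.g., via `pip install deepspeed`).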
## Run the configuration
Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the cloud resources and run the configuration.
```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ HUB_MODEL_ID=...
$ dstack apply -f examples/single-node-training/trl/train.dstack.yml

 #  BACKEND              RESOURCES                     INSTANCE TYPE  PRICE
 1  vastai (cz-czechia)  cpu=64 mem=128GB H100:80GB:2  18794506       $3.8907
 2  vastai (us-texas)    cpu=52 mem=64GB H100:80GB:2   20442365       $3.6926
 3  vastai (fr-france)   cpu=64 mem=96GB H100:80GB:2   20379984       $3.7389

Submit the run trl-train? [y/n]:

Provisioning...
---> 100%
```
## Source code
The source code of this example can be found in `examples/llms/llama31` and `examples/single-node-training/trl`.
## What's next?
- Browse the TRL distributed training example
- Check dev environments, tasks, services, and fleets
- See the AMD example