Llama 3.1¶
This example walks you through how to deploy and fine-tune Llama 3.1 with dstack.
Prerequisites¶
Once dstack is installed, go ahead and clone the repo, and run dstack init.
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
Deployment¶
You can use any serving framework. Below are examples of services that deploy Llama 3.1 8B using vLLM, TGI, and NIM.
vLLM:

type: service
name: llama31

python: "3.11"
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

# Uncomment to cache downloaded models
#volumes:
#  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub

resources:
  gpu: 24GB
  # Uncomment if using multiple GPUs
  #shm_size: 24GB
TGI:

type: service
name: llama31

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher
port: 80
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

# Uncomment to cache downloaded models
#volumes:
#  - /data:/data

resources:
  gpu: 24GB
  # Uncomment if using multiple GPUs
  #shm_size: 24GB
NIM:

type: service
name: llama31

image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
env:
  - NGC_API_KEY
  - NIM_MAX_MODEL_LEN=4096
registry_auth:
  username: $oauthtoken
  password: ${{ env.NGC_API_KEY }}
port: 8000
# Register the model
model: meta/llama-3.1-8b-instruct

# Uncomment to leverage spot instances
#spot_policy: auto

# Cache downloaded models
volumes:
  - /root/.cache/nim:/opt/nim/.cache

resources:
  gpu: 24GB
  # Uncomment if using multiple GPUs
  #shm_size: 24GB
Note that when using Llama 3.1 8B on a 24GB GPU, we must limit the context size to 4096 tokens to fit into memory.
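To see where that limit comes from, here's a back-of-the-envelope sketch (an illustration, assuming Llama 3.1 8B's published architecture: 32 layers, 8 KV heads, head dimension 128, and an FP16 KV cache). The FP16 weights alone take roughly 16GB of the 24GB, and without a cap the server would try to reserve KV cache for the model's full 128K context.

# Rough KV-cache estimate for Llama 3.1 8B -- an illustration, not exact.
# Assumed architecture: 32 layers, 8 KV heads (GQA), head dim 128, FP16 cache.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2

def kv_cache_gib(context_tokens: int) -> float:
    """Approximate KV-cache size (GiB) for a single sequence."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # keys + values
    return context_tokens * per_token / 1024**3

print(f"{kv_cache_gib(4096):.1f} GiB")    # ~0.5 GiB at a 4096-token context
print(f"{kv_cache_gib(131072):.1f} GiB")  # ~16 GiB at the full 128K context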
Memory requirements¶
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
| Model size | FP16  | FP8   | INT4  |
|------------|-------|-------|-------|
| 8B         | 16GB  | 8GB   | 4GB   |
| 70B        | 140GB | 70GB  | 35GB  |
| 405B       | 810GB | 405GB | 203GB |
For example, the FP16 version of Llama 3.1 405B needs about 810GB, so it won't fit into a single machine with eight 80GB GPUs (640GB in total), and we'd need at least two nodes.
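These figures follow from multiplying the parameter count by the bytes per parameter for each precision; here is a minimal sketch of the arithmetic:

# Approximate weight memory = parameter count x bytes per parameter.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    # billions of parameters x bytes per parameter ~= GB
    return params_billions * BYTES_PER_PARAM[precision]

for size in (8, 70, 405):
    print(size, {p: weight_memory_gb(size, p) for p in BYTES_PER_PARAM})

# Eight 80GB GPUs give 640GB in total -- less than the ~810GB needed
# for the FP16 weights of the 405B model, hence at least two nodes.
print(8 * 80, "<", weight_memory_gb(405, "FP16"))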
Quantization¶
The INT4 version of Llama 3.1 70B can fit into two 40GB GPUs.
The INT4 version of Llama 3.1 405B can fit into eight 40GB GPUs.
Useful links:
- Meta's official FP8 quantized version of Llama 3.1 405B (with minimal accuracy degradation)
- Llama 3.1 Quantized Models with quantized checkpoints
Running a configuration¶
To run a configuration, use the dstack apply command.
$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama31/vllm/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12
2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12
3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23
Submit the run llama31? [y/n]: y
Provisioning...
---> 100%
Once the service is up, the model will be available via the OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`:
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-X POST \
-H 'Authorization: Bearer <dstack token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3.1",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
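If you prefer Python, below is a minimal sketch of the same request using the openai client, assuming the same local dstack server (http://127.0.0.1:3000, project main) and the same dstack token as in the curl call above.

from openai import OpenAI

# Point the client at dstack's OpenAI-compatible proxy endpoint
client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/models/main",
    api_key="<dstack token>",  # same token as in the curl example above
)

completion = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
print(completion.choices[0].message.content)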
When a gateway is configured, the OpenAI-compatible endpoint is available at `https://gateway.<gateway domain>/`.
Fine-tuning¶
Running on multiple GPUs¶
Below is the task configuration file for fine-tuning Llama 3.1 8B using TRL on the OpenAssistant/oasst_top1_2023-08-25 dataset.
type: task
name: trl-train

python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention)
nvcc: true
env:
  - HF_TOKEN
  - WANDB_API_KEY
commands:
  - pip install "transformers>=4.43.2"
  - pip install bitsandbytes
  - pip install flash-attn --no-build-isolation
  - pip install peft
  - pip install wandb
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - accelerate launch
    --config_file=examples/accelerate_configs/multi_gpu.yaml
    --num_processes $DSTACK_GPUS_PER_NODE
    examples/scripts/sft.py
    --model_name meta-llama/Meta-Llama-3.1-8B
    --dataset_name OpenAssistant/oasst_top1_2023-08-25
    --dataset_text_field="text"
    --per_device_train_batch_size 1
    --per_device_eval_batch_size 1
    --gradient_accumulation_steps 4
    --learning_rate 2e-4
    --report_to wandb
    --bf16
    --max_seq_length 1024
    --lora_r 16 --lora_alpha 32
    --lora_target_modules q_proj k_proj v_proj o_proj
    --load_in_4bit
    --use_peft
    --attn_implementation "flash_attention_2"
    --logging_steps=10
    --output_dir models/llama31
    --hub_model_id peterschmidt85/FineLlama-3.1-8B

resources:
  gpu:
    # 24GB or more vRAM
    memory: 24GB..
    # One or more GPU
    count: 1..
  # Shared memory (for multi-gpu)
  shm_size: 24GB
Change the resources property to specify more GPUs.
Memory requirements¶
Below are the approximate memory requirements for fine-tuning Llama 3.1.
| Model size | Full fine-tuning | LoRA  | QLoRA |
|------------|------------------|-------|-------|
| 8B         | 60GB             | 16GB  | 6GB   |
| 70B        | 500GB            | 160GB | 48GB  |
| 405B       | 3.25TB           | 950GB | 250GB |
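These figures roughly track how many bytes must be held per model parameter. The accounting in the sketch below is an assumption for illustration (bf16 weights, gradients, and Adam states for full fine-tuning; bf16 frozen weights plus small adapters for LoRA; 4-bit frozen weights for QLoRA), not an exact model.

# Rough per-parameter accounting behind the table above (illustrative assumption):
#   full fine-tuning: bf16 weights (2) + bf16 grads (2) + Adam m, v (4) ~= 8 bytes
#   LoRA:             bf16 frozen weights (2) + small adapter overhead  ~= 2 bytes
#   QLoRA:            4-bit frozen weights (0.5) + adapter overhead     ~= 0.6 bytes
BYTES_PER_PARAM = {"full": 8.0, "LoRA": 2.0, "QLoRA": 0.6}

def finetune_memory_gb(params_billions: float, method: str) -> float:
    return params_billions * BYTES_PER_PARAM[method]

for size in (8, 70, 405):
    print(size, {m: round(finetune_memory_gb(size, m)) for m in BYTES_PER_PARAM})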
The requirements can be significantly reduced with certain optimizations.
DeepSpeed¶
For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3.
To do this, use the examples/accelerate_configs/deepspeed_zero3.yaml configuration file instead of examples/accelerate_configs/multi_gpu.yaml.
Running on multiple nodes¶
In case the model doesn't fit into a single GPU, consider running a dstack task on multiple nodes.
Below is the corresponding task configuration file.
type: task
name: trl-train-distrib

# Size of the cluster
nodes: 2

python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention)
nvcc: true
env:
  - HF_TOKEN
  - WANDB_API_KEY
commands:
  - pip install "transformers>=4.43.2"
  - pip install bitsandbytes
  - pip install flash-attn --no-build-isolation
  - pip install peft
  - pip install wandb
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - accelerate launch
    --config_file=examples/accelerate_configs/fsdp_qlora.yaml
    --main_process_ip=$DSTACK_MASTER_NODE_IP
    --main_process_port=8008
    --machine_rank=$DSTACK_NODE_RANK
    --num_processes=$DSTACK_GPUS_NUM
    --num_machines=$DSTACK_NODES_NUM
    examples/scripts/sft.py
    --model_name meta-llama/Meta-Llama-3.1-8B
    --dataset_name OpenAssistant/oasst_top1_2023-08-25
    --dataset_text_field="text"
    --per_device_train_batch_size 1
    --per_device_eval_batch_size 1
    --gradient_accumulation_steps 4
    --learning_rate 2e-4
    --report_to wandb
    --bf16
    --max_seq_length 1024
    --lora_r 16 --lora_alpha 32
    --lora_target_modules q_proj k_proj v_proj o_proj
    --load_in_4bit
    --use_peft
    --attn_implementation "flash_attention_2"
    --logging_steps=10
    --output_dir models/llama31
    --hub_model_id peterschmidt85/FineLlama-3.1-8B
    --torch_dtype bfloat16
    --use_bnb_nested_quant

resources:
  gpu:
    # 24GB or more vRAM
    memory: 24GB..
    # One or more GPU
    count: 1..
  # Shared memory (for multi-gpu)
  shm_size: 24GB
Source code¶
The source code for this example can be found in examples/llms/llama31 and examples/fine-tuning/trl.
What's next?¶
- Check dev environments, tasks, services, and protips.
- Browse Llama 3.1 on HuggingFace, HuggingFace's Llama recipes, Meta's Llama recipes, and Llama Agentic System.