Llama 3.1
This example walks you through how to deploy and fine-tune Llama 3.1 with dstack.
Prerequisites
Once dstack is installed, clone the repo and run dstack init.
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
Deployment
Running as a task
If you'd like to run Llama 3.1 for development purposes, consider using dstack tasks.
You can use any serving framework, such as vLLM, TGI, or Ollama.
Below are the task configuration files for vLLM, TGI, and Ollama, in that order.
type: task
name: llama31-task-vllm
python: "3.10"
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --tensor-parallel-size $DSTACK_GPUS_NUM
    --max-model-len $MAX_MODEL_LEN
ports: [8000]
# Use either spot or on-demand instances
spot_policy: auto
resources:
  # Required resources
  gpu: 24GB
  # Shared memory (required by multi-gpu)
  shm_size: 24GB
type: task
name: llama31-task-tgi
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher
ports: [80]
# Use either spot or on-demand instances
spot_policy: auto
resources:
  # Required resources
  gpu: 24GB
  # Shared memory (required by multi-gpu)
  shm_size: 24GB
type: task
name: llama31-task-ollama
image: ollama/ollama
commands:
  - ollama serve &
  - sleep 3
  - ollama pull llama3.1
  - fg
ports: [11434]
# Use either spot or on-demand instances
spot_policy: auto
# Required resources
resources:
  gpu: 24GB
Note that when running Llama 3.1 8B on a 24GB GPU, the context size must be limited to 4096 tokens to fit into memory.
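To see why the context has to be capped, here is a rough back-of-the-envelope sketch of the KV cache size for a single sequence. The architecture constants (32 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama 3.1 8B model config, and the estimate ignores activations, framework overhead, and how a given serving framework preallocates memory.

```python
# Rough KV cache estimate for one sequence (a sketch, not a profiler).
# Assumed Llama 3.1 8B architecture values from the published model config.
NUM_LAYERS = 32
NUM_KV_HEADS = 8      # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # FP16 / BF16

GIB = 1024 ** 3


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 2x: K and V
    return per_token * num_tokens


for context in (4096, 131072):
    print(f"{context} tokens: {kv_cache_bytes(context) / GIB:.1f} GiB")
```

Under these assumptions, a 4096-token context needs about 0.5 GiB of cache, while the model's full 128K-token window would need about 16 GiB. With roughly 16GB already taken by the FP16 weights, the full window cannot fit on a 24GB GPU.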
Deploying as a service
If you'd like to deploy Llama 3.1 as a public, auto-scalable, and secure endpoint, consider using dstack services.
Memory requirements
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
Model size | FP16 | FP8 | INT4 |
---|---|---|---|
8B | 16GB | 8GB | 4GB |
70B | 140GB | 70GB | 35GB |
405B | 810GB | 405GB | 203GB |
For example, the FP16 version of Llama 3.1 405B won't fit into a single machine with eight 80GB GPUs, so we'd need at least two nodes.
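The figures in the table follow from multiplying the parameter count by the bytes used per parameter. The sketch below reproduces that arithmetic and the two-node conclusion; the helper names are hypothetical, and the estimate covers weights only, as the table does.

```python
import math

# Bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate memory (GB) needed just to load the weights."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_nodes(params_billion: float, precision: str,
              gpus_per_node: int = 8, gpu_gb: int = 80) -> int:
    """Lower bound on nodes needed to hold the weights across all GPUs."""
    node_gb = gpus_per_node * gpu_gb
    return math.ceil(weights_gb(params_billion, precision) / node_gb)


for size in (8, 70, 405):
    row = ", ".join(f"{p}: {weights_gb(size, p):g}GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")

# 405B in FP16 needs 810GB, more than the 640GB of one 8x80GB machine.
print("Nodes for 405B FP16:", min_nodes(405, "FP16"))
```

Because context memory and CUDA reservations are excluded, treat the result as a lower bound rather than a sizing guarantee.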
Quantization
The INT4 version of Llama 3.1 70B can fit into two 40GB GPUs.
The INT4 version of Llama 3.1 405B can fit into eight 40GB GPUs.
Useful links:
- Meta's official FP8 quantized version of Llama 3.1 405B (with minimal accuracy degradation)
- Llama 3.1 Quantized Models with quantized checkpoints
Running a configuration
To run a configuration, use the dstack apply command.
$ export HUGGING_FACE_HUB_TOKEN=...
$ dstack apply -f examples/llms/llama31/vllm/task.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12
2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12
3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23
Submit the run llama31-task-vllm? [y/n]: y
Provisioning...
---> 100%
If you run a task, dstack apply automatically forwards the remote ports to localhost for convenient access.
$ curl 127.0.0.1:8000/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"max_tokens": 128
}'
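The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, the base URL assumes the task's port is forwarded to the same local port, and the response parsing assumes the OpenAI-compatible response shape that vLLM serves.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str,
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://127.0.0.1:8000",  # adjust to the local port that dstack apply forwards
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "What is Deep Learning?",
)
# Uncomment once the task is running and the port is forwarded:
# print(chat(request))
```

Building the request separately from sending it makes it easy to inspect the payload before the endpoint is up.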
Fine-tuning
Running on multiple GPUs
Below is the task configuration file for fine-tuning Llama 3.1 8B using TRL on the OpenAssistant/oasst_top1_2023-08-25 dataset.
type: task
name: trl-train
python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention)
nvcc: true
env:
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  - pip install "transformers>=4.43.2"
  - pip install bitsandbytes
  - pip install flash-attn --no-build-isolation
  - pip install peft
  - pip install wandb
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - accelerate launch
    --config_file=examples/accelerate_configs/multi_gpu.yaml
    --num_processes $DSTACK_GPUS_PER_NODE
    examples/scripts/sft.py
    --model_name meta-llama/Meta-Llama-3.1-8B
    --dataset_name OpenAssistant/oasst_top1_2023-08-25
    --dataset_text_field="text"
    --per_device_train_batch_size 1
    --per_device_eval_batch_size 1
    --gradient_accumulation_steps 4
    --learning_rate 2e-4
    --report_to wandb
    --bf16
    --max_seq_length 1024
    --lora_r 16 --lora_alpha 32
    --lora_target_modules q_proj k_proj v_proj o_proj
    --load_in_4bit
    --use_peft
    --attn_implementation "flash_attention_2"
    --logging_steps=10
    --output_dir models/llama31
    --hub_model_id peterschmidt85/FineLlama-3.1-8B
resources:
  gpu:
    # 24GB or more vRAM
    memory: 24GB..
    # One or more GPU
    count: 1..
  # Shared memory (for multi-gpu)
  shm_size: 24GB
Change the resources property to specify more GPUs.
Memory requirements
Below are the approximate memory requirements for fine-tuning Llama 3.1.
Model size | Full fine-tuning | LoRA | QLoRA |
---|---|---|---|
8B | 60GB | 16GB | 6GB |
70B | 500GB | 160GB | 48GB |
405B | 3.25TB | 950GB | 250GB |
The requirements can be significantly reduced with certain optimizations.
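As a rough planning aid, the table can be turned into a lower bound on the number of GPUs of a given size. The sketch below uses a hypothetical helper and assumes the memory is sharded evenly across GPUs (as with ZeRO Stage 3 or FSDP); plain data parallelism replicates the model on every GPU and would not reduce the per-GPU requirement.

```python
import math

# Approximate fine-tuning memory in GB, copied from the table above
# (3.25TB is taken as 3250GB).
FINE_TUNING_GB = {
    "8B": {"full": 60, "lora": 16, "qlora": 6},
    "70B": {"full": 500, "lora": 160, "qlora": 48},
    "405B": {"full": 3250, "lora": 950, "qlora": 250},
}


def min_gpus(model_size: str, method: str, gpu_gb: int) -> int:
    """Lower bound on GPU count, assuming memory is sharded evenly."""
    return math.ceil(FINE_TUNING_GB[model_size][method] / gpu_gb)


for model_size, method, gpu_gb in [
    ("8B", "qlora", 24),
    ("70B", "qlora", 24),
    ("70B", "full", 80),
]:
    count = min_gpus(model_size, method, gpu_gb)
    print(f"{model_size} {method} on {gpu_gb}GB GPUs: at least {count}")
```

Real runs also need headroom for activations and communication buffers, so the actual count is often higher.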
DeepSpeed
For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3.
To do this, use the examples/accelerate_configs/deepspeed_zero3.yaml configuration file instead of examples/accelerate_configs/multi_gpu.yaml.
Running on multiple nodes
If the model doesn't fit on the GPUs of a single node, consider running a dstack task on multiple nodes.
Below is the corresponding task configuration file.
type: task
name: trl-train-distrib
# Size of the cluster
nodes: 2
python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention)
nvcc: true
env:
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  - pip install "transformers>=4.43.2"
  - pip install bitsandbytes
  - pip install flash-attn --no-build-isolation
  - pip install peft
  - pip install wandb
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - accelerate launch
    --config_file=examples/accelerate_configs/fsdp_qlora.yaml
    --main_process_ip=$DSTACK_MASTER_NODE_IP
    --main_process_port=8008
    --machine_rank=$DSTACK_NODE_RANK
    --num_processes=$DSTACK_GPUS_NUM
    --num_machines=$DSTACK_NODES_NUM
    examples/scripts/sft.py
    --model_name meta-llama/Meta-Llama-3.1-8B
    --dataset_name OpenAssistant/oasst_top1_2023-08-25
    --dataset_text_field="text"
    --per_device_train_batch_size 1
    --per_device_eval_batch_size 1
    --gradient_accumulation_steps 4
    --learning_rate 2e-4
    --report_to wandb
    --bf16
    --max_seq_length 1024
    --lora_r 16 --lora_alpha 32
    --lora_target_modules q_proj k_proj v_proj o_proj
    --load_in_4bit
    --use_peft
    --attn_implementation "flash_attention_2"
    --logging_steps=10
    --output_dir models/llama31
    --hub_model_id peterschmidt85/FineLlama-3.1-8B
    --torch_dtype bfloat16
    --use_bnb_nested_quant
resources:
  gpu:
    # 24GB or more vRAM
    memory: 24GB..
    # One or more GPU
    count: 1..
  # Shared memory (for multi-gpu)
  shm_size: 24GB
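Every node runs the same commands, and only the dstack-provided environment variables differ between them. The sketch below illustrates how those variables map onto the accelerate launch flags used above. The helper is hypothetical, the example values (including the IP address) are made up, and it assumes DSTACK_GPUS_NUM holds the total GPU count across all nodes, which is what accelerate expects for --num_processes.

```python
import os
from typing import Mapping


def accelerate_launch_args(env: Mapping[str, str] = os.environ) -> list[str]:
    """Build the distributed part of the accelerate launch command from dstack variables."""
    return [
        "accelerate", "launch",
        "--config_file=examples/accelerate_configs/fsdp_qlora.yaml",
        f"--main_process_ip={env['DSTACK_MASTER_NODE_IP']}",  # address of the rank-0 node
        "--main_process_port=8008",
        f"--machine_rank={env['DSTACK_NODE_RANK']}",          # differs on every node
        f"--num_processes={env['DSTACK_GPUS_NUM']}",          # assumed total across nodes
        f"--num_machines={env['DSTACK_NODES_NUM']}",
    ]


# Hypothetical values for the second node of a 2-node cluster with 1 GPU each.
example_env = {
    "DSTACK_MASTER_NODE_IP": "10.0.0.1",
    "DSTACK_NODE_RANK": "1",
    "DSTACK_GPUS_NUM": "2",
    "DSTACK_NODES_NUM": "2",
}
print(" ".join(accelerate_launch_args(example_env)))
```

On the first node only the machine rank would change (to 0), so a single task configuration can serve the whole cluster.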
Source code
The source code of this example can be found in examples/llms/llama31 and examples/fine-tuning/trl.
What's next?
- Check dev environments, tasks, services, and protips.
- Browse Llama 3.1 on HuggingFace, HuggingFace's Llama recipes, Meta's Llama recipes, and Llama Agentic System.