
Llama 3.1

This example walks you through how to deploy and fine-tune Llama 3.1 with dstack.

Prerequisites

Once dstack is installed, clone the repo and run dstack init.

$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init

Deployment

Running as a task

If you'd like to run Llama 3.1 for development purposes, consider using dstack tasks. You can use any serving framework, such as vLLM, TGI, or Ollama. Below is the configuration file for the task.

type: task
name: llama31-task-vllm

python: "3.10"

env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --tensor-parallel-size $DSTACK_GPUS_NUM
    --max-model-len $MAX_MODEL_LEN
ports: [8000]

# Use either spot or on-demand instances
spot_policy: auto

resources:
  # Required resources
  gpu: 24GB
  # Shared memory (required by multi-gpu)
  shm_size: 24GB

type: task
name: llama31-task-tgi

image: ghcr.io/huggingface/text-generation-inference:latest

env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher
ports: [80]

# Use either spot or on-demand instances
spot_policy: auto

resources:
  # Required resources
  gpu: 24GB
  # Shared memory (required by multi-gpu)
  shm_size: 24GB

type: task
name: llama31-task-ollama    

image: ollama/ollama
commands:
  - ollama serve &
  - sleep 3
  - ollama pull llama3.1
  - fg
ports: [11434]

# Use either spot or on-demand instances
spot_policy: auto

# Required resources
resources:
  gpu: 24GB

Note: when using Llama 3.1 8B with a 24GB GPU, we limit the context size to 4096 tokens so the model fits into memory.

Deploying as a service

If you'd like to deploy Llama 3.1 as a public, auto-scalable, and secure endpoint, consider using dstack services.
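
Below is a minimal sketch of a service configuration, assuming the same vLLM setup as in the task above. It is illustrative only; the complete example (including additional options such as a model mapping for the gateway) lives in the repository under examples/llms/llama31.

type: service
name: llama31-service-vllm

python: "3.10"

env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --tensor-parallel-size $DSTACK_GPUS_NUM
    --max-model-len $MAX_MODEL_LEN
# The port the model server listens on
port: 8000

# Use either spot or on-demand instances
spot_policy: auto

resources:
  gpu: 24GB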

Memory requirements

Below are the approximate memory requirements for loading the model. As a rule of thumb, the weights take roughly two bytes per parameter in FP16, one byte in FP8, and half a byte in INT4. This excludes memory for the model context and CUDA kernel reservations.

Model size  FP16   FP8    INT4
8B          16GB   8GB    4GB
70B         140GB  70GB   35GB
405B        810GB  405GB  203GB

For example, the FP16 version of Llama 3.1 405B won't fit into a single machine with eight 80GB GPUs, so we'd need at least two nodes.

Quantization

The INT4 version of Llama 3.1 70B can fit into two 40GB GPUs.

The INT4 version of Llama 3.1 405B can fit into eight 40GB GPUs.
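
As an illustration, the gpu spec in the configurations above could be adjusted to request two 40GB GPUs, e.g. for the INT4 version of Llama 3.1 70B. This is a sketch; which GPU models are actually offered depends on your configured backends.

resources:
  # Two GPUs with 40GB of vRAM each
  gpu: 40GB:2
  # Shared memory (required by multi-gpu)
  shm_size: 24GB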

Running a configuration

To run a configuration, use the dstack apply command.

$ HUGGING_FACE_HUB_TOKEN=...

$ dstack apply -f examples/llms/llama31/vllm/task.dstack.yml

 #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB    yes   $0.12
 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB    yes   $0.12
 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:2  yes   $0.23

Submit the run llama31-task-vllm? [y/n]: y

Provisioning...
---> 100%

If you run a task, dstack apply automatically forwards the remote ports to localhost for convenient access.

$ curl 127.0.0.1:8000/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "max_tokens": 128
    }'

Fine-tuning with TRL

Running on multiple GPUs

Here is the task configuration file for fine-tuning Llama 3.1 8B on the OpenAssistant/oasst_top1_2023-08-25 dataset.

type: task
name: trl-train

python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention) 
nvcc: true

env:
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  - pip install "transformers>=4.43.2"
  - pip install bitsandbytes
  - pip install flash-attn --no-build-isolation
  - pip install peft
  - pip install wandb
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - accelerate launch
    --config_file=examples/accelerate_configs/multi_gpu.yaml
    --num_processes $DSTACK_GPUS_PER_NODE 
    examples/scripts/sft.py
    --model_name meta-llama/Meta-Llama-3.1-8B
    --dataset_name OpenAssistant/oasst_top1_2023-08-25
    --dataset_text_field="text"
    --per_device_train_batch_size 1
    --per_device_eval_batch_size 1
    --gradient_accumulation_steps 4
    --learning_rate 2e-4
    --report_to wandb
    --bf16
    --max_seq_length 1024
    --lora_r 16 --lora_alpha 32
    --lora_target_modules q_proj k_proj v_proj o_proj
    --load_in_4bit
    --use_peft
    --attn_implementation "flash_attention_2"
    --logging_steps=10
    --output_dir models/llama31
    --hub_model_id peterschmidt85/FineLlama-3.1-8B

resources:
  gpu:
    # 24GB or more vRAM
    memory: 24GB..
    # One or more GPU
    count: 1..
  # Shared memory (for multi-gpu)
  shm_size: 24GB

Change the resources property to specify more GPUs.
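
For example, here is a sketch of a resources spec that requests four GPUs with at least 24GB of vRAM each; the count and memory you actually need depend on the model size and the fine-tuning method.

resources:
  gpu:
    # 24GB or more vRAM
    memory: 24GB..
    # Four GPUs
    count: 4
  # Shared memory (for multi-gpu)
  shm_size: 24GB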

Memory requirements

Below are the approximate memory requirements for fine-tuning Llama 3.1.

Model size  Full fine-tuning  LoRA   QLoRA
8B          60GB              16GB   6GB
70B         500GB             160GB  48GB
405B        3.25TB            950GB  250GB

The requirements can be significantly reduced with optimizations such as gradient checkpointing, offloading optimizer state to CPU memory, and sharding model state across GPUs (see DeepSpeed below).

DeepSpeed

For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3.

To do this, use the examples/accelerate_configs/deepspeed_zero3.yaml configuration file instead of examples/accelerate_configs/multi_gpu.yaml.
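
A minimal sketch of the change to the commands section, assuming the rest of the task configuration above stays the same and only the --config_file argument of accelerate launch is swapped:

commands:
  # NOTE: the deepspeed package may also need to be installed, e.g. via `pip install deepspeed`
  # ... same setup commands as in the task above ...
  - accelerate launch
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml
    --num_processes $DSTACK_GPUS_PER_NODE
    examples/scripts/sft.py
    # ... same sft.py arguments as in the task above ...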

Running on multiple nodes

If the model doesn't fit into a single GPU, consider running a dstack task on multiple nodes. Below is the corresponding task configuration file.

type: task
name: trl-train-distrib

# Size of the cluster
nodes: 2

python: "3.10"
# Ensure nvcc is installed (req. for Flash Attention) 
nvcc: true

env:
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  - pip install "transformers>=4.43.2"
  - pip install bitsandbytes
  - pip install flash-attn --no-build-isolation
  - pip install peft
  - pip install wandb
  - git clone https://github.com/huggingface/trl
  - cd trl
  - pip install .
  - accelerate launch
    --config_file=examples/accelerate_configs/fsdp_qlora.yaml 
    --main_process_ip=$DSTACK_MASTER_NODE_IP
    --main_process_port=8008
    --machine_rank=$DSTACK_NODE_RANK
    --num_processes=$DSTACK_GPUS_NUM
    --num_machines=$DSTACK_NODES_NUM
    examples/scripts/sft.py
    --model_name meta-llama/Meta-Llama-3.1-8B
    --dataset_name OpenAssistant/oasst_top1_2023-08-25
    --dataset_text_field="text"
    --per_device_train_batch_size 1
    --per_device_eval_batch_size 1
    --gradient_accumulation_steps 4
    --learning_rate 2e-4
    --report_to wandb
    --bf16
    --max_seq_length 1024
    --lora_r 16 --lora_alpha 32
    --lora_target_modules q_proj k_proj v_proj o_proj
    --load_in_4bit
    --use_peft
    --attn_implementation "flash_attention_2"
    --logging_steps=10
    --output_dir models/llama31
    --hub_model_id peterschmidt85/FineLlama-3.1-8B
    --torch_dtype bfloat16
    --use_bnb_nested_quant

resources:
  gpu:
    # 24GB or more vRAM
    memory: 24GB..
    # One or more GPU
    count: 1..
  # Shared memory (for multi-gpu)
  shm_size: 24GB

Source code

The source code of this example can be found in examples/llms/llama31 and examples/fine-tuning/trl.

What's next?

  1. Check dev environments, tasks, services, and protips.
  2. Browse Llama 3.1 on HuggingFace, HuggingFace's Llama recipes, Meta's Llama recipes, and Llama Agentic System.