
Llama

This example walks you through how to deploy the Llama 4 Scout model with dstack.

Prerequisites

Once dstack is installed, clone the repo and run dstack init.

$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init

Deployment

AMD

Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct using vLLM with AMD MI300X GPUs.

type: service
name: llama4-scout

image: rocm/vllm-dev:llama4-20250407
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - VLLM_WORKER_MULTIPROC_METHOD=spawn
  - VLLM_USE_MODELSCOPE=False
  - VLLM_USE_TRITON_FLASH_ATTN=0 
  - MAX_MODEL_LEN=256000

commands:
   - |
     vllm serve $MODEL_ID \
       --tensor-parallel-size $DSTACK_GPUS_NUM \
       --max-model-len $MAX_MODEL_LEN \
       --kv-cache-dtype fp8 \
       --max-num-seqs 64 \
       --override-generation-config='{"attn_temperature_tuning": true}'


port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: MI300X:2
  disk: 500GB..

NVIDIA

Here are two examples of a service that deploys Llama-4-Scout-17B-16E-Instruct on NVIDIA H200 GPUs: first using SGLang, then using vLLM.

type: service
name: llama4-scout

image: lmsysorg/sglang
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - CONTEXT_LEN=256000
commands:
   - python3 -m sglang.launch_server
       --model-path $MODEL_ID
       --tp $DSTACK_GPUS_NUM
       --context-length $CONTEXT_LEN
       --kv-cache-dtype fp8_e5m2
       --port 8000

port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..

And here's the equivalent configuration using vLLM:

type: service
name: llama4-scout

image: vllm/vllm-openai
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - VLLM_DISABLE_COMPILE_CACHE=1
  - MAX_MODEL_LEN=256000
commands:
   - |
     vllm serve $MODEL_ID \
       --tensor-parallel-size $DSTACK_GPUS_NUM \
       --max-model-len $MAX_MODEL_LEN \
       --kv-cache-dtype fp8 \
       --override-generation-config='{"attn_temperature_tuning": true}'

port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..

NOTE:

With vLLM, add --override-generation-config='{"attn_temperature_tuning": true}' to improve accuracy for contexts longer than 32K tokens.

Memory requirements

Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.

Model      Size   FP16    FP8     INT4
Behemoth   2T     4TB     2TB     1TB
Maverick   400B   800GB   400GB   200GB
Scout      109B   218GB   109GB   54.5GB
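These figures follow directly from the parameter count times bytes per parameter. For example, Scout at FP16 works out to 109B parameters × 2 bytes ≈ 218GB, and at INT4 to 109B × 0.5 bytes ≈ 54.5GB.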

Running a configuration

To run a configuration, use the dstack apply command.

$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml

 #  BACKEND  REGION     RESOURCES                      SPOT PRICE   
 1  vastai   is-iceland 48xCPU, 128GB, 2xH200 (140GB)  no   $7.87   
 2  runpod   EU-SE-1    40xCPU, 128GB, 2xH200 (140GB)  no   $7.98  


Submit the run llama4-scout? [y/n]: y

Provisioning...
---> 100%

Once the service is up, it will be available via the service endpoint at <dstack server URL>/proxy/services/<project name>/<run name>/. Since the configuration registers the model, it can also be queried through the OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/:

curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "stream": true,
      "max_tokens": 512
    }'

When a gateway is configured, the service endpoint is available at https://<run name>.<gateway domain>/.
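As a sketch, assuming a gateway with the illustrative domain example.com and the run name llama4-scout, the same request could be sent straight to the OpenAI-compatible route exposed by vLLM or SGLang:

curl https://llama4-scout.example.com/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "messages": [
        {"role": "user", "content": "What is Deep Learning?"}
      ],
      "max_tokens": 512
    }'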

Fine-tuning

Here's an example of FSDP and QLoRA fine-tuning of the 4-bit quantized Llama-4-Scout-17B-16E on 2x NVIDIA H100 GPUs using Axolotl.

type: task
# The name is optional; if not specified, it's generated randomly
name: axolotl-nvidia-llama-scout-train

# Using Axolotl's official Docker image
image: axolotlai/axolotl:main-latest

# Required environment variables
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - WANDB_PROJECT
  - WANDB_NAME=axolotl-nvidia-llama-scout-train
  - HUB_MODEL_ID
# Commands of the task
commands:
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml
  - axolotl train scout-qlora-fsdp1.yaml 
            --wandb-project $WANDB_PROJECT 
            --wandb-name $WANDB_NAME 
            --hub-model-id $HUB_MODEL_ID

resources:
  # Two GPUs (required by FSDP)
  gpu: H100:2
  # Shared memory size for inter-process communication
  shm_size: 24GB
  disk: 500GB..

The task uses Axolotl's Docker image, where Axolotl is already pre-installed.

Memory requirements

Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.

Model      Size   Full fine-tuning   LoRA     QLoRA
Behemoth   2T     32TB               4.3TB    1.3TB
Maverick   400B   6.5TB              864GB    264GB
Scout      109B   1.75TB             236GB    72GB

The memory estimates assume FP16 precision for model weights, with low-rank adaptation (LoRA/QLoRA) layers comprising 1% of the total model parameters.

Fine-tuning type   Calculation
Full fine-tuning   2T × 16 bytes = 32TB
LoRA               2T × 2 bytes + 1% of 2T × 16 bytes = 4.3TB
QLoRA (4-bit)      2T × 0.5 bytes + 1% of 2T × 16 bytes = 1.3TB
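Applying the same formula to Scout (109B), QLoRA comes to 109B × 0.5 bytes + 1% of 109B × 16 bytes ≈ 54.5GB + 17.4GB ≈ 72GB, which matches the table above.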

Running a configuration

Once the configuration is ready, run dstack apply -f <configuration file>, and dstack will automatically provision the cloud resources and run the configuration.

$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ WANDB_NAME=axolotl-nvidia-llama-scout-train
$ HUB_MODEL_ID=...
$ dstack apply -f examples/fine-tuning/axolotl/.dstack.yml

Source code

The source code for the deployment examples can be found in examples/llms/llama, and the source code for the fine-tuning example in examples/fine-tuning/axolotl.

What's next?

  1. Check dev environments, tasks, services, and protips.
  2. Browse Llama 4 with SGLang, Llama 4 with vLLM, Llama 4 with AMD, and Axolotl.