Llama¶
This example walks you through how to deploy the Llama 4 Scout model with dstack.
Prerequisites¶
Once dstack is installed, clone the repo with the examples:
$ git clone https://github.com/dstackai/dstack
$ cd dstack
Deployment¶
AMD¶
Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct using vLLM with AMD MI300X GPUs.
type: service
name: llama4-scout

image: rocm/vllm-dev:llama4-20250407
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - VLLM_WORKER_MULTIPROC_METHOD=spawn
  - VLLM_USE_MODELSCOPE=False
  - VLLM_USE_TRITON_FLASH_ATTN=0
  - MAX_MODEL_LEN=256000
commands:
  - |
    vllm serve $MODEL_ID \
      --tensor-parallel-size $DSTACK_GPUS_NUM \
      --max-model-len $MAX_MODEL_LEN \
      --kv-cache-dtype fp8 \
      --max-num-seqs 64 \
      --override-generation-config='{"attn_temperature_tuning": true}'
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: MI300X:2
  disk: 500GB..
NVIDIA¶
Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct using SGLang and vLLM with NVIDIA H200 GPUs. The SGLang configuration is shown first, followed by the vLLM one.
type: service
name: llama4-scout

image: lmsysorg/sglang
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - CONTEXT_LEN=256000
commands:
  - python3 -m sglang.launch_server
    --model-path $MODEL_ID
    --tp $DSTACK_GPUS_NUM
    --context-length $CONTEXT_LEN
    --kv-cache-dtype fp8_e5m2
    --port 8000
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..

type: service
name: llama4-scout

image: vllm/vllm-openai
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - VLLM_DISABLE_COMPILE_CACHE=1
  - MAX_MODEL_LEN=256000
commands:
  - |
    vllm serve $MODEL_ID \
      --tensor-parallel-size $DSTACK_GPUS_NUM \
      --max-model-len $MAX_MODEL_LEN \
      --kv-cache-dtype fp8 \
      --override-generation-config='{"attn_temperature_tuning": true}'
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..
NOTE:
With vLLM, add --override-generation-config='{"attn_temperature_tuning": true}' to
improve accuracy for contexts longer than 32K tokens.
Memory requirements¶
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
| Model | Size | FP16 | FP8 | INT4 |
|---|---|---|---|---|
| Behemoth | 2T | 4TB | 2TB | 1TB |
| Maverick | 400B | 800GB | 400GB | 200GB |
| Scout | 109B | 218GB | 109GB | 54.5GB |
Running a configuration¶
To run a configuration, use the dstack apply command.
$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87
2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98
Submit the run llama4-scout? [y/n]: y
Provisioning...
---> 100%
Once the service is up, it will be available via the service endpoint
at <dstack server URL>/proxy/services/<project name>/<run name>/. Because the configuration registers the model, it can also be queried via dstack's OpenAI-compatible model endpoint:
curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-X POST \
-H 'Authorization: Bearer <dstack token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"stream": true,
"max_tokens": 512
}'
When a gateway is configured, the service endpoint
is available at https://<run name>.<gateway domain>/.
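The same request can also be sent with the OpenAI Python client. Below is a minimal, illustrative sketch; it assumes the openai package is installed and reuses the endpoint, token, and model name from the curl example above (replace them with your own values):

from openai import OpenAI

# Point the client at dstack's OpenAI-compatible model endpoint,
# i.e. <dstack server URL>/proxy/models/<project name>
# (or at the gateway endpoint if a gateway is configured).
client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/models/main",
    api_key="<dstack token>",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=512,
    stream=True,
)

# Print the streamed completion as it arrives
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")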
Fine-tuning¶
Here's an example of FSDP and QLoRA fine-tuning of the 4-bit quantized Llama-4-Scout-17B-16E model on two NVIDIA H100 GPUs using Axolotl.
type: task
# The name is optional; if not specified, it's generated randomly
name: axolotl-nvidia-llama-scout-train

# Using the official Axolotl Docker image
image: axolotlai/axolotl:main-latest

# Required environment variables
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - WANDB_PROJECT
  - WANDB_NAME=axolotl-nvidia-llama-scout-train
  - HUB_MODEL_ID
# Commands of the task
commands:
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml
  - axolotl train scout-qlora-fsdp1.yaml
    --wandb-project $WANDB_PROJECT
    --wandb-name $WANDB_NAME
    --hub-model-id $HUB_MODEL_ID

resources:
  # Two GPUs (required by FSDP)
  gpu: H100:2
  # Shared memory size for inter-process communication
  shm_size: 24GB
  disk: 500GB..
The task uses the official Axolotl Docker image, which has Axolotl pre-installed.
Memory requirements¶
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
| Model | Size | Full fine-tuning | LoRA | QLoRA |
|---|---|---|---|---|
| Behemoth | 2T | 32TB | 4.3TB | 1.3TB |
| Maverick | 400B | 6.5TB | 864GB | 264GB |
| Scout | 109B | 1.75TB | 236GB | 72GB |
The memory estimates assume FP16 precision for model weights, with low-rank adaptation (LoRA/QLoRA) layers comprising 1% of the total model parameters.
| Fine-tuning type | Calculation (for Behemoth, 2T parameters) |
|---|---|
| Full fine-tuning | 2T × 16 bytes = 32TB |
| LoRA | 2T × 2 bytes + 1% of 2T × 16 bytes = 4.3TB |
| QLoRA (4-bit) | 2T × 0.5 bytes + 1% of 2T × 16 bytes = 1.3TB |
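For reference, here's a small, illustrative Python sketch of the same arithmetic, using the assumptions above (16 bytes per trained parameter and adapters equal to 1% of the total parameters):

# Fine-tuning memory estimates following the formulas above (illustrative only).
def finetune_memory_tb(params_trillions: float, mode: str) -> float:
    adapters = 0.01 * params_trillions * 16           # LoRA/QLoRA adapter states
    if mode == "full":
        return params_trillions * 16                  # full fine-tuning
    if mode == "lora":
        return params_trillions * 2 + adapters        # FP16 base weights + adapters
    if mode == "qlora":
        return params_trillions * 0.5 + adapters      # 4-bit base weights + adapters
    raise ValueError(mode)

for model, params in [("Behemoth", 2.0), ("Maverick", 0.4), ("Scout", 0.109)]:
    print(model, {m: round(finetune_memory_tb(params, m), 2) for m in ("full", "lora", "qlora")})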
Running a configuration¶
Once the configuration is ready, run dstack apply -f <configuration file>, and dstack will automatically provision the
cloud resources and run the configuration.
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ WANDB_NAME=axolotl-nvidia-llama-scout-train
$ HUB_MODEL_ID=...
$ dstack apply -f examples/single-node-training/axolotl/.dstack.yml
Source code¶
The source code for the deployment examples can be found in
examples/llms/llama, and the source code for the fine-tuning example in examples/single-node-training/axolotl.
What's next?¶
- Check dev environments, tasks, services, and protips.
- Browse Llama 4 with SGLang, Llama 4 with vLLM, Llama 4 with AMD and Axolotl.