Llama¶
This example walks you through how to deploy the Llama 4 Scout model with dstack.
Prerequisites¶
Once dstack is installed, clone the repo and run dstack init.
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
Deployment¶
AMD¶
Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct using vLLM with AMD MI300X GPUs.
type: service
name: llama4-scout
image: rocm/vllm-dev:llama4-20250407
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
- VLLM_WORKER_MULTIPROC_METHOD=spawn
- VLLM_USE_MODELSCOPE=False
- VLLM_USE_TRITON_FLASH_ATTN=0
- MAX_MODEL_LEN=256000
commands:
- |
vllm serve $MODEL_ID \
--tensor-parallel-size $DSTACK_GPUS_NUM \
--max-model-len $MAX_MODEL_LEN \
--kv-cache-dtype fp8 \
--max-num-seqs 64 \
--override-generation-config='{"attn_temperature_tuning": true}'
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
resources:
gpu: MI300X:2
disk: 500GB..
NVIDIA¶
Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct using SGLang or vLLM with NVIDIA H200 GPUs.
type: service
name: llama4-scout
image: lmsysorg/sglang
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
- CONTEXT_LEN=256000
commands:
- python3 -m sglang.launch_server
--model-path $MODEL_ID
--tp $DSTACK_GPUS_NUM
--context-length $CONTEXT_LEN
--kv-cache-dtype fp8_e5m2
--port 8000
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
resources:
gpu: H200:2
disk: 500GB..
And here's the same service deployed with vLLM instead of SGLang:
type: service
name: llama4-scout
image: vllm/vllm-openai
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
- VLLM_DISABLE_COMPILE_CACHE=1
- MAX_MODEL_LEN=256000
commands:
- |
vllm serve $MODEL_ID \
--tensor-parallel-size $DSTACK_GPUS_NUM \
--max-model-len $MAX_MODEL_LEN \
--kv-cache-dtype fp8 \
--override-generation-config='{"attn_temperature_tuning": true}'
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
resources:
gpu: H200:2
disk: 500GB..
NOTE:
With vLLM, add --override-generation-config='{"attn_temperature_tuning": true}' to improve accuracy for contexts longer than 32K tokens.
Memory requirements¶
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
Model | Size | FP16 | FP8 | INT4 |
---|---|---|---|---|
Behemoth | 2T | 4TB | 2TB | 1TB |
Maverick | 400B | 800GB | 400GB | 200GB |
Scout | 109B | 218GB | 109GB | 54.5GB |
Running a configuration¶
To run a configuration, use the dstack apply command.
$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87
2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98
Submit the run llama4-scout? [y/n]: y
Provisioning...
---> 100%
Once the service is up, it will be available via the service endpoint at <dstack server URL>/proxy/services/<project name>/<run name>/. Since the configuration registers the model, it can also be queried via dstack's OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/:
curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-X POST \
-H 'Authorization: Bearer <dstack token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"stream": true,
"max_tokens": 512
}'
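If you prefer Python, here's a minimal sketch that sends the same request with the openai package. It assumes the same placeholders as the curl example above (a dstack server at http://127.0.0.1:3000, the main project, and a valid dstack token).

```python
from openai import OpenAI

# Point the client at dstack's OpenAI-compatible model endpoint
# (the same endpoint used by the curl example above).
client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/models/main",
    api_key="<dstack token>",
)

# Stream a chat completion from the deployed Llama 4 Scout service.
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    stream=True,
    max_tokens=512,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```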
When a gateway is configured, the service endpoint is available at https://<run name>.<gateway domain>/.
Fine-tuning¶
Here's an example of FSDP and QLoRA fine-tuning of the 4-bit quantized Llama-4-Scout-17B-16E on two NVIDIA H100 GPUs using Axolotl.
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-nvidia-llama-scout-train
# Using Axolotl's official Docker image
image: axolotlai/axolotl:main-latest
# Required environment variables
env:
- HF_TOKEN
- WANDB_API_KEY
- WANDB_PROJECT
- WANDB_NAME=axolotl-nvidia-llama-scout-train
- HUB_MODEL_ID
# Commands of the task
commands:
- wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml
- axolotl train scout-qlora-fsdp1.yaml
--wandb-project $WANDB_PROJECT
--wandb-name $WANDB_NAME
--hub-model-id $HUB_MODEL_ID
resources:
# Two GPUs (required by FSDP)
gpu: H100:2
# Shared memory size for inter-process communication
shm_size: 24GB
disk: 500GB..
The task uses Axolotl's official Docker image, which comes with Axolotl pre-installed.
Memory requirements¶
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
Model | Size | Full fine-tuning | LoRA | QLoRA |
---|---|---|---|---|
Behemoth | 2T | 32TB | 4.3TB | 1.3TB |
Maverick | 400B | 6.5TB | 864GB | 264GB |
Scout | 109B | 1.75TB | 236GB | 72GB |
The memory estimates assume FP16 precision for model weights, with low-rank adaptation (LoRA/QLoRA) layers comprising 1% of the total model parameters.
Fine-tuning type | Calculation |
---|---|
Full fine-tuning | 2T × 16 bytes = 32TB |
LoRA | 2T × 2 bytes + 1% of 2T × 16 bytes = 4.3TB |
QLoRA (4-bit) | 2T × 0.5 bytes + 1% of 2T × 16 bytes = 1.3TB |
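The same arithmetic can be written as a small helper. This is an illustrative sketch of the estimates above (assuming FP16 weights and adapters covering 1% of the model parameters), not code from the example:

```python
# Approximate fine-tuning memory, in GB, for a model with `params_billion` parameters.
# full:  ~16 bytes/param (FP16 weights + gradients + optimizer states)
# lora:  FP16 base weights (2 bytes/param) + adapters (~1% of params) trained at ~16 bytes/param
# qlora: 4-bit base weights (0.5 bytes/param) + the same adapter overhead
def fine_tuning_memory_gb(params_billion: float, mode: str, adapter_fraction: float = 0.01) -> float:
    if mode == "full":
        return params_billion * 16
    base_bytes = {"lora": 2.0, "qlora": 0.5}[mode]
    adapter = params_billion * adapter_fraction * 16
    return params_billion * base_bytes + adapter

for name, size_b in [("Behemoth", 2000), ("Maverick", 400), ("Scout", 109)]:
    print(name, {m: round(fine_tuning_memory_gb(size_b, m)) for m in ("full", "lora", "qlora")})
    # Behemoth -> ~32000 GB full, ~4320 GB LoRA, ~1320 GB QLoRA (i.e. 32TB / 4.3TB / 1.3TB)
```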
Running a configuration¶
Once the configuration is ready, run dstack apply -f <configuration file>, and dstack will automatically provision the cloud resources and run the configuration.
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ WANDB_NAME=axolotl-nvidia-llama-scout-train
$ HUB_MODEL_ID=...
$ dstack apply -f examples/fine-tuning/axolotl/.dstack.yml
Source code¶
The source code for the deployment examples can be found in examples/llms/llama, and the source code for the fine-tuning example can be found in examples/fine-tuning/axolotl.
What's next?¶
- Check dev environments, tasks, services, and protips.
- Browse Llama 4 with SGLang, Llama 4 with vLLM, Llama 4 with AMD, and Axolotl.