# Llama
This example walks you through how to deploy the Llama 4 Scout model with dstack.
## Prerequisites
Once dstack is installed, clone the repo and run `dstack init`:
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
```
## Deployment
Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct using SGLang or vLLM with NVIDIA H200 GPUs.
**SGLang**

```yaml
type: service
name: llama4-scout

image: lmsysorg/sglang
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - CONTEXT_LEN=256000
commands:
  - python3 -m sglang.launch_server
    --model-path $MODEL_ID
    --tp $DSTACK_GPUS_NUM
    --context-length $CONTEXT_LEN
    --kv-cache-dtype fp8_e5m2
    --port 8000
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..
```
**vLLM**

```yaml
type: service
name: llama4-scout

image: vllm/vllm-openai
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - VLLM_DISABLE_COMPILE_CACHE=1
  - MAX_MODEL_LEN=256000
commands:
  - |
    vllm serve $MODEL_ID \
      --tensor-parallel-size $DSTACK_GPUS_NUM \
      --max-model-len $MAX_MODEL_LEN \
      --kv-cache-dtype fp8 \
      --override-generation-config='{"attn_temperature_tuning": true}'
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..
```
NOTE: With vLLM, add `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.
## Memory requirements
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
| Model    | Size | FP16  | FP8   | INT4   |
|----------|------|-------|-------|--------|
| Behemoth | 2T   | 4TB   | 2TB   | 1TB    |
| Maverick | 400B | 800GB | 400GB | 200GB  |
| Scout    | 109B | 218GB | 109GB | 54.5GB |
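These weight-only figures follow from multiplying the parameter count by the bytes per parameter of each precision. Here's a quick back-of-the-envelope sketch of where they come from (an illustration only; KV cache and CUDA overhead come on top):

```python
# Weight-only memory estimate: parameter count x bytes per parameter.
# KV cache, activations, and CUDA kernel reservations are not included.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billion: float, dtype: str) -> float:
    return params_billion * BYTES_PER_PARAM[dtype]

print(weight_memory_gb(109, "FP8"))   # Scout: ~109 GB
print(weight_memory_gb(400, "INT4"))  # Maverick: ~200 GB
```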
## Running a configuration
To run a configuration, use the `dstack apply` command.
```shell
$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml

 #  BACKEND  REGION      RESOURCES                      SPOT  PRICE
 1  vastai   is-iceland  48xCPU, 128GB, 2xH200 (140GB)  no    $7.87
 2  runpod   EU-SE-1     40xCPU, 128GB, 2xH200 (140GB)  no    $7.98

Submit the run llama4-scout? [y/n]: y

Provisioning...
---> 100%
```
Once the service is up, it will be available via the service endpoint at `<dstack server URL>/proxy/services/<project name>/<run name>/`. Since the configuration also registers the model, you can query it through the OpenAI-compatible chat completions endpoint:
```shell
curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "stream": true,
        "max_tokens": 512
    }'
```
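Since the endpoint follows the OpenAI chat completions format, you can also query it from Python. The sketch below assumes the `openai` package and reuses the placeholder server URL, project name (`main`), and token from the curl example above:

```python
# A minimal sketch using the openai package against the OpenAI-compatible
# model endpoint. The base URL, project name ("main"), and token are
# placeholders taken from the curl example above.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/models/main",
    api_key="<dstack token>",
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```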
When a gateway is configured, the service endpoint is available at `https://<run name>.<gateway domain>/`.
## Source code
The source code of this example can be found in `examples/llms/llama`.
## What's next?
- Check dev environments, tasks, services, and protips.
- Browse Llama 4 with SGLang and Llama 4 with vLLM.