Llama¶
This example walks you through how to deploy the Llama 4 Scout model with dstack.
Prerequisites¶
Once dstack is installed, clone the repo and run dstack init.
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
Deployment¶
AMD¶
Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct using vLLM with AMD MI300X GPUs.
type: service
name: llama4-scout
image: rocm/vllm-dev:llama4-20250407
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
- VLLM_WORKER_MULTIPROC_METHOD=spawn
- VLLM_USE_MODELSCOPE=False
- VLLM_USE_TRITON_FLASH_ATTN=0
- MAX_MODEL_LEN=256000
commands:
- |
vllm serve $MODEL_ID \
--tensor-parallel-size $DSTACK_GPUS_NUM \
--max-model-len $MAX_MODEL_LEN \
--kv-cache-dtype fp8 \
--max-num-seqs 64 \
--override-generation-config='{"attn_temperature_tuning": true}'
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
resources:
gpu: MI300X:2
disk: 500GB..
NVIDIA¶
Here's an example of a service that deploys Llama-4-Scout-17B-16E-Instruct using SGLang or vLLM with NVIDIA H200 GPUs.
type: service
name: llama4-scout
image: lmsysorg/sglang
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
- CONTEXT_LEN=256000
commands:
- python3 -m sglang.launch_server
--model-path $MODEL_ID
--tp $DSTACK_GPUS_NUM
--context-length $CONTEXT_LEN
--kv-cache-dtype fp8_e5m2
--port 8000
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
resources:
gpu: H200:2
disk: 500GB..
And here's the same service deployed with vLLM instead of SGLang:
type: service
name: llama4-scout
image: vllm/vllm-openai
env:
- HF_TOKEN
- MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
- VLLM_DISABLE_COMPILE_CACHE=1
- MAX_MODEL_LEN=256000
commands:
- |
vllm serve $MODEL_ID \
--tensor-parallel-size $DSTACK_GPUS_NUM \
--max-model-len $MAX_MODEL_LEN \
--kv-cache-dtype fp8 \
--override-generation-config='{"attn_temperature_tuning": true}'
port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct
resources:
gpu: H200:2
disk: 500GB..
NOTE:
With vLLM, add --override-generation-config='{"attn_temperature_tuning": true}' to improve accuracy for contexts longer than 32K tokens.
Memory requirements¶
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
Model | Size | FP16 | FP8 | INT4 |
---|---|---|---|---|
Behemoth | 2T | 4TB | 2TB | 1TB |
Maverick | 400B | 800GB | 400GB | 200GB |
Scout | 109B | 218GB | 109GB | 54.5GB |
Running a configuration¶
To run a configuration, use the dstack apply command.
$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87
2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98
Submit the run llama4-scout? [y/n]: y
Provisioning...
---> 100%
Once the service is up, it will be available via the service endpoint at <dstack server URL>/proxy/services/<project name>/<run name>/. Since the configuration registers the model, it can also be queried via dstack's OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/:
curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-X POST \
-H 'Authorization: Bearer <dstack token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Deep Learning?"
}
],
"stream": true,
"max_tokens": 512
}'
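If you prefer Python, here's a minimal sketch that sends the same request with the openai package. It assumes the same placeholders as the curl example above (a dstack server at http://127.0.0.1:3000, the main project, and a valid dstack token).

```python
from openai import OpenAI

# Point the client at dstack's OpenAI-compatible model endpoint
# (the same endpoint used by the curl example above).
client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/models/main",
    api_key="<dstack token>",
)

# Stream a chat completion from the deployed Llama 4 Scout service.
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    stream=True,
    max_tokens=512,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```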
When a gateway is configured, the service endpoint is available at https://<run name>.<gateway domain>/.
Fine-tuning¶
Here's an example of FSDP and QLoRA fine-tuning of the 4-bit quantized Llama-4-Scout-17B-16E on two NVIDIA H100 GPUs using Axolotl.
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-nvidia-llama-scout-train
# Using Axolotl's official Docker image
image: axolotlai/axolotl:main-latest
# Required environment variables
env:
- HF_TOKEN
- WANDB_API_KEY
- WANDB_PROJECT
- WANDB_NAME=axolotl-nvidia-llama-scout-train
- HUB_MODEL_ID
# Commands of the task
commands:
- wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml
- axolotl train scout-qlora-fsdp1.yaml
--wandb-project $WANDB_PROJECT
--wandb-name $WANDB_NAME
--hub-model-id $HUB_MODEL_ID
resources:
# Two GPUs (required by FSDP)
gpu: H100:2
# Shared memory size for inter-process communication
shm_size: 24GB
disk: 500GB..
The task uses Axolotl's official Docker image, which comes with Axolotl pre-installed.
Memory requirements¶
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
Model | Size | Full fine-tuning | LoRA | QLoRA |
---|---|---|---|---|
Behemoth | 2T | 32TB | 4.3TB | 1.3TB |
Maverick | 400B | 6.5TB | 864GB | 264GB |
Scout | 109B | 1.75TB | 236GB | 72GB |
The memory estimates assume FP16 precision for model weights, with low-rank adaptation (LoRA/QLoRA) layers comprising 1% of the total model parameters.
Fine-tuning type | Calculation |
---|---|
Full fine-tuning | 2T × 16 bytes = 32TB |
LoRA | 2T × 2 bytes + 1% of 2T × 16 bytes = 4.3TB |
QLoRA (4-bit) | 2T × 0.5 bytes + 1% of 2T × 16 bytes = 1.3TB |
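The same arithmetic can be written as a small helper. This is an illustrative sketch of the estimates above (assuming FP16 weights and adapters covering 1% of the model parameters), not code from the example:

```python
# Approximate fine-tuning memory, in GB, for a model with `params_billion` parameters.
# full:  ~16 bytes/param (FP16 weights + gradients + optimizer states)
# lora:  FP16 base weights (2 bytes/param) + adapters (~1% of params) trained at ~16 bytes/param
# qlora: 4-bit base weights (0.5 bytes/param) + the same adapter overhead
def fine_tuning_memory_gb(params_billion: float, mode: str, adapter_fraction: float = 0.01) -> float:
    if mode == "full":
        return params_billion * 16
    base_bytes = {"lora": 2.0, "qlora": 0.5}[mode]
    adapter = params_billion * adapter_fraction * 16
    return params_billion * base_bytes + adapter

for name, size_b in [("Behemoth", 2000), ("Maverick", 400), ("Scout", 109)]:
    print(name, {m: round(fine_tuning_memory_gb(size_b, m)) for m in ("full", "lora", "qlora")})
    # Behemoth -> ~32000 GB full, ~4320 GB LoRA, ~1320 GB QLoRA (i.e. 32TB / 4.3TB / 1.3TB)
```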
Running a configuration¶
Once the configuration is ready, run dstack apply -f <configuration file>, and dstack will automatically provision the cloud resources and run the configuration.
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ WANDB_NAME=axolotl-nvidia-llama-scout-train
$ HUB_MODEL_ID=...
$ dstack apply -f examples/fine-tuning/axolotl/.dstack.yml
Source code¶
The source code for the deployment examples can be found in examples/llms/llama, and the source code for the fine-tuning example can be found in examples/fine-tuning/axolotl.
What's next?¶
- Check dev environments, tasks, services, and protips.
- Browse Llama 4 with SGLang, Llama 4 with vLLM, Llama 4 with AMD, and Axolotl.