
Llama

This example walks you through how to deploy the Llama 4 Scout model with dstack.

Prerequisites

Once dstack is installed, clone the repo and run dstack init.

$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init

Deployment

Below are examples of services that deploy Llama-4-Scout-17B-16E-Instruct on NVIDIA H200 GPUs; the first uses SGLang, the second uses vLLM.

type: service
name: llama4-scout

image: lmsysorg/sglang
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - CONTEXT_LEN=256000
commands:
  - python3 -m sglang.launch_server
      --model-path $MODEL_ID
      --tp $DSTACK_GPUS_NUM
      --context-length $CONTEXT_LEN
      --kv-cache-dtype fp8_e5m2
      --port 8000

port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..

type: service
name: llama4-scout

image: vllm/vllm-openai
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - VLLM_DISABLE_COMPILE_CACHE=1
  - MAX_MODEL_LEN=256000
commands:
  - |
    vllm serve $MODEL_ID \
      --tensor-parallel-size $DSTACK_GPUS_NUM \
      --max-model-len $MAX_MODEL_LEN \
      --kv-cache-dtype fp8 \
      --override-generation-config='{"attn_temperature_tuning": true}'

port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..

NOTE:

With vLLM, add --override-generation-config='{"attn_temperature_tuning": true}' to improve accuracy for contexts longer than 32K tokens.

Memory requirements

Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.

Model     Size   FP16    FP8     INT4
Behemoth  2T     4TB     2TB     1TB
Maverick  400B   800GB   400GB   200GB
Scout     109B   218GB   109GB   54.5GB
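
As a rule of thumb, weight memory is roughly the parameter count multiplied by the bytes per parameter (2 for FP16, 1 for FP8, 0.5 for INT4). The snippet below is a minimal sketch of that arithmetic, not part of dstack; it reproduces the Scout row of the table.

# Rough weight-memory estimate: parameters (in billions) x bytes per parameter ~= GB.
# Illustrative only; it ignores the model context (KV cache) and CUDA kernel reservations.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, dtype: str) -> float:
    return params_billions * BYTES_PER_PARAM[dtype]

for dtype in ("FP16", "FP8", "INT4"):
    print(dtype, weight_memory_gb(109, dtype))  # Scout: 218.0, 109.0, 54.5 GB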

Running a configuration

To run a configuration, use the dstack apply command.

$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml

 #  BACKEND  REGION     RESOURCES                      SPOT PRICE   
 1  vastai   is-iceland 48xCPU, 128GB, 2xH200 (140GB)  no   $7.87   
 2  runpod   EU-SE-1    40xCPU, 128GB, 2xH200 (140GB)  no   $7.98  


Submit the run llama4-scout? [y/n]: y

Provisioning...
---> 100%

Once the service is up, it will be available via the service endpoint at <dstack server URL>/proxy/services/<project name>/<run name>/. Because the model is registered, it can also be queried through the OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/, as shown below.

curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "stream": true,
      "max_tokens": 512
    }'
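
The same OpenAI-compatible endpoint can be called with the openai Python SDK. Below is a minimal sketch that assumes the dstack server runs at http://127.0.0.1:3000, the project is named main, and <dstack token> is your dstack user token.

# Minimal sketch: query the registered model through dstack's OpenAI-compatible endpoint.
# Assumes a dstack server at 127.0.0.1:3000 and a project named "main".
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/models/main",
    api_key="<dstack token>",  # your dstack user token
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)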

When a gateway is configured, the service endpoint is available at https://<run name>.<gateway domain>/.

Source code

The source code of this example can be found in examples/llms/llama.

What's next?

  1. Check dev environments, tasks, services, and protips.
  2. Browse Llama 4 with SGLang and Llama 4 with vLLM.