
Llama 3.2

This example walks you through how to deploy the Llama 3.2 vision model using dstack and vLLM.

Prerequisites

Once dstack is installed, go ahead and clone the repo and run dstack init.

$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init

Deployment

Running as a task

If you'd like to run the Llama 3.2 vision model for development purposes, consider using dstack tasks.

type: task
name: llama32-task-vllm

# If `image` is not specified, dstack uses its default image
python: "3.10"
# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct
  - MAX_MODEL_LEN=13488
  - MAX_NUM_SEQS=40
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --tensor-parallel-size $DSTACK_GPUS_NUM
    --max-model-len $MAX_MODEL_LEN
    --max-num-seqs $MAX_NUM_SEQS
    --enforce-eager
    --disable-log-requests
    --limit-mm-per-prompt "image=1"
# Expose the vllm server port
ports: 
  - 8000
# Use either spot or on-demand instances
spot_policy: auto

resources:
  # Required resources
  gpu: 48GB

Note that with the resources above, the maximum size of vLLM's KV cache is 13488 tokens; consequently, we set MAX_MODEL_LEN to 13488. Setting MAX_NUM_SEQS higher than 40 results in an out-of-memory error.

Deploying as a service

If you'd like to deploy the Llama 3.2 vision model as a public, auto-scalable, and secure endpoint, consider using dstack services.
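As a rough sketch, a service configuration could mirror the task above; the run name and the optional model mapping below are assumptions, so check the dstack services documentation for the exact schema.

type: service
name: llama32-service-vllm

# If `image` is not specified, dstack uses its default image
python: "3.10"
# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct
  - MAX_MODEL_LEN=13488
  - MAX_NUM_SEQS=40
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --tensor-parallel-size $DSTACK_GPUS_NUM
    --max-model-len $MAX_MODEL_LEN
    --max-num-seqs $MAX_NUM_SEQS
    --enforce-eager
    --disable-log-requests
    --limit-mm-per-prompt "image=1"
# The port vLLM listens on
port: 8000
# (Optional) Register the model so it is served via an OpenAI-compatible endpoint
model: meta-llama/Llama-3.2-11B-Vision-Instruct
# Use either spot or on-demand instances
spot_policy: auto

resources:
  # Required resources
  gpu: 48GB

Depending on your dstack version, exposing the service publicly may also require configuring a gateway first; see the services documentation for details.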

Memory requirements

Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.

Model size    FP16
11B           40GB
90B           180GB
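The configuration above requests a single 48GB GPU, which is enough for the 11B model. The 90B model needs to be sharded across several larger GPUs via --tensor-parallel-size $DSTACK_GPUS_NUM. As a rough, untested sketch (the GPU size and count here are assumptions), the resources block could look like this:

resources:
  # ~180GB of FP16 weights plus KV cache, spread across multiple GPUs
  gpu: 80GB:4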

Running a configuration

To run a configuration, use the dstack apply command.

$ export HUGGING_FACE_HUB_TOKEN=...

$ dstack apply -f examples/llms/llama32/vllm/task.dstack.yml

 #  BACKEND  REGION     RESOURCES                    SPOT  PRICE   
 1  runpod   CA-MTL-1   9xCPU, 50GB, 1xA40 (48GB)    yes   $0.24   
 2  runpod   EU-SE-1    9xCPU, 50GB, 1xA40 (48GB)    yes   $0.24   
 3  runpod   EU-SE-1    9xCPU, 50GB, 1xA6000 (48GB)  yes   $0.25   


Submit the run llama32-task-vllm? [y/n]: y

Provisioning...
---> 100%

If you run a task, dstack apply automatically forwards the remote ports to localhost for convenient access.

$ curl http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer token' \
    --data '{
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "Describe the image."},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"}}
            ]
        }],
        "max_tokens": 2048
    }'

Source code

The source code of this example can be found in examples/llms/llama32.

What's next?

  1. Check dev environments, tasks, services, and protips.
  2. Browse Llama 3.2 on Hugging Face and Llama 3.2 on vLLM.