Llama 3.2¶
This example walks you through how to deploy the Llama 3.2 vision model with dstack using vLLM.
Prerequisites
Once dstack is installed, clone the repo and run dstack init.
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
Deployment¶
Running as a task¶
If you'd like to run the Llama 3.2 vision model for development purposes, consider using dstack tasks.
type: task
name: llama32-task-vllm

# If `image` is not specified, dstack uses its default image
python: "3.10"

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct
  - MAX_MODEL_LEN=13488
  - MAX_NUM_SEQS=40
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --tensor-parallel-size $DSTACK_GPUS_NUM
    --max-model-len $MAX_MODEL_LEN
    --max-num-seqs $MAX_NUM_SEQS
    --enforce-eager
    --disable-log-requests
    --limit-mm-per-prompt "image=1"

# Expose the vllm server port
ports:
  - 8000

# Use either spot or on-demand instances
spot_policy: auto

resources:
  # Required resources
  gpu: 48GB
Note that the maximum number of tokens vLLM's KV cache can hold here is 13488, so we must set MAX_MODEL_LEN to 13488. Likewise, setting MAX_NUM_SEQS greater than 40 results in an out-of-memory error.
Deploying as a service¶
If you'd like to deploy the Llama 3.2 vision model as a public, auto-scalable, and secure endpoint, consider using dstack services.
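A service configuration closely mirrors the task above: the type changes to service, a single port is exposed, and the model can be registered so dstack's gateway routes OpenAI-compatible requests to it. The sketch below is illustrative, reusing the task's environment variables and resources; the exact service file in the repo may differ.

type: service
name: llama32-service-vllm

python: "3.10"

env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct
  - MAX_MODEL_LEN=13488
  - MAX_NUM_SEQS=40
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --tensor-parallel-size $DSTACK_GPUS_NUM
    --max-model-len $MAX_MODEL_LEN
    --max-num-seqs $MAX_NUM_SEQS
    --enforce-eager
    --disable-log-requests
    --limit-mm-per-prompt "image=1"

# The port of the vllm server
port: 8000
# Register the model so the gateway can serve it via an OpenAI-compatible endpoint
model: meta-llama/Llama-3.2-11B-Vision-Instruct

spot_policy: auto

resources:
  gpu: 48GB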
Memory requirements¶
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
| Model size | FP16  |
| ---------- | ----- |
| 11B        | 40GB  |
| 90B        | 180GB |
Running a configuration¶
To run a configuration, use the dstack apply
command.
$ export HUGGING_FACE_HUB_TOKEN=...
$ dstack apply -f examples/llms/llama32/vllm/task.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24
2 runpod EU-SE-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24
3 runpod EU-SE-1 9xCPU, 50GB, 1xA6000 (48GB) yes $0.25
Submit the run llama32-task-vllm? [y/n]: y
Provisioning...
---> 100%
If you run a task, dstack apply automatically forwards the remote ports to localhost for convenient access.
$ curl http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer token' \
    --data '{
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image."},
                    {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"}}
                ]
            }
        ],
        "max_tokens": 2048
    }'
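If you prefer Python, the same request can be sent with the openai client (pip install openai). This is a minimal sketch assuming the task's port is forwarded to localhost:8000 as above:

from openai import OpenAI

# Point the client at the forwarded vLLM server; the API key is arbitrary
# unless vLLM was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"},
                },
            ],
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)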
Source code¶
The source code for this example can be found in examples/llms/llama32.
What's next?¶
- Check dev environments, tasks, services, and protips.
- Browse Llama 3.2 on Hugging Face and Llama 3.2 on vLLM.