Llama 3.2
This example walks you through how to deploy the Llama 3.2 vision model with dstack using vLLM.
Prerequisites
Once dstack is installed, go ahead and clone the repo, and run dstack init.
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
Deployment
Here's an example of a service that deploys Llama 3.2 11B using vLLM.
type: service
name: llama32
image: vllm/vllm-openai:latest
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct
  - MAX_MODEL_LEN=4096
  - MAX_NUM_SEQS=8
commands:
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --max-num-seqs $MAX_NUM_SEQS
    --enforce-eager
    --disable-log-requests
    --limit-mm-per-prompt "image=1"
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# Register the model
model: meta-llama/Llama-3.2-11B-Vision-Instruct
# Uncomment to cache downloaded models
#volumes:
#  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
resources:
  gpu: 40GB..48GB
Memory requirements
Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations.
| Model size | FP16  |
|------------|-------|
| 11B        | 40GB  |
| 90B        | 180GB |
Running a configuration
To run a configuration, use the dstack apply command.
$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama32/vllm/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24
2 runpod EU-SE-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24
3 runpod EU-SE-1 9xCPU, 50GB, 1xA6000 (48GB) yes $0.25
Submit the run llama32? [y/n]: y
Provisioning...
---> 100%
Once the service is up, it will be available via the service endpoint at <dstack server URL>/proxy/services/<project name>/<run name>/.
$ curl http://127.0.0.1:3000/proxy/services/main/llama32/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer token' \
--data '{
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Describe the image."},
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"}}
]
}],
"max_tokens": 2048
}'
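Because the service exposes an OpenAI-compatible API, you can also call it from Python. Below is a minimal sketch using the openai package (install it with pip install openai; it is not part of this example), assuming the same local server URL and placeholder token as the curl example above; replace "token" with your dstack user token.
from openai import OpenAI

# Point the client at the dstack proxy endpoint (same URL as in the curl example).
client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/services/main/llama32/v1",
    api_key="token",  # your dstack user token
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/e/ea/Bento_at_Hanabishi%2C_Koyasan.jpg"},
                },
            ],
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)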
When a gateway is configured, the service endpoint is available at https://<run name>.<gateway domain>/.
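If you use the Python sketch above, only the base URL changes when a gateway is in front of the service; the placeholders below are kept as in the text and should be replaced with your actual run name and gateway domain.
client = OpenAI(
    base_url="https://<run name>.<gateway domain>/v1",  # gateway endpoint
    api_key="token",  # still your dstack user token, assuming auth is enabled
)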
Source code
The source code of this example can be found in examples/llms/llama32.
What's next?
- Check dev environments, tasks, services, and protips.
- Browse Llama 3.2 on Hugging Face and Llama 3.2 on vLLM.