This example demonstrates how to use vLLM with dstack's services to deploy LLMs.

Define the configuration

To deploy an LLM as a service using vLLM, define the following configuration file:

```yaml
type: service

image: vllm/vllm-openai:latest
env:
  - MODEL=NousResearch/Llama-2-7b-chat-hf
  - PYTHONPATH=/workspace
commands:
  - python3 -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: NousResearch/Llama-2-7b-chat-hf
```

Run the configuration


Before running a service, ensure that you have configured a gateway. If you're using dstack Sky, the default gateway is configured automatically for you.

```shell
$ dstack run . -f deployment/vllm/serve.dstack.yml
```

Access the endpoint

Once the service is up, you can query it at https://<run name>.<gateway domain> (using the domain set up for the gateway):
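For example, with a hypothetical run name of `yellow-cat-1` and a gateway domain of `example.com` (both placeholders; substitute your own), you could query vLLM's OpenAI-compatible chat completions endpoint with curl:

```shell
# Placeholders: replace the run name, gateway domain, and token with your own.
# The vLLM OpenAI-compatible server serves chat completions at /v1/chat/completions.
curl https://yellow-cat-1.example.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <dstack token>" \
    -d '{
        "model": "NousResearch/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "What is deep learning?"}]
    }'
```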


By default, the service endpoint requires the Authorization header set to "Bearer <dstack token>".

OpenAI interface

Because we've configured the model mapping, it will also be possible to access the model at https://gateway.<gateway domain> via the OpenAI-compatible interface.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.<gateway domain>",
    api_key="<dstack token>"
)

completion = client.chat.completions.create(
    model="NousResearch/Llama-2-7b-chat-hf",
    messages=[
        {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming.",
        }
    ],
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content, end="")
```

Hugging Face Hub token

To use a model with gated access, make sure to configure the HUGGING_FACE_HUB_TOKEN environment variable (with --env in dstack run or using env in the configuration file).
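For instance, the token can be passed on the command line when launching the run (the token value below is a placeholder):

```shell
$ dstack run . -f deployment/vllm/serve.dstack.yml --env HUGGING_FACE_HUB_TOKEN=<your token>
```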

Source code

The complete, ready-to-run code is available in dstackai/dstack-examples.

What's next?

  1. Check the Text Generation Inference example
  2. Read about services
  3. Browse examples
  4. Join the Discord server