

This example demonstrates how to use Ollama with dstack's services to deploy LLMs.

Define the configuration

To deploy an LLM as a service using Ollama, define the following configuration file:

type: service

image: ollama/ollama
commands:
  - ollama serve &
  - sleep 3
  - ollama pull mixtral
  - fg
port: 11434

resources:
  gpu: 48GB..80GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  type: chat
  name: mixtral
  format: openai

Run the configuration


Before running a service, ensure that you have configured a gateway. If you're using dstack Sky, the default gateway is configured automatically for you.

$ dstack run . -f deployment/ollama/serve.dstack.yml

Access the endpoint

Once the service is up, you can query it at https://<run name>.<gateway domain> (using the domain set up for the gateway):


By default, the service endpoint requires the Authorization header with "Bearer <dstack token>".
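As a sketch, you can query Ollama's native /api/generate endpoint with the token passed in the Authorization header. The run name, gateway domain, and token below are placeholders you must substitute; only the standard library is used:

```python
import json
import urllib.request

# Placeholders: substitute your actual run name, gateway domain, and dstack token.
ENDPOINT = "https://<run name>.<gateway domain>"
TOKEN = "<dstack token>"

def build_generate_request(prompt: str) -> urllib.request.Request:
    """Build a POST request against Ollama's /api/generate endpoint,
    passing the dstack token in the Authorization header."""
    payload = json.dumps({"model": "mixtral", "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{ENDPOINT}/api/generate",
        data=payload.encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_generate_request("Why is the sky blue?")
# To actually send it (requires the service to be up and real values above):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```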

OpenAI interface

Because we've configured the model mapping, it will also be possible to access the model at https://gateway.<gateway domain> via the OpenAI-compatible interface.

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.<gateway domain>",
    api_key="<dstack token>",
)

completion = client.chat.completions.create(
    model="mixtral",
    messages=[
        {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming.",
        }
    ],
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content, end="")

Hugging Face Hub token

To use a model with gated access, make sure to configure the HUGGING_FACE_HUB_TOKEN environment variable (with --env in dstack run, or using env in the configuration file).
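For example, listing the variable under env without a value (a sketch; dstack then forwards it from your local environment):

```yaml
env:
  - HUGGING_FACE_HUB_TOKEN
```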

Source code

The complete, ready-to-run code is available in dstackai/dstack-examples.

What's next?

  1. Check the vLLM and Text Generation Inference examples
  2. Read about services
  3. Browse examples
  4. Join the Discord server