

Services make it very easy to deploy any kind of model or application as public, secure, and scalable endpoints.


If you're using the open-source server, you must set up a gateway before you can run a service.

If you're using dstack Sky, the gateway is already set up for you.


First, create a YAML file in your project folder. Its name must end with .dstack.yml (e.g. .dstack.yml or serve.dstack.yml are both acceptable).

type: service

python: "3.11"
env:
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 80GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: NousResearch/Llama-2-7b-chat-hf

If you don't specify your Docker image, dstack uses the base image (pre-configured with Python, Conda, and essential CUDA drivers).
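If you do want a custom image, you can set the image property instead (the image name below is purely illustrative):

```yaml
image: ghcr.io/example/vllm-serving:latest
```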

Replicas and scaling

By default, the service is deployed to a single instance. However, you can specify the number of replicas and scaling policy. In this case, dstack auto-scales it based on the load.
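For instance, a replica range with an autoscaling target could be sketched like this (assuming the replicas and scaling properties; consult the reference for the exact schema):

```yaml
replicas: 1..4
scaling:
  metric: rps  # target requests per second per replica
  target: 10
```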

See the .dstack.yml reference for more examples of service configuration.


To run a configuration, use the dstack run command followed by the working directory path, configuration file path, and any other options.

$ dstack run . -f serve.dstack.yml

 BACKEND     REGION         RESOURCES                     SPOT  PRICE
 tensordock  unitedkingdom  10xCPU, 80GB, 1xA100 (80GB)   no    $1.595
 azure       westus3        24xCPU, 220GB, 1xA100 (80GB)  no    $3.673
 azure       westus2        24xCPU, 220GB, 1xA100 (80GB)  no    $3.673

Continue? [y/n]: y


Service is published at https://<run name>.<gateway domain>

When deploying the service, dstack run mounts the current folder's contents.


If there are large files or folders you'd like to avoid uploading, you can list them in .gitignore.
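For example, a .gitignore entry per folder you want excluded (the folder names here are hypothetical):

```
data/
checkpoints/
```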

See the CLI reference for more details on how dstack run works.

Service endpoint

Once the service is up, its endpoint is accessible at https://<run name>.<gateway domain>.

By default, the service endpoint requires the Authorization header with Bearer <dstack token>.

$ curl https://<run name>.<gateway domain>/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "NousResearch/Llama-2-7b-chat-hf",
        "messages": [
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming."
            }
        ]
    }'

Authorization can be disabled by setting auth to false in the service configuration file.
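In the YAML, that is a single top-level property (minimal fragment):

```yaml
type: service
auth: false
```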

Model endpoint

If the service has model mapping configured, you can also access the model at https://gateway.<gateway domain> via the OpenAI-compatible interface.
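As a sketch, such an OpenAI-compatible request can be assembled with Python's standard library alone. The base URL and token below are placeholders you'd replace with real values; the request is built here but not actually sent:

```python
import json
import urllib.request

# Placeholders -- substitute your gateway domain and dstack token.
base_url = "https://gateway.<gateway domain>"
token = "<dstack token>"

# Build an OpenAI-compatible chat completion request body.
payload = {
    "model": "NousResearch/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "user", "content": "Compose a poem that explains recursion."}
    ],
}

request = urllib.request.Request(
    url=f"{base_url}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    },
    method="POST",
)
# Once the placeholders are filled in, urllib.request.urlopen(request)
# sends the request and returns the model's response.
```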

Managing runs

Stopping runs

When you use dstack stop, the service and its cloud resources are deleted.

Listing runs

The dstack ps command lists all running runs and their status.

What's next?

  1. Check the Text Generation Inference and vLLM examples
  2. Check the .dstack.yml reference for more details and examples
  3. See gateways to learn how to set up a gateway
  4. Browse examples