Services

A service allows you to deploy a model or a web app as an endpoint. It lets you configure dependencies, resources, authorization, auto-scaling rules, etc.

Define a configuration

First, create a YAML file in your project folder. Its name must end with .dstack.yml (e.g. .dstack.yml or service.dstack.yml are both acceptable).

type: service
name: llama31

# If `image` is not specified, dstack uses its default image
python: "3.11"
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB

If you don't specify your Docker image, dstack uses the base image (pre-configured with Python, Conda, and essential CUDA drivers).
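
If you'd rather bring your own image, set the image property instead. Below is a minimal sketch; the image name is illustrative, and any image that has your dependencies pre-installed works:

type: service
name: llama31

# Illustrative: an image with vLLM pre-installed, so no pip install is needed
image: vllm/vllm-openai:latest
commands:
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
port: 8000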

Note that the model property is optional; it isn't needed when deploying a non-OpenAI-compatible model or a regular web app.
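
For example, here is a minimal sketch of a service for a regular web app, with no model property. The FastAPI app and the app:app module path are illustrative and assume an app.py in your repo:

type: service
name: my-app

python: "3.11"
commands:
  - pip install fastapi uvicorn
  - uvicorn app:app --host 0.0.0.0 --port 8000
port: 8000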

Gateway

To enable auto-scaling, or use a custom domain with HTTPS, set up a gateway before running the service. If you're using dstack Sky, a gateway is pre-configured for you.
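
A gateway is itself described by a configuration file and created with dstack apply. A minimal sketch, where the backend, region, and domain are illustrative:

type: gateway
name: example-gateway

backend: aws
region: eu-west-1
domain: example.com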

Reference

See .dstack.yml for all the options supported by services, along with multiple examples.

Run a service

To run a configuration, use the dstack apply command.

$ export HF_TOKEN=...
$ dstack apply -f service.dstack.yml

 #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33

Submit the run llama31? [y/n]: y

Provisioning...
---> 100%

Service is published at: 
  http://localhost:3000/proxy/services/main/llama31/

dstack apply automatically uploads the code from the current repo, including your local uncommitted changes. To avoid uploading large files, ensure they are listed in .gitignore.
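
For example, if your repo holds local datasets or checkpoints (the paths below are illustrative), list them in .gitignore so they aren't uploaded:

data/
checkpoints/
*.safetensors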

Access the endpoint

Service

If no gateway is created, the service’s endpoint will be accessible at <dstack server URL>/proxy/services/<project name>/<run name>/.

$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming."
            }
        ]
    }'

When a gateway is configured, the service endpoint will be accessible at https://<run name>.<gateway domain>.
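
For example, with a gateway whose domain is example.com (an illustrative domain), the request from above becomes:

$ curl https://llama31.example.com/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'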

By default, the service endpoint requires the Authorization header with Bearer <dstack token>. Authorization can be disabled by setting auth to false in the service configuration file.
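
For example, a minimal sketch of a public endpoint; note that anyone who can reach the endpoint can then call it:

type: service
name: llama31

# Disable the Authorization header requirement
auth: false

# The rest of the configuration stays the same as above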

Model

If the service defines the model property, the model can be accessed with the OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/, or via the control plane UI's playground.
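
For example, with curl; this sketch assumes the default main project and that the OpenAI-style /chat/completions path is appended to the base URL:

$ curl http://localhost:3000/proxy/models/main/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'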

When a gateway is configured, the OpenAI-compatible endpoint is available at https://gateway.<gateway domain>/.

Manage runs

List runs

The dstack ps command lists all running jobs and their statuses. Use --watch (or -w) to monitor the live status of runs.
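
For example, to keep the list refreshing in the terminal:

$ dstack ps --watch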

Stop a run

Once the run exceeds the max_duration, or when you use dstack stop, the service is stopped. Use --abort or -x to stop the run abruptly.
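
For example, to stop the service from the configuration above (the run name comes from its name property):

$ dstack stop llama31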

Manage fleets

By default, dstack apply reuses idle instances from one of the existing fleets, or creates a new fleet through backends.

Idle duration

To ensure the created fleets are deleted automatically, set termination_idle_time. By default, it's set to 5min.
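
For example, a minimal sketch that keeps idle instances around longer; this assumes termination_idle_time is accepted at the top level of the configuration, alongside the other options shown above:

type: service
name: llama31

# Delete idle instances after 30 minutes instead of the default 5min
termination_idle_time: 30min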

Creation policy

To ensure dstack apply always reuses an existing fleet and doesn't create a new one, pass --reuse to dstack apply (or set creation_policy to reuse in the service configuration). The default policy is reuse_or_create.
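
For example:

$ dstack apply -f service.dstack.yml --reuse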

What's next?

  1. Check the TGI, vLLM, and NIM examples
  2. See gateways on how to set up a gateway
  3. Browse examples
  4. See fleets on how to manage fleets
