Services

Services allow you to deploy models or any web app as a secure and scalable endpoint.

When running models, services provide access through the unified OpenAI-compatible endpoint.

Define a configuration

First, create a YAML file in your project folder. Its name must end with .dstack.yml (e.g., .dstack.yml and service.dstack.yml are both acceptable).

type: service
name: llama31

# If `image` is not specified, dstack uses its default image
python: "3.11"
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB

If you don't specify your Docker image, dstack uses the base image (pre-configured with Python, Conda, and essential CUDA drivers).
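If you prefer your own image, set the image property instead. Below is a minimal sketch; the image name and the single serve command are illustrative only, not a recommended setup:

type: service
name: llama31

# Use a custom Docker image instead of dstack's default
image: vllm/vllm-openai:latest
env:
  - HF_TOKEN
commands:
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
port: 8000

resources:
  gpu: 24GB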

Note that the model property is optional. It is not needed when deploying a non-OpenAI-compatible model or a regular web app.

Gateway

To enable auto-scaling, or to use a custom domain with HTTPS, set up a gateway before running the service. If you're using dstack Sky, a gateway is pre-configured for you.
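A gateway is itself described by a configuration file and created with dstack apply. The sketch below is illustrative; the backend, region, and domain are placeholders for your own values:

type: gateway
name: example-gateway

# Placeholders — must match your cloud account and DNS setup
backend: aws
region: eu-west-1
domain: example.com

$ dstack apply -f gateway.dstack.yml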

Reference

See .dstack.yml for all the options supported by services, along with multiple examples.

Run a service

To run a configuration, use the dstack apply command.

$ HF_TOKEN=...
$ dstack apply -f service.dstack.yml

 #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33

Submit the run llama31? [y/n]: y

Provisioning...
---> 100%

Service is published at: 
  http://localhost:3000/proxy/services/main/llama31/
Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
  http://localhost:3000/proxy/models/main/

dstack apply automatically provisions instances, uploads the contents of the repo (incl. your local uncommitted changes), and runs the configuration.

Access the endpoint

Service

If no gateway is created, the service’s endpoint will be accessible at <dstack server URL>/proxy/services/<project name>/<run name>/.

$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming."
            }
        ]
    }'

When a gateway is configured, the service endpoint will be accessible at https://<run name>.<gateway domain>.

By default, the service endpoint requires the Authorization header with Bearer <dstack token>. Authorization can be disabled by setting auth to false in the service configuration file.
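For example, to expose the endpoint without token authentication, the configuration could include the following (a minimal sketch based on the auth option described above):

type: service
name: llama31

# Disable the Authorization header requirement for this endpoint
auth: false

# ...the rest of the service configuration (commands, port, resources) stays the same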

Model

If the service defines the model property, the model can be accessed with the OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/, or via the control plane UI's playground.
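For example, the model registered above can be queried with curl. This is a sketch that assumes the standard OpenAI chat completions route under the base URL printed by dstack apply:

$ curl http://localhost:3000/proxy/models/main/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Say hello in one sentence."}
        ]
    }'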

When a gateway is configured, the OpenAI-compatible endpoint is available at https://gateway.<gateway domain>/.

Manage runs

List runs

The dstack ps command lists all running jobs and their statuses. Use --watch (or -w) to monitor the live status of runs.
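For example, to keep the list refreshing as statuses change:

$ dstack ps --watch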

Stop a run

Once the run exceeds the max_duration, or when you use dstack stop, the service is stopped. Use --abort or -x to stop the run abruptly.
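For example, to stop the service from this guide gracefully, or to abort it immediately:

$ dstack stop llama31
$ dstack stop --abort llama31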

Manage fleets

Creation policy

By default, when you run dstack apply with a dev environment, task, or service, dstack reuses idle instances from an existing fleet. If no idle instances match the requirements, it automatically creates a new fleet using configured backends.

To ensure dstack apply doesn't create a new fleet but reuses an existing one, pass -R (or --reuse) to dstack apply.

$ dstack apply -R -f examples/.dstack.yml

Alternatively, set creation_policy to reuse in the run configuration.
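A minimal sketch of the same policy set in the configuration file:

type: service
name: llama31

# Only reuse idle instances from existing fleets; never provision a new one
creation_policy: reuse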

Termination policy

If a fleet is created automatically, it remains idle for 5 minutes and can be reused within that time. To change the default idle duration, set termination_idle_time in the run configuration (e.g., to 0 or a longer duration).
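For example, to terminate an automatically created fleet as soon as the run finishes (a sketch using the option named above; other duration values are also possible):

type: service
name: llama31

# Don't keep the automatically created fleet idle after the run finishes
termination_idle_time: 0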

Fleets

For greater control over fleet provisioning, configuration, and lifecycle management, it is recommended to use fleets directly.

What's next?

  1. Read about dev environments, tasks, and repos
  2. Learn how to manage fleets
  3. See how to set up gateways
  4. Check the TGI, vLLM, and NIM examples