Deploying NVIDIA Dynamo PD disaggregation with dstack
dstack is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases dstack supports out of the box.
With the latest update, dstack added native support for NVIDIA Dynamo with Prefill-Decode (PD) disaggregation, letting a service run a Dynamo router, prefill workers, and decode workers as separate replica groups.

About NVIDIA Dynamo
NVIDIA Dynamo is an open-source, high-throughput, low-latency inference framework for serving generative AI workloads in distributed environments. It adds a system-level layer above inference engines such as SGLang, vLLM, and TensorRT-LLM, coordinating them across GPUs and nodes.
Dynamo brings together disaggregated serving, intelligent routing, KV cache management, KV cache transfer, and automatic scaling to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video generation workloads.
PD disaggregation
Prefill-Decode disaggregation separates the two phases of LLM inference: prompt processing (prefill) and token generation (decode). Prefill processes all prompt tokens in parallel and is compute-bound. Decode generates tokens one at a time and is memory-bandwidth-bound. Running the two phases as separate pools lets each one be sized and scaled independently.
PD disaggregation with dstack
To deploy NVIDIA Dynamo with PD disaggregation, define a service with three replica groups:
- a Dynamo router
- prefill workers
- decode workers
The router replica group declares router: { type: dynamo }. This tells dstack to route external traffic only to the router replica and to inject DSTACK_ROUTER_INTERNAL_IP into the worker replicas after the router is provisioned.
This support was introduced in 0.20.20.
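As a minimal sketch of how a worker replica can consume the injected variable, the snippet below fails fast when DSTACK_ROUTER_INTERNAL_IP is missing and otherwise derives the etcd and NATS endpoints from it. The function name export_dynamo_endpoints is hypothetical and not part of dstack or Dynamo; ports 2379 and 4222 are the ones used in the service configuration in this post.

```shell
# Hypothetical helper (not part of dstack or Dynamo): derive the etcd and
# NATS endpoints from the router IP that dstack injects into worker replicas.
export_dynamo_endpoints() {
  # Abort with a clear message if the variable has not been injected.
  if [ -z "${DSTACK_ROUTER_INTERNAL_IP:-}" ]; then
    echo "DSTACK_ROUTER_INTERNAL_IP is not set" >&2
    return 1
  fi
  # 2379 (etcd) and 4222 (NATS) match the ports used in this post's configuration.
  export ETCD_ENDPOINTS="http://${DSTACK_ROUTER_INTERNAL_IP}:2379"
  export NATS_SERVER="nats://${DSTACK_ROUTER_INTERNAL_IP}:4222"
}
```

Calling export_dynamo_endpoints at the top of a worker's commands would surface a missing router IP immediately instead of letting the worker start with malformed endpoints.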
Prerequisites
Running PD disaggregation on dstack requires a fleet with cluster placement, because prefill and decode workers need a fast interconnect for KV cache transfer.
The prefill and decode replicas run on GPUs. The router replica can run on CPU, but it must run in the same cluster.
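For illustration, a cluster fleet could be described roughly as follows. This is a sketch under assumptions: the helper name write_cluster_fleet_config, the fleet name dynamo-fleet, the node count, and the GPU type are all illustrative, and the right values depend on the backend and on how many replicas the service needs.

```shell
# Hypothetical helper: write a minimal cluster fleet configuration to the
# given path. Fleet name, node count, and GPU type are illustrative only.
write_cluster_fleet_config() {
  cat > "$1" <<'EOF'
type: fleet
name: dynamo-fleet

# Number of instances in the fleet (illustrative)
nodes: 3
# Provision the instances as an interconnected cluster
placement: cluster

resources:
  gpu: H200
EOF
}
```

A typical use would be to run write_cluster_fleet_config fleet.dstack.yml and then dstack apply -f fleet.dstack.yml before applying the service.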
Deploying the service
Here's a complete service configuration that deploys zai-org/GLM-4.5-Air-FP8 with NVIDIA Dynamo, SGLang workers, and PD disaggregation on dstack:
type: service
name: dynamo-pd

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    docker: true
    commands:
      - apt-get update
      - apt-get install -y python3-dev python3-venv
      - python3 -m venv ~/dyn-venv
      - source ~/dyn-venv/bin/activate
      - pip install -U pip
      - pip install "ai-dynamo[sglang]==1.1.1"
      - git clone https://github.com/ai-dynamo/dynamo.git
      # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend.
      - docker compose -f dynamo/dev/docker-compose.yml up -d
      - |
        python3 -m dynamo.frontend \
          --http-host 0.0.0.0 --http-port 8000 \
          --discovery-backend etcd --router-mode kv \
          --kv-cache-block-size 64
    resources:
      cpu: 4
    router:
      type: dynamo

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    python: "3.12"
    nvcc: true
    commands:
      # dstack injects DSTACK_ROUTER_INTERNAL_IP after the router replica
      # is provisioned. Compose the etcd/NATS endpoints from it.
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      # Set to enable /health endpoint required by dstack probes.
      - export DYN_SYSTEM_PORT="8000"
      # Wait until the router's etcd and NATS ports are actually accepting connections.
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
           && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode prefill --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    python: "3.12"
    nvcc: true
    commands:
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      - export DYN_SYSTEM_PORT="8000"
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
           && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode decode --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s
The router replica group starts the Dynamo HTTP frontend and the NATS/etcd compose stack used by the workers. It declares router: { type: dynamo }, so dstack treats it as the service router.
The prefill and decode replica groups use the router's internal IP to set ETCD_ENDPOINTS and NATS_SERVER, wait for those services to become reachable, then start dynamo.sglang in either prefill or decode mode. DYN_SYSTEM_PORT=8000 exposes the /health endpoint required by the dstack probe.
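The wait loop in the worker commands retries indefinitely. As a hedged variation, the sketch below wraps the same /dev/tcp check in a function that gives up after a timeout, so a worker fails visibly if the router never comes up. The function name wait_for_tcp and the 300-second default are assumptions made for illustration, and the /dev/tcp redirection requires bash.

```shell
# Hypothetical helper: wait until host:port accepts TCP connections, giving
# up after a timeout in seconds (default 300). Requires bash for /dev/tcp.
wait_for_tcp() {
  local host=$1 port=$2 timeout=${3:-300}
  local deadline=$(( $(date +%s) + timeout ))
  # The subshell opens a TCP connection and closes it again on exit.
  until (echo > "/dev/tcp/$host/$port") 2>/dev/null; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $host:$port" >&2
      return 1
    fi
    sleep 1
  done
}
```

In a worker, this could be used as wait_for_tcp "$DSTACK_ROUTER_INTERNAL_IP" 2379 && wait_for_tcp "$DSTACK_ROUTER_INTERNAL_IP" 4222 before starting dynamo.sglang.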
In this setup, Dynamo uses etcd for worker discovery and NATS for worker and KV-cache events used by the router. NIXL handles KV cache transfer between prefill and decode workers. dstack handles provisioning, service exposure, health probes, and independent scaling of the prefill and decode replica groups.
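To get a feel for how the rps targets and count ranges in the configuration interact, here is a back-of-the-envelope estimate. It is not dstack's autoscaling algorithm: it assumes that target means requests per second per replica and ignores scaling delays and how request rate is attributed to each group. The function name estimate_replicas is hypothetical.

```shell
# Rough sizing estimate, NOT dstack's actual autoscaling logic:
# replicas = ceil(rps / target), clamped to the [min, max] count range.
estimate_replicas() {
  local rps=$1 target=$2 min=$3 max=$4
  # Integer ceiling division.
  local n=$(( (rps + target - 1) / target ))
  if [ "$n" -lt "$min" ]; then n=$min; fi
  if [ "$n" -gt "$max" ]; then n=$max; fi
  echo "$n"
}

estimate_replicas 10 3 1 4   # prefill group: target 3, count 1..4
estimate_replicas 10 2 1 8   # decode group: target 2, count 1..8
```

Under these assumptions, a steady 10 requests per second would call for 4 prefill replicas (the top of that group's range) and 5 decode replicas.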
With the dynamo router, dstack can run SGLang, vLLM, or TensorRT-LLM prefill and decode workers.
Apply the configuration:
$ export HF_TOKEN=...
$ dstack apply -f dynamo-pd.dstack.yml
Once provisioning completes, dstack exposes a single OpenAI-compatible endpoint. Without a gateway, the endpoint is available through the server proxy:
$ curl http://127.0.0.1:3000/proxy/services/main/dynamo-pd/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "zai-org/GLM-4.5-Air-FP8",
      "messages": [
        {
          "role": "user",
          "content": "What is prefill-decode disaggregation?"
        }
      ],
      "max_tokens": 1024
    }'
If a gateway is configured, the service endpoint is available at https://dynamo-pd.<gateway domain>/.
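The two endpoint forms can be summarized in a small helper. This sketch assumes that main in the proxy URL above is the project name and dynamo-pd is the run name; the function name service_base_url is hypothetical, and example.com stands in for a real gateway domain.

```shell
# Hypothetical helper: print the base URL of a dstack service.
#   with a gateway domain:  https://<run name>.<gateway domain>/
#   without a gateway:      <server URL>/proxy/services/<project>/<run name>/
service_base_url() {
  local run_name=$1 project=$2 server_url=$3 gateway_domain=${4:-}
  if [ -n "$gateway_domain" ]; then
    echo "https://${run_name}.${gateway_domain}/"
  else
    # Strip a trailing slash from the server URL to avoid "//" in the path.
    echo "${server_url%/}/proxy/services/${project}/${run_name}/"
  fi
}

service_base_url dynamo-pd main http://127.0.0.1:3000
service_base_url dynamo-pd main http://127.0.0.1:3000 example.com
```

Appending v1/chat/completions to either result gives the chat completions URL used in the request above.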
Limitations
- The router replica group must use count: 1.
- Services with a Dynamo router cannot configure retry, because workers cache the router's internal IP at provisioning time.
- In-place updates are blocked when they would replace the Dynamo router replica. If the router gets a new internal IP, already-running workers would still point to the old etcd and NATS endpoints. Stop the run and apply again for router-affecting changes.
- The scaling blocks use dstack service autoscaling, which currently scales replica groups based on rps. Support for scaling based on inference metrics such as TTFT and ITL is planned.
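For router-affecting changes, the stop-and-reapply step could be scripted along these lines. The function name redeploy_service is hypothetical, and the sketch assumes the run has fully stopped by the time the configuration is applied again; otherwise dstack apply may ask for confirmation.

```shell
# Hypothetical helper for router-affecting changes: stop the run, then apply
# the configuration again so the workers pick up the new router IP.
redeploy_service() {
  local run_name=$1 config=$2
  # -y skips the interactive confirmation prompt.
  dstack stop -y "$run_name" || return 1
  dstack apply -f "$config"
}
```

For the service in this post, that would be redeploy_service dynamo-pd dynamo-pd.dstack.yml.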
Why this matters
Dynamo brings system-level inference optimizations such as disaggregated serving, KV-aware routing, KV cache transfer, and coordination across workers. dstack complements it with orchestration for provisioning compute, cluster placement, service exposure, health probes, and independent scaling of worker groups.
With native Dynamo support, dstack streamlines high-throughput inference with leading open-source serving frameworks, while avoiding custom deployment glue. The same dstack orchestration layer can be used for training, inference, and development across GPU clouds, Kubernetes clusters, and on-prem fleets.
What's next?
- Read the NVIDIA Dynamo example
- Read about services, fleets, and gateways
- Review the NVIDIA Dynamo documentation and Dynamo GitHub repository
- Join Discord