Changelog

Deploying NVIDIA Dynamo PD disaggregation with dstack

dstack is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases dstack supports out of the box.

With the latest update, dstack added native support for NVIDIA Dynamo with Prefill-Decode (PD) disaggregation, letting a service run a Dynamo router, prefill workers, and decode workers as separate replica groups.

Deploying inference endpoints with PD disaggregation on AMD GPUs

dstack is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases dstack supports out of the box.

dstack recently added native support for Prefill–Decode (PD) disaggregation. It works with Shepherd Model Gateway (SMG) — a high-performance inference gateway evolved from the SGLang Router — on both NVIDIA and AMD, and with NVIDIA Dynamo on NVIDIA. This post walks through deploying it on AMD GPUs with SMG.

Model inference with Prefill-Decode disaggregation

While dstack started as a GPU-native orchestrator for development and training, over the last year it has increasingly brought inference to the forefront — making serving a first-class citizen.

At the end of last year, we introduced SGLang router integration — bringing cache-aware routing to services. Today, building on that integration, we’re adding native Prefill–Decode (PD) disaggregation.

dstack 0.20 GA: Fleet-first UX and other important changes

We’re releasing dstack 0.20.0, a major update that improves how teams orchestrate GPU workloads for development, training, and inference. Most dstack updates are incremental and backward compatible, but this version introduces a few major changes to how you work with dstack.

In dstack 0.20.0, fleets are now a first-class concept, giving you more explicit control over how GPU capacity is provisioned and managed. We’ve also added Events, which record important system activity—such as scheduling decisions, run status changes, and resource lifecycle updates—so it’s easier to understand what’s happening without digging through server logs.

This post goes through the changes in detail and explains how to upgrade and migrate your existing setup.

SGLang router integration and disaggregated inference roadmap

dstack provides a streamlined way to handle GPU provisioning and workload orchestration across GPU clouds, Kubernetes clusters, or on-prem environments. Built for interoperability, dstack bridges diverse hardware and open-source tooling.

As disaggregated, low-latency inference emerges, we aim to ensure this new stack runs natively on dstack. To move this forward, we’re introducing native integration between dstack and SGLang’s Model Gateway (formerly known as the SGLang Router).

Orchestrating GPUs on Kubernetes clusters

dstack gives teams a unified way to run and manage GPU-native containers across clouds and on-prem environments — without requiring Kubernetes. At the same time, many organizations rely on Kubernetes as the foundation of their infrastructure.

To support these users, dstack is releasing the beta of its native Kubernetes integration.

Nebius in dstack Sky GPU marketplace, with production-ready GPU clusters

dstack is an open-source control plane for orchestrating GPU workloads. It can provision cloud VMs, run on top of Kubernetes, or manage on-prem clusters. If you don’t want to self-host, you can use dstack Sky, the managed version of dstack that also provides access to cloud GPUs via its marketplace.

With our latest release, we’re excited to announce that Nebius, a purpose-built AI cloud for large-scale training and inference, has joined the dstack Sky marketplace to offer on-demand and spot GPUs, including clusters.

Orchestrating GPUs on DigitalOcean and AMD Developer Cloud

Orchestration automates provisioning, running jobs, and tearing them down. While Kubernetes and Slurm are powerful in their domains, they lack the lightweight, GPU-native focus modern teams need to move faster.

dstack is built entirely around GPUs. Our latest update introduces native integration with DigitalOcean and AMD Developer Cloud, enabling teams to provision cloud GPUs and run workloads more cost-efficiently.