Blog

Deploying inference endpoints with PD disaggregation on AMD GPUs

dstack is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases dstack supports out of the box.

dstack recently added native support for Prefill–Decode (PD) disaggregation. It works with Shepherd Model Gateway (SMG), a high-performance inference gateway evolved from the SGLang Router, on both NVIDIA and AMD GPUs, and with NVIDIA Dynamo on NVIDIA GPUs. This post walks through deploying a PD-disaggregated inference endpoint on AMD GPUs with SMG.

How Graphsignal uses dstack for inference benchmarking

In a recent engineering blog post, Graphsignal shared autodebug, an autonomous loop that deploys an inference service, benchmarks it, updates the deployment config, and redeploys it. This case study looks at the team workflow behind that setup and at how dstack gives Graphsignal a common layer for GPU development, inference deployment, and benchmarking.

Model inference with Prefill-Decode disaggregation

While dstack started as a GPU-native orchestrator for development and training, over the last year it has increasingly brought inference to the forefront — making serving a first-class citizen.

At the end of last year, we introduced SGLang router integration — bringing cache-aware routing to services. Today, building on that integration, we’re adding native Prefill–Decode (PD) disaggregation.

dstack 0.20 GA: Fleet-first UX and other important changes

We’re releasing dstack 0.20.0, a major update that improves how teams orchestrate GPU workloads for development, training, and inference. Most dstack updates are incremental and backward compatible, but this version introduces a few major changes to how you work with dstack.

In dstack 0.20.0, fleets are now a first-class concept, giving you more explicit control over how GPU capacity is provisioned and managed. We’ve also added Events, which record important system activity—such as scheduling decisions, run status changes, and resource lifecycle updates—so it’s easier to understand what’s happening without digging through server logs.

This post goes through the changes in detail and explains how to upgrade and migrate your existing setup.

SGLang router integration and disaggregated inference roadmap

dstack provides a streamlined way to handle GPU provisioning and workload orchestration across GPU clouds, Kubernetes clusters, or on-prem environments. Built for interoperability, dstack bridges diverse hardware and open-source tooling.

As disaggregated, low-latency inference emerges, we aim to ensure this new stack runs natively on dstack. To move this forward, we’re introducing native integration between dstack and SGLang’s Model Gateway (formerly known as the SGLang Router).

Orchestrating GPUs on Kubernetes clusters

dstack gives teams a unified way to run and manage GPU-native containers across clouds and on-prem environments — without requiring Kubernetes. At the same time, many organizations rely on Kubernetes as the foundation of their infrastructure.

To support these users, dstack is releasing the beta of its native Kubernetes integration.