dstack is an open-source control plane that simplifies GPU orchestration for both training and inference — across cloud providers, hardware vendors, and frameworks. Over the past year, we've been steadily making inference a first-class citizen in dstack.
In a recent engineering blog post, Graphsignal shared autodebug, an autonomous loop that deploys an inference service, benchmarks it, updates the deployment config, and redeploys it again. This case study looks at the team workflow behind that setup, and how dstack gives Graphsignal a common layer for GPU development, inference deployment, and benchmarking.
Andrej Karpathy’s autoresearch demo is a crisp example of “agentic engineering” in practice: a short Markdown spec (program.md) drives an automated research cycle that iterates many times on one GPU with minimal human involvement. This post extends that same idea one layer down.
While dstack started as a GPU-native orchestrator for development and training, over the last year it has increasingly brought inference to the forefront — making serving a first-class citizen.
At the end of last year, we introduced SGLang router integration — bringing cache-aware routing to services. Today, building on that integration, we’re adding native Prefill–Decode (PD) disaggregation.
We’re releasing dstack 0.20.0, a major update that improves how teams orchestrate GPU workloads for development, training, and inference. Most dstack releases are incremental and backward compatible, but this one introduces a few significant changes to how you work with dstack.
In dstack 0.20.0, fleets are now a first-class concept, giving you more explicit control over how GPU capacity is provisioned and managed. We’ve also added Events, which record important system activity (scheduling decisions, run status changes, and resource lifecycle updates), so it’s easier to understand what’s happening without digging through server logs.
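To make the fleet concept concrete, here is a minimal sketch of a fleet configuration. The name and GPU size are illustrative placeholders; the authoritative schema is in the dstack documentation:

```yaml
type: fleet
# Hypothetical name for this example
name: demo-fleet
# Provision two instances, each with at least 24GB of GPU memory
nodes: 2
resources:
  gpu: 24GB
```

A fleet defined this way would typically be provisioned with `dstack apply -f demo-fleet.dstack.yml`, after which runs can be scheduled onto its capacity.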
This post goes through the changes in detail and explains how to upgrade and migrate your existing setup.
In a recent engineering blog post, Toffee shared how they use dstack to run large-language and image-generation models across multiple GPU clouds, while keeping their core backend on AWS. This case study summarizes key insights and highlights how dstack became the backbone of Toffee’s multi-cloud inference stack.
dstack provides a streamlined way to handle GPU provisioning and workload orchestration across GPU clouds, Kubernetes clusters, or on-prem environments. Built for interoperability, dstack bridges diverse hardware and open-source tooling.
As disaggregated, low-latency inference architectures gain adoption, we aim to ensure this new stack runs natively on dstack. To move this forward, we’re introducing native integration between dstack and SGLang’s Model Gateway (formerly known as the SGLang Router).
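For orientation, a dstack service running SGLang might look like the following sketch. The service name, model, and resource values are assumptions for illustration, and Model Gateway or PD-specific settings are deliberately omitted since their exact syntax is defined by the dstack and SGLang docs:

```yaml
type: service
name: sglang-demo              # hypothetical name
image: lmsysorg/sglang:latest
commands:
  - python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 8000
port: 8000
resources:
  gpu: 24GB
replicas: 2                    # requests are load-balanced across replicas
```

With multiple replicas, a cache-aware router can direct requests to the replica most likely to hold the relevant KV cache, which is the behavior the Model Gateway integration builds on.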
With support from Graphsignal, our team gained access to the new NVIDIA DGX Spark and used it to validate how dstack operates on this hardware. This post walks through how to set it up with dstack and use it alongside existing on-prem clusters or GPU cloud environments to run workloads.
dstack gives teams a unified way to run and manage GPU-native containers across clouds and on-prem environments — without requiring Kubernetes.
At the same time, many organizations rely on Kubernetes as the foundation of their infrastructure.
To support these users, dstack is releasing the beta of its native Kubernetes integration.
This benchmark investigates whether the Prefill–Decode worker ratio needs to be managed dynamically at runtime, or whether a fixed split can deliver the same performance with simpler orchestration.
We evaluate different ratios across workload profiles and concurrency levels to measure their impact on TTFT, ITL, and throughput, and to see whether fixing the ratio in advance is a practical alternative to dynamic adjustment.
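For readers less familiar with the metrics, here is a minimal sketch (not the benchmark's actual code) of how TTFT, ITL, and throughput are typically derived from per-token arrival timestamps:

```python
def ttft_and_itl(token_timestamps):
    """Compute latency metrics for one streamed response.

    token_timestamps: arrival time (seconds) of each output token,
    measured from the moment the request was sent.
    """
    # Time To First Token: how long the user waits before anything appears
    ttft = token_timestamps[0]
    # Inter-Token Latency: gaps between consecutive tokens while streaming
    itl = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return ttft, itl


def throughput(total_output_tokens, wall_clock_seconds):
    """Aggregate output tokens per second across all concurrent requests."""
    return total_output_tokens / wall_clock_seconds
```

Prefill work dominates TTFT while decode work dominates ITL, which is why shifting the worker ratio between the two stages trades one metric against the other.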