Andrej Karpathy’s autoresearch demo is a crisp example of “agentic engineering” in practice: a short Markdown spec (program.md) drives an automated research cycle that iterates many times on one GPU with minimal human involvement. This post extends that same idea one layer down.
While dstack started as a GPU-native orchestrator for development and training, over the last year it has increasingly brought inference to the forefront — making serving a first-class citizen.
At the end of last year, we introduced SGLang router integration — bringing cache-aware routing to services. Today, building on that integration, we’re adding native Prefill–Decode (PD) disaggregation.
We’re releasing dstack 0.20.0, a major update that improves how teams orchestrate GPU workloads for development, training, and inference. Most dstack updates are incremental and backward compatible, but this version introduces a few major changes to how you work with dstack.
In dstack 0.20.0, fleets are now a first-class concept, giving you more explicit control over how GPU capacity is provisioned and managed. We’ve also added Events, which record important system activity—such as scheduling decisions, run status changes, and resource lifecycle updates—so it’s easier to understand what’s happening without digging through server logs.
This post goes through the changes in detail and explains how to upgrade and migrate your existing setup.
In a recent engineering blog post, Toffee shared how they use dstack to run large language and image-generation models across multiple GPU clouds, while keeping their core backend on AWS. This case study summarizes key insights and highlights how dstack became the backbone of Toffee's multi-cloud inference stack.
dstack provides a streamlined way to handle GPU provisioning and workload orchestration across GPU clouds, Kubernetes clusters, or on-prem environments. Built for interoperability, dstack bridges diverse hardware and open-source tooling.
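To make this concrete, a dstack workload is described in a short YAML file and submitted with `dstack apply`. The sketch below is a minimal, illustrative service configuration; the field names follow dstack's documented service schema, but the service name, image, model, and GPU size are assumptions for illustration, not taken from the posts above.

```yaml
# Hypothetical service.dstack.yml — a minimal dstack service sketch.
# Model path, image, and resources are illustrative assumptions.
type: service
name: llm-service

image: lmsysorg/sglang:latest
commands:
  - python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 8000
port: 8000

resources:
  gpu: 24GB
```

With a configured dstack server, such a file would typically be deployed via `dstack apply -f service.dstack.yml`, and dstack handles provisioning the GPU capacity behind it, whether on a cloud, Kubernetes, or on-prem.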
As disaggregated, low-latency inference emerges as the new serving stack, we aim to ensure it runs natively on dstack. To move this forward, we're introducing a native integration between dstack and SGLang's Model Gateway (formerly known as the SGLang Router).
With support from Graphsignal, our team gained access to the new NVIDIA DGX Spark and used it to validate how dstack operates on this hardware. This post walks through how to set it up with dstack and use it alongside existing on-prem clusters or GPU cloud environments to run workloads.
dstack gives teams a unified way to run and manage GPU-native containers across clouds and on-prem environments — without requiring Kubernetes.
At the same time, many organizations rely on Kubernetes as the foundation of their infrastructure.
To support these users, dstack is releasing the beta of its native Kubernetes integration.
This benchmark investigates whether the Prefill–Decode worker ratio needs to be managed dynamically at runtime, or if a fixed split can deliver the same performance with simpler orchestration.
We evaluate different ratios across workload profiles and concurrency levels to measure their impact on TTFT, ITL, and throughput, and to see whether fixing the ratio in advance is a practical alternative to dynamic adjustment.
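For reference, the three metrics above have standard definitions: TTFT is the delay from request submission to the first output token, ITL is the average gap between consecutive output tokens, and throughput is output tokens per unit time. The helper below is a small sketch of these definitions from per-token arrival timestamps; it is not the benchmark's actual measurement harness.

```python
from statistics import mean

def latency_metrics(request_start, token_times):
    """Compute TTFT, mean ITL, and output-token throughput from
    per-token arrival timestamps (in seconds).

    Standard metric definitions; illustrative only, not the
    benchmark's measurement code."""
    ttft = token_times[0] - request_start  # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0      # inter-token latency
    throughput = len(token_times) / (token_times[-1] - request_start)
    return ttft, itl, throughput

# Example: four tokens arriving at fixed 0.1 s intervals
ttft, itl, tput = latency_metrics(0.0, [0.2, 0.3, 0.4, 0.5])
# ttft = 0.2 s, itl = 0.1 s, throughput = 8 tokens/s
```

In a PD-disaggregated setup, shifting the worker ratio toward prefill tends to improve TTFT, while shifting it toward decode tends to improve ITL, which is why the ratio is worth benchmarking at all.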
dstack is an open-source control plane for orchestrating GPU workloads. It can provision cloud VMs, run on top of Kubernetes, or manage on-prem clusters. If you don't want to self-host, you can use dstack Sky, the managed version of dstack that also provides access to cloud GPUs via its marketplace.
With our latest release, we're excited to announce that Nebius, a purpose-built AI cloud for large-scale training and inference, has joined the dstack Sky marketplace to offer on-demand and spot GPUs, including clusters.
This is a practical map for teams renting GPUs — whether you’re a single project team fine-tuning models or a production-scale team managing thousand-GPU workloads. We’ll break down where providers fit, what actually drives performance, how pricing really works, and how to design a control plane that makes multi-cloud not just possible, but a competitive advantage.