Andrej Karpathy’s autoresearch demo is a crisp example of “agentic engineering” in practice: a short Markdown spec (program.md) drives an automated research cycle that iterates many times on one GPU with minimal human involvement. This post extends that same idea one layer down.
While dstack started as a GPU-native orchestrator for development and training, over the last year it has increasingly brought inference to the forefront — making serving a first-class citizen.
At the end of last year, we introduced SGLang router integration — bringing cache-aware routing to services. Today, building on that integration, we’re adding native Prefill–Decode (PD) disaggregation.
We’re releasing dstack 0.20.0, a major update that improves how teams orchestrate GPU workloads for development, training, and inference. Most dstack updates are incremental and backward compatible, but this version introduces a few major changes to how you work with dstack.
In dstack 0.20.0, fleets are now a first-class concept, giving you more explicit control over how GPU capacity is provisioned and managed. We’ve also added Events, which record important system activity—such as scheduling decisions, run status changes, and resource lifecycle updates—so it’s easier to understand what’s happening without digging through server logs.
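As an illustration, a fleet can be declared in a short YAML configuration. This is a minimal sketch; the fleet name is hypothetical, and exact fields may vary by dstack version and backend:

```yaml
type: fleet
name: my-gpu-fleet   # hypothetical name
nodes: 2             # number of instances to provision
resources:
  gpu: H100:8        # request 8x H100 per node
```

Applying a file like this provisions the capacity up front, so runs can then be scheduled onto the fleet explicitly instead of triggering ad-hoc provisioning.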
This post goes through the changes in detail and explains how to upgrade and migrate your existing setup.
dstack provides a streamlined way to handle GPU provisioning and workload orchestration across GPU clouds, Kubernetes clusters, or on-prem environments. Built for interoperability, dstack bridges diverse hardware and open-source tooling.
As disaggregated, low-latency inference stacks emerge, we aim to ensure they run natively on dstack. To move this forward, we’re introducing native integration between dstack and SGLang’s Model Gateway (formerly known as the SGLang Router).
dstack gives teams a unified way to run and manage GPU-native containers across clouds and on-prem environments — without requiring Kubernetes.
At the same time, many organizations rely on Kubernetes as the foundation of their infrastructure.
To support these users, dstack is releasing the beta of its native Kubernetes integration.
dstack is an open-source control plane for orchestrating GPU workloads. It can provision cloud VMs, run on top of Kubernetes, or manage on-prem clusters. If you don’t want to self-host, you can use dstack Sky, the managed version of dstack that also provides access to cloud GPUs via its marketplace.
With our latest release, we’re excited to announce that Nebius, a purpose-built AI cloud for large-scale training and inference, has joined the dstack Sky marketplace to offer on-demand and spot GPUs, including clusters.
Orchestration automates provisioning, running jobs, and tearing them down. While Kubernetes and Slurm are powerful in their domains, they lack the lightweight, GPU-native focus modern teams need to move faster.
dstack is built entirely around GPUs. Our latest update introduces native integration with DigitalOcean and AMD Developer Cloud, enabling teams to provision cloud GPUs and run workloads more cost-efficiently.
dstack services are long-running workloads—most often inference endpoints and sometimes web apps—that run continuously on GPU or CPU instances. They can scale across replicas and support rolling deployments.
This release adds HTTP probes inspired by Kubernetes readiness probes. Probes periodically call an endpoint on each replica (for example, /health) to confirm it responds as expected. The results give clear visibility into startup progress and, during rolling deployments, ensure traffic only shifts to a replacement replica after all of its configured probes have passed.
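A probe-enabled service might look like the following. This is a hedged sketch: the service name and model are placeholders, and the `probes` field names are an assumption about the schema rather than a verbatim reference:

```yaml
type: service
name: llm-service                # hypothetical name
image: vllm/vllm-openai:latest
commands:
  - vllm serve meta-llama/Llama-3.1-8B-Instruct
port: 8000
replicas: 2
resources:
  gpu: 24GB
probes:                          # assumed schema for HTTP probes
  - type: http
    url: /health                 # endpoint polled on each replica
```

With a configuration along these lines, a rolling deployment would hold traffic on the old replica until the new one's probe endpoint starts responding successfully.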
In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.
dstack already supports GPU telemetry monitoring through NVIDIA DCGM metrics, covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM background health checks. With these, dstack continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.
As the ecosystem around AMD GPUs matures, developers are looking for easier ways to experiment with ROCm, benchmark new architectures, and run cost-effective workloads—without manual infrastructure setup.
dstack is an open-source orchestrator designed for AI workloads, providing a lightweight, container-native alternative to Kubernetes and Slurm.
Today, we’re excited to announce native integration with Hot Aisle, an AMD-only GPU neocloud offering VMs and clusters at highly competitive on-demand pricing.