
Changelog

SGLang router integration and disaggregated inference roadmap

dstack provides a streamlined way to handle GPU provisioning and workload orchestration across GPU clouds, Kubernetes clusters, and on-prem environments. Built for interoperability, dstack bridges diverse hardware and open-source tooling.

As disaggregated, low-latency inference emerges as the next generation of the serving stack, we aim to ensure it runs natively on dstack. To move this forward, we’re introducing a native integration between dstack and SGLang’s Model Gateway (formerly known as the SGLang Router).
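
To give a flavor of what this looks like in practice, here is a minimal sketch of a dstack service that launches an SGLang server per replica, with routing handled across replicas. The image, model, and resource values below are illustrative assumptions, not a prescribed setup:

type: service
name: llama-sglang
image: lmsysorg/sglang:latest
commands:
  - python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 8000
port: 8000
resources:
  gpu: 24GB
replicas: 2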

Orchestrating GPUs on Kubernetes clusters

dstack gives teams a unified way to run and manage GPU-native containers across clouds and on-prem environments — without requiring Kubernetes. At the same time, many organizations rely on Kubernetes as the foundation of their infrastructure.

To support these users, dstack is releasing the beta of its native Kubernetes integration.
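
With the beta, you point dstack at an existing cluster by declaring a kubernetes backend in the server configuration. A minimal sketch, assuming a local kubeconfig (property names may evolve during the beta):

projects:
- name: main
  backends:
  - type: kubernetes
    kubeconfig:
      filename: ~/.kube/config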

Nebius in dstack Sky GPU marketplace, with production-ready GPU clusters

dstack is an open-source control plane for orchestrating GPU workloads. It can provision cloud VMs, run on top of Kubernetes, or manage on-prem clusters. If you don’t want to self-host, you can use dstack Sky, the managed version of dstack that also provides access to cloud GPUs via its marketplace.

With our latest release, we’re excited to announce that Nebius, a purpose-built AI cloud for large-scale training and inference, has joined the dstack Sky marketplace to offer on-demand and spot GPUs, including clusters.

Orchestrating GPUs on DigitalOcean and AMD Developer Cloud

Orchestration automates provisioning, running jobs, and tearing them down. While Kubernetes and Slurm are powerful in their domains, they lack the lightweight, GPU-native focus modern teams need to move faster.

dstack is built entirely around GPUs. Our latest update introduces native integration with DigitalOcean and AMD Developer Cloud, enabling teams to provision cloud GPUs and run workloads more cost-efficiently.
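
For example, a fleet configuration can restrict provisioning to the new backends. This is a rough sketch; the backend identifiers and GPU spec below are assumptions for illustration:

type: fleet
name: amd-fleet
nodes: 1
backends: [digitalocean, amddevcloud]
resources:
  gpu: MI300X:1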

Introducing service probes

dstack services are long-running workloads—most often inference endpoints and sometimes web apps—that run continuously on GPU or CPU instances. They can scale across replicas and support rolling deployments.

This release adds HTTP probes inspired by Kubernetes readiness probes. Probes periodically call an endpoint on each replica (for example, /health) to confirm it responds as expected. The result gives clear visibility into startup progress and, during rolling deployments, ensures traffic shifts to a replacement replica only after all of its configured probes have passed.
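
In a service configuration, a probe might be declared along these lines. This is an illustrative sketch rather than the exact schema; consult the dstack reference for the authoritative property names:

type: service
name: my-service
image: my-inference-image  # hypothetical image
port: 8000
replicas: 2
probes:
  - type: http
    url: /health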

Introducing passive GPU health checks

In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.

dstack already supports GPU telemetry monitoring through NVIDIA DCGM metrics, covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM background health checks. With these, dstack continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.
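
Under the hood, these are the same background watches that DCGM exposes through its dcgmi CLI. The commands below illustrate the mechanism dstack automates; group IDs and watch flags will vary with your setup:

# Enable background health watches on GPU group 0 (m = memory, p = PCIe, i = infoROM)
$ dcgmi health -g 0 -s mpi

# Query the most recent passive health status for the group
$ dcgmi health -g 0 -c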

Supporting Hot Aisle AMD AI Developer Cloud

As the ecosystem around AMD GPUs matures, developers are looking for easier ways to experiment with ROCm, benchmark new architectures, and run cost-effective workloads—without manual infrastructure setup.

dstack is an open-source orchestrator designed for AI workloads, providing a lightweight, container-native alternative to Kubernetes and Slurm.

Today, we’re excited to announce native integration with Hot Aisle, an AMD-only GPU neocloud offering VMs and clusters at highly competitive on-demand pricing.

Rolling deployments, Secrets, Files, Tenstorrent, and more

Thanks to feedback from the community, dstack continues to evolve. Here’s a look at what’s new.

Rolling deployments

Previously, updating running services could cause downtime. The latest release fixes this with rolling deployments. Replicas are now updated one by one, allowing uninterrupted traffic during redeployments.

$ dstack apply -f .dstack.yml

Active run my-service already exists. Detected changes that can be updated in-place:
- Repo state (branch, commit, or other)
- File archives
- Configuration properties:
  - env
  - files

Update the run? [y/n]: y

Launching my-service...

 NAME                      BACKEND          PRICE    STATUS       SUBMITTED
 my-service deployment=1                             running      11 mins ago
   replica=0 deployment=0  aws (us-west-2)  $0.0026  terminating  11 mins ago
   replica=1 deployment=1  aws (us-west-2)  $0.0026  running      1 min ago

Supporting GPU provisioning and orchestration on Nebius

As demand for GPU compute continues to scale, open-source tools tailored for AI workloads are becoming critical to developer velocity and efficiency. dstack is an open-source orchestrator purpose-built for AI infrastructure—offering a lightweight, container-native alternative to Kubernetes and Slurm.

Today, we’re announcing native integration with Nebius, offering a streamlined developer experience for teams using GPUs for AI workloads.
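
Once the nebius backend is configured, any run configuration can target it explicitly. A minimal dev environment sketch, with illustrative resources:

type: dev-environment
name: cuda-dev
ide: vscode
resources:
  gpu: H100:1
backends: [nebius]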