Supporting ARM and NVIDIA GH200 on Lambda
The latest update to dstack introduces support for NVIDIA GH200 instances on Lambda and enables ARM-powered hosts, including GH200 and GB200, with SSH fleets.
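For ARM-based machines you already operate, an SSH fleet simply points dstack at the hosts. Below is a minimal sketch, assuming a GH200 box reachable over SSH; the fleet name, address, user, and key path are placeholders:

```yaml
type: fleet
# Hypothetical name for illustration
name: arm-gh200-fleet

# SSH fleets attach existing machines rather than provisioning new ones;
# dstack inspects each host's hardware, including ARM CPUs and GH200 GPUs.
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.100.10  # placeholder address of a GH200 host
```

Applying the file with `dstack apply -f fleet.dstack.yml` registers the hosts so runs can be scheduled on them.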
As demand for GPU compute continues to scale, open-source tools tailored for AI workloads are becoming critical to developer velocity and efficiency. dstack is an open-source orchestrator purpose-built for AI infrastructure, offering a lightweight, container-native alternative to Kubernetes and Slurm.

Today, we’re announcing native integration with Nebius, bringing a streamlined developer experience to teams using GPUs for AI workloads.
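Once the Nebius backend is configured on your dstack server, pointing a run at it is a one-line filter. A minimal sketch, where the task name, script, and GPU spec are illustrative assumptions:

```yaml
type: task
name: train-on-nebius  # hypothetical name
backends: [nebius]     # restrict provisioning to the Nebius backend
commands:
  - python train.py    # placeholder training script
resources:
  gpu: H100:8          # example GPU spec
```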
AI workloads generate vast amounts of metrics, making efficient monitoring tools essential. While our recent update introduced the ability to export available metrics to Prometheus for maximum flexibility, there are times when users need quick access to essential metrics without switching to an external tool.

Previously, we introduced a CLI command that lets users view essential GPU metrics for both NVIDIA and AMD hardware. Now, with this latest update, we’re excited to announce a built-in dashboard within the dstack control plane.
As AI models grow in complexity, efficient orchestration tools become increasingly important.
Fleets, introduced by dstack last year, streamline task execution on both cloud and on-prem clusters, whether it's pre-training, fine-tuning, or batch processing.

The strength of dstack lies in its flexibility. Users can leverage distributed frameworks like `torchrun`, `accelerate`, or others. dstack handles node provisioning and job execution, and automatically propagates system environment variables, such as `DSTACK_NODE_RANK`, `DSTACK_MASTER_NODE_IP`, `DSTACK_GPUS_PER_NODE`, and others, to containers.
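For example, a two-node `torchrun` task can wire the launcher entirely from those propagated variables. A minimal sketch; the run name, script, and GPU spec are placeholders:

```yaml
type: task
name: train-distrib  # hypothetical name
nodes: 2
commands:
  - |
    # dstack sets DSTACK_NODE_RANK, DSTACK_MASTER_NODE_IP,
    # and DSTACK_GPUS_PER_NODE inside each container.
    torchrun \
      --nnodes=2 \
      --node_rank=$DSTACK_NODE_RANK \
      --nproc_per_node=$DSTACK_GPUS_PER_NODE \
      --master_addr=$DSTACK_MASTER_NODE_IP \
      --master_port=29500 \
      train.py  # placeholder training script
resources:
  gpu: H100:8  # example GPU spec
```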
One use case dstack hasn’t supported until now is MPI, as it requires a scheduled environment or direct SSH connections between containers. Since `mpirun` is essential for running NCCL/RCCL tests, which are crucial for large-scale cluster usage, we’ve added support for it.
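The sketch below shows the general shape of an MPI-style task: only the first node launches `mpirun` while the others wait. Treat the hostfile and GPU-count variables as assumptions and check the docs for the exact names:

```yaml
type: task
name: nccl-tests  # hypothetical name
nodes: 2
commands:
  - |
    # Assumption: dstack exposes the generated MPI hostfile path and the
    # total GPU count via environment variables; the names may differ.
    if [ "$DSTACK_NODE_RANK" = "0" ]; then
      mpirun --allow-run-as-root \
        --hostfile "$DSTACK_MPI_HOSTFILE" \
        -n "$DSTACK_GPUS_NUM" \
        all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
resources:
  gpu: H100:8  # example GPU spec
```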
Effective AI infrastructure management requires full visibility into compute performance and costs. AI researchers need detailed insights into container- and GPU-level performance, while managers rely on cost metrics to track resource usage across projects.
While dstack provides key metrics through its UI and the `dstack metrics` CLI, teams often need more granular data and prefer using their own monitoring tools. To support this, we’ve introduced a new endpoint that exports all collected metrics, covering both fleets and runs, to Prometheus in real time.
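On the Prometheus side this is a standard scrape job. A minimal sketch, assuming the dstack server listens on its default port 3000 and serves the metrics under /metrics; verify the exact path against the docs:

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: dstack
    metrics_path: /metrics  # assumed path
    static_configs:
      - targets: ["dstack-server:3000"]  # placeholder hostname
```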
Dev environments enable seamless provisioning of remote instances with the necessary GPU resources, automatic repository fetching, and streamlined access via SSH or a preferred desktop IDE.
Previously, support was limited to VS Code. However, as developers rely on a variety of desktop IDEs, we’ve expanded compatibility. With this update, dev environments now offer effortless access for users of Cursor.
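Switching IDEs is a one-line change in the dev environment configuration. A minimal sketch, assuming the `ide` property accepts `cursor` the same way it accepts `vscode`:

```yaml
type: dev-environment
name: cursor-dev  # hypothetical name
ide: cursor       # open the environment in Cursor instead of VS Code
resources:
  gpu: 24GB       # example GPU spec
```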
At dstack, our goal is to make AI container orchestration simpler and fully vendor-agnostic. That’s why we support not just leading cloud providers and on-prem environments but also a wide range of accelerators.

With our latest release, we’re adding support for the Intel Gaudi AI Accelerator and launching a new partnership with Intel.
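Requesting Gaudi capacity should look like requesting any other accelerator in the `gpu` spec. A sketch under that assumption; the accelerator name below may differ from the officially supported one:

```yaml
type: task
name: gaudi-train  # hypothetical name
commands:
  - python train.py  # placeholder training script
resources:
  gpu: Gaudi2:8  # assumed accelerator name; check the docs for supported names
```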
Whether you’re using cloud or on-prem compute, you may want to test your code before launching a training task or deploying a service. dstack’s dev environments make this easy by setting up a remote machine, cloning your repository, and configuring your IDE, all within a container that has GPU access.
One issue with dev environments is forgetting to stop them, or closing your laptop and leaving the GPU idle and racking up costs. With our latest update, dstack now detects inactive environments and automatically shuts them down, saving you money.
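The idle threshold is set per dev environment. A minimal sketch, assuming the property is named `inactivity_duration`:

```yaml
type: dev-environment
name: idle-safe-dev  # hypothetical name
ide: vscode
# Assumed property name: terminate after two hours
# with no attached IDE or SSH session.
inactivity_duration: 2h
resources:
  gpu: 24GB  # example GPU spec
```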
Recent breakthroughs in open-source AI have made AI infrastructure accessible beyond public clouds, driving demand for running AI workloads in on-premises data centers and private clouds. This shift gives organizations both high-performance clusters and greater flexibility and control.
However, Kubernetes, while a popular choice for traditional deployments, is often too complex and low-level to address the needs of AI teams.
Originally, dstack was focused on public clouds. With the new release, dstack extends support to data centers and private clouds, offering a simpler, AI-native solution that replaces Kubernetes and Slurm.
As demand for AI infrastructure grows, efficient, vendor-neutral orchestration tools are becoming increasingly important. At dstack, we’re committed to redefining AI container orchestration by prioritizing an AI-native, open-source-first approach.

Today, we’re excited to share a new integration and partnership with Vultr. This integration enables Vultr customers to train and deploy models on both AMD and NVIDIA GPUs with greater flexibility and efficiency, using dstack.
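Targeting Vultr follows the same backend-filter pattern as other providers. A minimal sketch; the task name, script, and GPU model are illustrative:

```yaml
type: task
name: vultr-train  # hypothetical name
backends: [vultr]  # restrict provisioning to Vultr
commands:
  - python train.py  # placeholder training script
resources:
  gpu: MI300X:1  # example AMD GPU; NVIDIA GPUs are requested the same way
```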