Supporting ARM and NVIDIA GH200 on Lambda
The latest update to dstack introduces support for NVIDIA GH200 instances on Lambda and enables ARM-powered hosts, including GH200 and GB200, with SSH fleets.
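As a quick illustration, a GH200 instance on Lambda can be provisioned with a cloud fleet configuration. Below is a minimal sketch; the fleet name is a placeholder, and the GPU spec follows dstack's usual name:count format:

```yaml
type: fleet
# Illustrative name; pick your own
name: gh200-fleet

# Provision from the Lambda backend only
backends: [lambda]

nodes: 1
resources:
  gpu: GH200:1
```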
As AI models grow in complexity, efficient orchestration tools become increasingly important. Fleets, introduced by dstack last year, streamline task execution on both cloud and on-prem clusters, whether it's pre-training, fine-tuning, or batch processing.
The strength of dstack lies in its flexibility. Users can leverage distributed frameworks such as torchrun, accelerate, or others. dstack handles node provisioning and job execution, and automatically propagates system environment variables (such as DSTACK_NODE_RANK, DSTACK_MASTER_NODE_IP, DSTACK_GPUS_PER_NODE, and others) to containers.
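For example, a multi-node task can pass these variables straight to torchrun. Here is a minimal sketch; the training script name, node count, and GPU spec are placeholders:

```yaml
type: task
name: train-distrib
# dstack provisions two nodes and starts one job per node
nodes: 2

commands:
  # dstack injects the DSTACK_* variables into every container
  - |
    torchrun \
      --nnodes=2 \
      --nproc_per_node=$DSTACK_GPUS_PER_NODE \
      --node_rank=$DSTACK_NODE_RANK \
      --master_addr=$DSTACK_MASTER_NODE_IP \
      --master_port=29500 \
      train.py

resources:
  gpu: 8
```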
One use case dstack hasn’t supported until now is MPI, as it requires a scheduled environment or direct SSH connections between containers. Since mpirun is essential for running NCCL/RCCL tests, which are crucial for large-scale cluster usage, we’ve added support for it.
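As a sketch, a two-node NCCL all-reduce test might look like the configuration below. It assumes an image where the NCCL tests binaries are prebuilt and on the PATH, and that dstack exposes an MPI hostfile via DSTACK_MPI_HOSTFILE; that variable name is our assumption, so check the release notes for the exact interface:

```yaml
type: task
name: nccl-tests
nodes: 2

commands:
  - |
    # Only the first node launches mpirun; the remaining nodes stay up as workers
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile "$DSTACK_MPI_HOSTFILE" \
        -n $((2 * DSTACK_GPUS_PER_NODE)) \
        all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

resources:
  gpu: 8
```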
Since last month, when we introduced support for private clouds and data centers, it has become easier to use dstack to orchestrate AI containers with any AI cloud vendor, whether they provide on-demand compute or reserved clusters. In this tutorial, we’ll walk you through how dstack can be used with TensorWave via SSH fleets.
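An SSH fleet needs only the addresses and SSH credentials of the reserved hosts. A minimal sketch, where the fleet name, user, key path, and IP addresses are placeholders for your TensorWave machines:

```yaml
type: fleet
name: tensorwave-fleet

# Hosts are reached directly over SSH; no cloud backend is involved
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.0.10
    - 192.168.0.11
```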
At dstack, our goal is to make AI container orchestration simpler and fully vendor-agnostic. That’s why we support not just leading cloud providers and on-prem environments but also a wide range of accelerators. With our latest release, we’re adding support for Intel Gaudi AI accelerators and launching a new partnership with Intel.
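In practice, this means a run configuration can target Gaudi the same way it targets GPUs, through the resources spec. A sketch, assuming Gaudi devices are requested via the usual gpu field (the exact device name string is our assumption):

```yaml
type: task
name: gaudi-smoke-test

commands:
  # hl-smi is Habana's device status tool, analogous to nvidia-smi
  - hl-smi

resources:
  gpu: Gaudi2:8
```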
Recent breakthroughs in open-source AI have made AI infrastructure accessible beyond public clouds, driving demand for running AI workloads in on-premises data centers and private clouds. This shift offers organizations both high-performance clusters and greater flexibility and control.
However, Kubernetes, while a popular choice for traditional deployments, is often too complex and low-level to address the needs of AI teams.
Originally, dstack was focused on public clouds. With the new release, dstack extends support to data centers and private clouds, offering a simpler, AI-native solution that replaces Kubernetes and Slurm.
At dstack, we aim to simplify the development, training, and deployment of AI models by offering an alternative to the complex Kubernetes ecosystem. Our goal is to enable seamless AI infrastructure management across any cloud or hardware vendor.
As 2024 comes to a close, we reflect on the milestones we've achieved and look ahead to the next steps.