Supporting ARM and NVIDIA GH200 on Lambda
The latest update to dstack introduces support for NVIDIA GH200 instances on Lambda and enables ARM-powered hosts, including GH200 and GB200, with SSH fleets.
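To illustrate, a minimal SSH fleet configuration pointing at ARM-based GH200 hosts might look like the sketch below. The host addresses, user, and identity file are placeholder assumptions for your own environment:

```yaml
type: fleet
name: gh200-fleet

# ARM-based GH200 hosts reachable over SSH (placeholder addresses)
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.1.10
    - 192.168.1.11
```

Once applied with `dstack apply`, the hosts become part of the fleet, and runs can target them like any other instances.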
As demand for GPU compute continues to scale, open-source tools tailored for AI workloads are becoming critical to developer velocity and efficiency. dstack is an open-source orchestrator purpose-built for AI infrastructure, offering a lightweight, container-native alternative to Kubernetes and Slurm.
Today, we’re announcing native integration with Nebius, offering a streamlined developer experience for teams using GPUs for AI workloads.
As AI models grow in complexity, efficient orchestration tools become increasingly important.
Fleets, introduced by dstack last year, streamline task execution on both cloud and on-prem clusters, whether it's pre-training, fine-tuning, or batch processing.
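For cloud clusters, a fleet is described declaratively. The sketch below is a minimal example; the name, node count, and GPU spec are assumptions to adapt to your workload:

```yaml
type: fleet
name: my-fleet

# Provision two interconnected cloud instances
nodes: 2
placement: cluster

resources:
  gpu: 24GB
```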
The strength of dstack lies in its flexibility. Users can leverage distributed frameworks such as torchrun, accelerate, or others. dstack handles node provisioning and job execution, and automatically propagates system environment variables, such as DSTACK_NODE_RANK, DSTACK_MASTER_NODE_IP, DSTACK_GPUS_PER_NODE, and others, to containers.
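As a sketch of how these variables are typically consumed, the distributed task below forwards them to torchrun. The image, script, GPU spec, and the DSTACK_NODES_NUM variable (beyond those named above) are illustrative assumptions:

```yaml
type: task
name: train-distrib

# Run one job per node across two nodes
nodes: 2

image: nvcr.io/nvidia/pytorch:24.07-py3
commands:
  # dstack injects the DSTACK_* variables into each container
  - torchrun
    --nproc_per_node=$DSTACK_GPUS_PER_NODE
    --nnodes=$DSTACK_NODES_NUM
    --node_rank=$DSTACK_NODE_RANK
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=29500
    train.py

resources:
  gpu: 24GB
```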
One use case dstack hasn’t supported until now is MPI, as it requires a scheduled environment or direct SSH connections between containers. Since mpirun is essential for running NCCL/RCCL tests, which are crucial for large-scale cluster usage, we’ve added support for it.
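The sketch below shows the shape of such a run: the first node drives mpirun while the others stay up as workers. The hostfile variable, the GPU count variable, and the NCCL-tests binary are assumptions to adapt to your setup:

```yaml
type: task
name: nccl-tests

nodes: 2

commands:
  - |
    # Only the first node launches mpirun; the rest idle as SSH targets
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      mpirun \
        --hostfile "$DSTACK_MPI_HOSTFILE" \
        -n "$DSTACK_GPUS_NUM" \
        all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

resources:
  gpu: A100:8
```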
Amazon Elastic Fabric Adapter (EFA) is a high-performance network interface designed for AWS EC2 instances, enabling ultra-low latency and high-throughput communication between nodes. This makes it an ideal solution for scaling distributed training workloads across multiple GPUs and instances.
With the latest release of dstack, you can now leverage AWS EFA to supercharge your distributed training tasks.
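As a hedged example, a fleet that requests an EFA-capable instance type could look like this; the instance type and node count are assumptions:

```yaml
type: fleet
name: efa-fleet

# Interconnected placement so nodes communicate over EFA
nodes: 2
placement: cluster

backends: [aws]
# p4d.24xlarge is one EFA-capable instance type (illustrative)
instance_types: [p4d.24xlarge]
```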
As demand for AI infrastructure grows, the need for efficient, vendor-neutral orchestration tools is becoming increasingly important. At dstack, we’re committed to redefining AI container orchestration by prioritizing an AI-native, open-source-first approach.
Today, we’re excited to share a new integration and partnership with Vultr. This integration enables Vultr customers to train and deploy models on both AMD and NVIDIA GPUs with greater flexibility and efficiency using dstack.
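For instance, once the Vultr backend is configured, targeting an AMD GPU comes down to the resource spec. The image, command, and GPU model below are illustrative assumptions:

```yaml
type: task
name: amd-check

backends: [vultr]

# A ROCm-enabled image (illustrative)
image: rocm/pytorch
commands:
  # Confirm the AMD GPU is visible inside the container
  - rocm-smi

resources:
  gpu: MI300X:1
```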
At dstack, we aim to simplify AI model development, training, and deployment by offering an alternative to the complex Kubernetes ecosystem. Our goal is to enable seamless AI infrastructure management across any cloud or hardware vendor.
As 2024 comes to a close, we reflect on the milestones we've achieved and look ahead to the next steps.