2025

May 22, 2025
in Case studies, NVIDIA
3 min read

Case study: how EA uses dstack to fast-track AI development

At NVIDIA GTC 2025, Electronic Arts shared how they’re scaling AI development and managing infrastructure across teams. They highlighted using tools like dstack to provision GPUs quickly, flexibly, and cost-efficiently. This case study summarizes key insights from their talk.

EA has more than 100 AI projects running, and the number keeps growing. Many teams have AI needs (game dev, ML engineers, AI researchers, and platform teams), all supported by a central tech team. Some need full MLOps support; others have in-house expertise but need flexible tooling and infrastructure.

May 12, 2025
in ARM, Cloud fleets, SSH fleets
3 min read

Supporting ARM and NVIDIA GH200 on Lambda

The latest update to dstack introduces support for NVIDIA GH200 instances on Lambda and enables ARM-powered hosts, including GH200 and GB200, with SSH fleets.

April 11, 2025
in Cloud fleets, NVIDIA
2 min read

Supporting GPU provisioning and orchestration on Nebius

As demand for GPU compute continues to scale, open-source tools tailored for AI workloads are becoming critical to developer velocity and efficiency. dstack is an open-source orchestrator purpose-built for AI infrastructure—offering a lightweight, container-native alternative to Kubernetes and Slurm.

Today, we’re announcing native integration with Nebius, offering a streamlined developer experience for teams using GPUs for AI workloads.

April 3, 2025
in Metrics, AMD, NVIDIA
2 min read

Built-in UI for monitoring essential GPU metrics

AI workloads generate vast amounts of metrics, making it essential to have efficient monitoring tools. While our recent update introduced the ability to export available metrics to Prometheus for maximum flexibility, there are times when users need to quickly access essential metrics without the need to switch to an external tool.

Previously, we introduced a CLI command that allows users to view essential GPU metrics for both NVIDIA and AMD hardware. Now, with this latest update, we’re excited to announce the addition of a built-in dashboard within the dstack control plane.

April 2, 2025
in SSH fleets, Cloud fleets
2 min read

Supporting MPI and NCCL/RCCL tests

As AI models grow in complexity, efficient orchestration tools become increasingly important. Fleets introduced by dstack last year streamline task execution on both cloud and on-prem clusters, whether it's pre-training, fine-tuning, or batch processing.

The strength of dstack lies in its flexibility. Users can leverage distributed frameworks such as torchrun, accelerate, or others. dstack handles node provisioning and job execution, and automatically propagates system environment variables (such as DSTACK_NODE_RANK, DSTACK_MASTER_NODE_IP, DSTACK_GPUS_PER_NODE, and others) to containers.
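As a rough illustration of how these variables might be consumed inside a container, the sketch below builds a torchrun command line from them. It is a minimal sketch under stated assumptions, not dstack's own launcher: the helper name build_torchrun_command, the script name train.py, the port 29500, and the explicit nnodes argument are all hypothetical and chosen only for the example. Only the three DSTACK_* variable names come from the text above.

```python
import os


def build_torchrun_command(env, nnodes, script="train.py", master_port=29500):
    """Assemble a torchrun argument list from dstack-provided variables.

    `env` is any mapping that holds the DSTACK_* variables (for example
    os.environ). `nnodes`, `script`, and `master_port` are illustrative
    parameters, not values supplied by dstack in this sketch.
    """
    # Convert to int so that malformed values fail early with a ValueError.
    gpus_per_node = int(env["DSTACK_GPUS_PER_NODE"])
    node_rank = int(env["DSTACK_NODE_RANK"])
    master_ip = env["DSTACK_MASTER_NODE_IP"]

    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_ip}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Sample values standing in for what a container might see on node 1
    # of a two-node run; real values would be read from os.environ.
    sample_env = {
        "DSTACK_NODE_RANK": "1",
        "DSTACK_MASTER_NODE_IP": "10.0.0.5",
        "DSTACK_GPUS_PER_NODE": "8",
    }
    print(" ".join(build_torchrun_command(sample_env, nnodes=2)))
```

Taking the environment as a plain mapping keeps the helper easy to test with sample values; in a real container one would pass os.environ instead.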

One use case dstack hasn’t supported until now is MPI, as it requires a scheduled environment or direct SSH connections between containers. Since mpirun is essential for running NCCL/RCCL tests—crucial for large-scale cluster usage—we’ve added support for it.

April 1, 2025
in Metrics, NVIDIA
2 min read

Exporting GPU, cost, and other metrics to Prometheus

Effective AI infrastructure management requires full visibility into compute performance and costs. AI researchers need detailed insights into container- and GPU-level performance, while managers rely on cost metrics to track resource usage across projects.

While dstack provides key metrics through its UI and the dstack metrics CLI command, teams often need more granular data and prefer using their own monitoring tools. To support this, we’ve introduced a new endpoint that allows real-time export of all collected metrics, covering fleets and runs, directly to Prometheus.

March 31, 2025
in Dev environments
2 min read

Accessing dev environments with Cursor

Dev environments enable seamless provisioning of remote instances with the necessary GPU resources, automatic repository fetching, and streamlined access via SSH or a preferred desktop IDE.

Previously, support was limited to VS Code. However, as developers rely on a variety of desktop IDEs, we’ve expanded compatibility. With this update, dev environments now offer effortless access for users of Cursor.

March 18, 2025
in Benchmarks, AMD, NVIDIA
6 min read

DeepSeek R1 inference performance: MI300X vs. H200

DeepSeek-R1, with its innovative architecture combining Multi-head Latent Attention (MLA) and DeepSeekMoE, presents unique challenges for inference workloads. As a reasoning-focused model, it generates intermediate chain-of-thought outputs, placing significant demands on memory capacity and bandwidth.

In this benchmark, we evaluate the performance of three inference backends—SGLang, vLLM, and TensorRT-LLM—on two hardware configurations: 8x NVIDIA H200 and 8x AMD MI300X. Our goal is to compare throughput, latency, and overall efficiency to determine the optimal backend and hardware pairing for DeepSeek-R1's demanding requirements.

This benchmark was made possible through the generous support of our partners at Vultr and Lambda, who provided access to the necessary hardware.

March 11, 2025
in AMD, SSH fleets
3 min read

Using SSH fleets with TensorWave's private AMD cloud

Since last month, when we introduced support for private clouds and data centers, it has become easier to use dstack to orchestrate AI containers with any AI cloud vendor, whether they provide on-demand compute or reserved clusters.

In this tutorial, we’ll walk you through how dstack can be used with TensorWave using SSH fleets.

February 21, 2025
in Intel Gaudi, SSH fleets
3 min read

Supporting Intel Gaudi AI accelerators with SSH fleets

At dstack, our goal is to make AI container orchestration simpler and fully vendor-agnostic. That’s why we support not just leading cloud providers and on-prem environments but also a wide range of accelerators.

With our latest release, we’re adding support for Intel Gaudi AI accelerators and launching a new partnership with Intel.