Blog

Supporting Intel Gaudi AI accelerators

At dstack, our goal is to make AI container orchestration simpler and fully vendor-agnostic. That’s why we support not just leading cloud providers and on-prem environments but also a wide range of accelerators.

With our latest release, we’re adding support for Intel Gaudi AI accelerators and launching a new partnership with Intel.
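
For illustration, a task that targets Gaudi accelerators can declare them through the usual resources block. The sketch below is only a rough example; the accelerator name gaudi2 and the count are assumptions, so check the documentation for the exact spelling.

type: task
name: gaudi-test

commands:
  - python train.py

resources:
  # Assumption: Gaudi accelerators are requested by name via the gpu spec
  gpu: gaudi2:8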

Efficient distributed training with AWS EFA

Amazon Elastic Fabric Adapter (EFA) is a high-performance network interface designed for AWS EC2 instances, enabling ultra-low latency and high-throughput communication between nodes. This makes it an ideal solution for scaling distributed training workloads across multiple GPUs and instances.

With the latest release of dstack, you can now leverage AWS EFA to supercharge your distributed training tasks.
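
As a rough sketch, a multi-node training task might look like the following; the DSTACK_* environment variables and the GPU spec are assumptions, and dstack is expected to pick EFA-enabled networking automatically on supported AWS instance types.

type: task
name: train-distrib

# Run the task across two nodes
nodes: 2

commands:
  # Rendezvous settings come from environment variables injected by dstack (assumed names)
  - torchrun --nnodes=$DSTACK_NODES_NUM --nproc_per_node=$DSTACK_GPUS_PER_NODE --node_rank=$DSTACK_NODE_RANK --master_addr=$DSTACK_MASTER_NODE_IP --master_port=12345 train.py

backends: [aws]

resources:
  gpu: H100:8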

Auto-shutdown for inactive dev environments—no idle GPUs

Whether you’re using cloud or on-prem compute, you may want to test your code before launching a training task or deploying a service. dstack’s dev environments make this easy by setting up a remote machine, cloning your repository, and configuring your IDE, all within a container that has GPU access.

A common issue with dev environments is forgetting to stop them or simply closing your laptop, which leaves the GPU idle and racking up costs. With our latest update, dstack now detects inactive environments and automatically shuts them down, saving you money.
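
For example, a dev environment that shuts itself down after a period of inactivity might be configured like this; the inactivity_duration option name is an assumption, and the GPU spec is a placeholder.

type: dev-environment
name: vscode-dev

ide: vscode

# Assumption: shut the environment down after two hours without activity
inactivity_duration: 2h

resources:
  gpu: 24GB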

Orchestrating GPUs in data centers and private clouds

Recent breakthroughs in open-source AI have made AI infrastructure accessible beyond public clouds, driving demand for running AI workloads in on-premises data centers and private clouds. This shift gives organizations high-performance clusters along with greater flexibility and control.

However, Kubernetes, while a popular choice for traditional deployments, is often too complex and low-level to address the needs of AI teams.

Originally, dstack was focused on public clouds. With the new release, dstack extends support to data centers and private clouds, offering a simpler, AI-native solution that replaces Kubernetes and Slurm.
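
For on-prem machines, this typically means pointing dstack at your hosts over SSH. The fleet sketch below uses placeholder addresses and credentials; adjust them to your environment.

type: fleet
name: on-prem-fleet

# Placeholder hosts and SSH credentials for illustration
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.1.10
    - 192.168.1.11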

Supporting NVIDIA and AMD accelerators on Vultr

As demand for AI infrastructure grows, the need for efficient, vendor-neutral orchestration tools is becoming increasingly important. At dstack, we’re committed to redefining AI container orchestration by prioritizing an AI-native, open-source-first approach. Today, we’re excited to share a new integration and partnership with Vultr.

This new integration enables Vultr customers to train and deploy models on both AMD and NVIDIA GPUs with greater flexibility and efficiency using dstack.
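
As a minimal sketch, pinning a run to Vultr and picking a GPU could look like this; the backend filter and GPU name are placeholders for illustration.

type: task
name: finetune

commands:
  - python train.py

# Assumption: restrict provisioning to the Vultr backend
backends: [vultr]

resources:
  gpu: MI300X:1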

Get involved as a community ambassador

As we wrap up an exciting year at dstack, we’re thrilled to introduce our Ambassador Program. This initiative invites AI infrastructure enthusiasts and those passionate about open-source AI to share their knowledge, contribute to the growth of the dstack community, and play a key role in advancing the open AI ecosystem.

Exploring inference memory saturation effect: H100 vs MI300x

GPU memory plays a critical role in LLM inference, affecting both performance and cost. This benchmark evaluates memory saturation’s impact on inference using NVIDIA's H100 and AMD's MI300x with Llama 3.1 405B FP8.

We examine the effect of limited parallel computational resources on throughput and Time to First Token (TTFT). Additionally, we compare deployment strategies: running two Llama 3.1 405B FP8 replicas on 4xMI300x versus a single replica on 4xMI300x and on 8xMI300x.

Finally, we extrapolate performance projections for upcoming GPUs such as NVIDIA's H200 and B200, and AMD's MI325x and MI350x.
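
For context, serving multiple replicas behind a single endpoint in dstack is expressed roughly as in the sketch below; the model ID, GPU spec, and vLLM flags are placeholders rather than the exact benchmark setup.

type: service
name: llama-405b

env:
  - HF_TOKEN
commands:
  - pip install vllm
  - vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 4
port: 8000

# Two replicas, each on its own set of GPUs
replicas: 2

resources:
  gpu: MI300X:4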

This benchmark is made possible through the generous support of our friends at Hot Aisle and Lambda, who provided high-end hardware.

Introducing instance volumes to persist data on instances

Until now, dstack supported data persistence only through network volumes managed by cloud providers. While convenient, sometimes you may want a simple cache on the instance or to mount an NFS share to your SSH fleet. To address this, we're introducing instance volumes, which work for both cases.
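
For example, the task below mounts a directory on the instance into the container, so pip and Hugging Face downloads are cached across runs on the same instance: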

type: task 
name: llama32-task

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct
commands:
  - pip install vllm
  - vllm serve $MODEL_ID --max-model-len 4096
ports: [8000]

volumes:
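  # Instance volume: maps a directory on the instance to the container's cache directory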
  - /root/.dstack/cache:/root/.cache

resources:
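  # Any GPU with at least 16GB of memory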
  gpu: 16GB..