As we wrap up an exciting year at dstack, we’re thrilled to introduce our Ambassador Program. This initiative invites AI
infrastructure enthusiasts and those passionate about open-source AI to share their knowledge, contribute to the growth
of the dstack community, and play a key role in advancing the open AI ecosystem.
At dstack, we aim to simplify the development, training, and deployment of AI models by offering an
alternative to the complex Kubernetes ecosystem. Our goal is to enable seamless AI infrastructure management across any
cloud or hardware vendor.
As 2024 comes to a close, we reflect on the milestones we've achieved and look ahead to the next steps.
GPU memory plays a critical role in LLM inference, affecting both performance and cost. This benchmark evaluates memory
saturation's impact on inference using NVIDIA's H100 and AMD's MI300X with Llama 3.1 405B FP8.
We examine the effect of limited parallel computational resources on throughput and Time to First Token (TTFT).
Additionally, we compare deployment strategies: running two Llama 3.1 405B FP8 replicas, each on 4xMI300X, versus a single
replica on 4xMI300X and on 8xMI300X.
Finally, we extrapolate performance projections for upcoming GPUs like NVIDIA's H200 and B200, and AMD's MI325X and MI350X.
This benchmark is made possible through the generous support of our friends at Hot Aisle and Lambda,
who provided high-end hardware.
Until now, dstack supported data persistence only with network volumes managed by clouds. While convenient, sometimes you
might want to use a simple cache on the instance or mount an NFS share to your SSH fleet. To address this, we're now
introducing instance volumes, which cover both cases.
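Here's a minimal sketch of what this looks like in a run configuration; the cache path and mapping below are illustrative placeholders rather than a prescription:

```yaml
type: task
name: train

# An instance volume maps a path on the host (for example, a local cache
# directory or an NFS mount on an SSH fleet) to a path inside the container.
# Both paths below are placeholders.
volumes:
  - /dstack-cache/pip:/root/.cache/pip

commands:
  - pip install -r requirements.txt
  - python train.py
```

Because the data lives on the instance itself, it persists across runs that land on the same machine, which is exactly what you want for caches.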
To run containers with dstack, you can use your own Docker image (or the default one) without needing to interact with
Docker directly. However, some existing code may require direct use of Docker or Docker Compose. That's why, in our
latest release, we've added this option.
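Here's a rough sketch of how this can look; treat the Docker-in-Docker image name and the start-dockerd helper command as assumptions and check the docs for the exact configuration:

```yaml
type: task
name: docker-compose-task

# Run the container in privileged mode with a Docker-in-Docker image so that
# Docker and Docker Compose are available inside the run.
# The image name and the start-dockerd helper below are assumptions.
privileged: true
image: dstackai/dind

commands:
  - start-dockerd
  - docker compose up
```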
While it's possible to use third-party monitoring tools with dstack, it is often more convenient to debug your run and
track metrics out of the box. That's why, with the latest release, dstack introduced dstack stats, a new CLI (and API)
for monitoring container metrics, including GPU usage for NVIDIA, AMD, and other accelerators.
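For example, once a run is live you can follow its metrics from the terminal; the run name below is a placeholder and the exact options may vary:

```shell
# Show container metrics (CPU, memory, and GPU utilization) for a run.
dstack stats my-run
```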
At dstack, we've been adding support for AMD GPUs with SSH fleets, so we saw this as a great opportunity to put that
integration to the test. Our friends at Hot Aisle, who build top-tier bare metal compute for AMD GPUs, kindly provided
the hardware for the benchmark.
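For reference, an SSH fleet is defined by listing the hosts dstack should connect to; here's a minimal sketch with placeholder addresses, user, and key path:

```yaml
type: fleet
name: amd-fleet

# Hosts with AMD GPUs that dstack connects to over SSH.
# The user, key path, and addresses are placeholders.
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.0.10
    - 192.168.0.11
```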
If you're using or planning to use TPUs with Google Cloud, you can now do so via dstack. Just specify the TPU version and
the number of cores (separated by a dash) in the gpu property under resources.
Read below to find out how to use TPUs with dstack for fine-tuning and deploying LLMs, leveraging open-source tools like
Hugging Face's Optimum TPU and vLLM.
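For example, a v5litepod with 8 cores can be requested like this (the specific TPU type and the command are illustrative):

```yaml
type: task
name: tpu-task

# The TPU version and the number of cores go into the gpu property,
# separated by a dash. The value below is a placeholder.
resources:
  gpu: v5litepod-8

commands:
  - python train.py
```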
While dstack helps streamline the orchestration of containers for AI, its primary goal is to offer vendor independence
and portability, ensuring compatibility across different hardware and cloud providers.
Inspired by the recent MI300X benchmarks, we are pleased to announce that RunPod is the first cloud provider to offer
AMD GPUs through dstack, with support for other cloud providers and on-prem servers to follow.
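To give a feel for what this unlocks, requesting an MI300X on RunPod comes down to the gpu spec; the configuration below is an illustrative sketch (backend, image, and GPU spec are placeholders), not an excerpt from the post:

```yaml
type: dev-environment
name: amd-playground

# Target the runpod backend and request an AMD MI300X.
backends: [runpod]
image: rocm/pytorch:latest
ide: vscode

resources:
  gpu: MI300X
```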
Deploying custom models in the cloud often runs into the challenge of cold start times, which include the time to provision
a new instance and download the model. This is especially relevant for services with autoscaling, where new model replicas
need to be provisioned quickly.
Let's explore how dstack optimizes this process using volumes, with an example of
deploying a model on RunPod.
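As a preview, the idea is to attach a network volume and keep the model weights on it so that new replicas skip the download step. Here's a minimal sketch; the model, image, volume name, paths, and GPU spec are placeholders:

```yaml
type: service
name: llama-service

# Point the Hugging Face cache at the attached volume so newly provisioned
# replicas reuse the downloaded weights instead of fetching them again.
# All names and paths below are placeholders.
image: vllm/vllm-openai:latest
env:
  - HF_HOME=/data/hf-cache
commands:
  - vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
port: 8000

volumes:
  - name: llama-volume   # network volume created on RunPod
    path: /data

resources:
  gpu: 48GB
```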