Benchmarks¶

September 25, 2025
in Benchmarks
6 min read

Benchmarking Prefill–Decode ratios: fixed vs dynamic

This benchmark investigates whether the Prefill–Decode worker ratio needs to be managed dynamically at runtime, or if a fixed split can deliver the same performance with simpler orchestration.
We evaluate different ratios across workload profiles and concurrency levels to measure their impact on TTFT, ITL, and throughput, and to see whether fixing the ratio in advance is a practical alternative to dynamic adjustment.

July 22, 2025
in Benchmarks
6 min read

Benchmarking AMD GPUs: bare-metal, VMs

This is the first in our series of benchmarks exploring the performance of AMD GPUs in virtualized versus bare-metal environments. As cloud infrastructure increasingly relies on virtualization, a key question arises: can VMs match bare-metal performance for GPU-intensive tasks? For this initial investigation, we focus specifically on a single-GPU setup, comparing a containerized workload on a VM against a bare-metal server, both equipped with the powerful AMD MI300X GPU.

July 15, 2025
in Benchmarks
9 min read

Benchmarking AMD GPUs: bare-metal, containers, partitions

Our new benchmark explores two important areas for optimizing AI workloads on AMD GPUs: First, do containers introduce a performance penalty for network-intensive tasks compared to a bare-metal setup? Second, how does partitioning a powerful GPU like the MI300X affect its real-world performance for different types of AI workloads?

March 18, 2025
in Benchmarks
6 min read

DeepSeek R1 inference performance: MI300X vs. H200

DeepSeek-R1, with its innovative architecture combining Multi-head Latent Attention (MLA) and DeepSeekMoE, presents unique challenges for inference workloads. As a reasoning-focused model, it generates intermediate chain-of-thought outputs, placing significant demands on memory capacity and bandwidth.

In this benchmark, we evaluate the performance of three inference backends—SGLang, vLLM, and TensorRT-LLM—on two hardware configurations: 8x NVIDIA H200 and 8x AMD MI300X. Our goal is to compare throughput, latency, and overall efficiency to determine the optimal backend and hardware pairing for DeepSeek-R1's demanding requirements.

This benchmark was made possible through the generous support of our partners at Vultr and Lambda , who provided access to the necessary hardware.

December 5, 2024
in Benchmarks
6 min read

Exploring inference memory saturation effect: H100 vs MI300x

GPU memory plays a critical role in LLM inference, affecting both performance and cost. This benchmark evaluates memory saturation’s impact on inference using NVIDIA's H100 and AMD's MI300x with Llama 3.1 405B FP8.

We examine the effect of limited parallel computational resources on throughput and Time to First Token (TTFT). Additionally, we compare deployment strategies: running two Llama 3.1 405B FP8 replicas on 4xMI300x versus a single replica on 4xMI300x and 8xMI300x

Finally, we extrapolate performance projections for upcoming GPUs like NVIDIA H200, B200, and AMD MI325x, MI350x.

This benchmark is made possible through the generous support of our friends at Hot Aisle and Lambda , who provided high-end hardware.

October 9, 2024
in Benchmarks
6 min read

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs

At dstack, we've been adding support for AMD GPUs with SSH fleets, so we saw this as a great chance to test our integration by benchmarking AMD GPUs. Our friends at Hot Aisle , who build top-tier bare metal compute for AMD GPUs, kindly provided the hardware for the benchmark.