
DeepSeek R1 inference performance: MI300X vs. H200

DeepSeek-R1, with its innovative architecture combining Multi-head Latent Attention (MLA) and DeepSeekMoE, presents unique challenges for inference workloads. As a reasoning-focused model, it generates intermediate chain-of-thought outputs, placing significant demands on memory capacity and bandwidth.

In this benchmark, we evaluate the performance of three inference backends—SGLang, vLLM, and TensorRT-LLM—on two hardware configurations: 8x NVIDIA H200 and 8x AMD MI300X. Our goal is to compare throughput, latency, and overall efficiency to determine the optimal backend and hardware pairing for DeepSeek-R1's demanding requirements.

This benchmark was made possible through the generous support of our partners at Vultr and Lambda, who provided access to the necessary hardware.

Benchmark setup

Hardware configurations

  1. AMD 8xMI300X
    • 2x Intel Xeon Platinum 8468, 48C/96T, 16GT/s, 105MB Cache (350W)
    • 8x AMD MI300X GPU, 192GB, 750W
    • 32x 64GB DDR5, 4800MT/s
  2. NVIDIA 8xH200 SXM5
    • 2x Intel Xeon Platinum 8570, 56C/112T, 20GT/s, 300MB Cache (350W)
    • 8x NVIDIA H200 SXM5 GPU, 141GB, 700W
    • 32x 64GB DDR5, 5600MT/s

Benchmark methodology

Online inference

We used SGLang's bench_serving.py script for DeepSeek-R1, modified to add support for TensorRT-LLM.

Tests were conducted across multiple request concurrencies and output token lengths, with input token length fixed at 3200.

| Request Concurrencies | Output Token Lengths | Prefix-Cached |
| --- | --- | --- |
| 4, 8, 16, ..., 128 | 800 | No |
| 128 | 1600, 3200, 6400 | No |
| 128 | 800 | Yes |

To test prefix caching, about 62.5% of each ~3200-token prompt (2000 of its 3200 tokens) is a prefix repeated across requests.
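The sketch below shows one way such prompts can be constructed (an illustration, not the exact benchmark code): a fixed 2000-token prefix shared by all requests, followed by a unique ~1200-token suffix per request.

```python
# Sketch: prompts with a shared 2000-token prefix and a unique ~1200-token
# suffix, approximating the 62.5% prefix-cache ratio used in the benchmark.
# Requires `transformers`; helper names are illustrative.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")

def random_tokens(n: int) -> list[int]:
    # Arbitrary token IDs from the vocabulary serve as filler text.
    return [random.randrange(tokenizer.vocab_size) for _ in range(n)]

shared_prefix = random_tokens(2000)            # cached portion (~62.5%)

def make_prompt() -> str:
    unique_suffix = random_tokens(1200)        # per-request portion (~37.5%)
    # Decoding random IDs and re-tokenizing may not give exactly 3200 tokens,
    # hence "about" 62.5% shared.
    return tokenizer.decode(shared_prefix + unique_suffix)

prompts = [make_prompt() for _ in range(128)]  # one prompt per concurrent request
```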

Offline inference

For offline inference, we used vLLM’s benchmark_throughput.py, modified for SGLang. TensorRT-LLM was tested using a custom benchmark_throughput_trt.py. The benchmark examined performance across various batch sizes and output token lengths.

| Batch Sizes | Output Token Lengths |
| --- | --- |
| 32, 64, 128, ..., 1024 | 800 |
| 256, 512, 1024 | 1600 |
| 256, 512, 1024 | 3200 |
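The measurement itself is simple; below is a minimal sketch of the sweep using vLLM's offline LLM API (the actual benchmark_throughput.py scripts handle prompt construction, warm-up, and per-backend details more carefully than this).

```python
# Minimal sketch of the offline throughput sweep with vLLM's offline API:
# for each (batch size, output length) pair, throughput is the number of
# generated tokens divided by wall-clock time.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8,
          trust_remote_code=True)

BATCH_SIZES = [32, 64, 128, 256, 512, 1024]
OUTPUT_LENS = [800, 1600, 3200]

for output_len in OUTPUT_LENS:
    # The longer output lengths were only run at batch sizes 256-1024.
    sizes = BATCH_SIZES if output_len == 800 else [256, 512, 1024]
    params = SamplingParams(max_tokens=output_len, ignore_eos=True)
    for batch_size in sizes:
        prompts = ["..."] * batch_size  # fixed-length synthetic prompts in practice
        start = time.perf_counter()
        outputs = llm.generate(prompts, params)
        elapsed = time.perf_counter() - start
        generated = sum(len(o.outputs[0].token_ids) for o in outputs)
        print(f"batch={batch_size} out={output_len}: {generated / elapsed:.0f} tokens/s")
```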

Key observations

Throughput and End-to-End Latency

NVIDIA H200 performance

  • TensorRT-LLM outperformed both vLLM and SGLang, achieving the highest online throughput of 4176 tokens/s on H200.
  • At concurrencies below 128, vLLM led in online throughput and end-to-end latency.
  • In offline scenarios, H200 achieved the highest overall throughput of 6311 tokens/s with SGLang.

AMD MI300X performance

  • vLLM outperformed SGLang in both online and offline throughput and end-to-end latency.
  • MI300X with vLLM achieved the highest overall throughput of 4574 tokens/s in online scenarios.
  • At request concurrencies below 32, SGLang outperformed vLLM in online throughput and latency.

While MI300X's larger memory capacity and higher bandwidth should theoretically enable higher throughput at larger batch sizes, the results suggest that inference backends for MI300X may require further optimization to fully leverage its architectural advantages.
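A back-of-envelope calculation illustrates that headroom (approximate figures, not measurements): DeepSeek-R1 has roughly 671B parameters stored natively in FP8, i.e. on the order of 84 GB of weights per GPU when sharded across eight GPUs.

```python
# Back-of-envelope KV-cache headroom per GPU (FP8 weights only; activations,
# graphs, and framework overhead ignored).
PARAMS_B = 671                       # DeepSeek-R1 total parameters, in billions
weights_gb = PARAMS_B * 1.0          # ~1 byte per parameter in FP8
per_gpu_weights_gb = weights_gb / 8  # sharded across 8 GPUs

for name, hbm_gb in [("MI300X", 192), ("H200", 141)]:
    print(f"{name}: ~{hbm_gb - per_gpu_weights_gb:.0f} GB free for KV cache per GPU")
# MI300X: ~108 GB vs. H200: ~57 GB, roughly twice the room for KV cache,
# which is why larger batches should, in principle, favor MI300X.
```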

Throughput and Latency vs. Output Token Length

NVIDIA H200 performance

  • SGLang delivered slightly higher throughput and better latency as output token length increased in online scenarios.
  • In offline scenarios, SGLang with H200 outperformed MI300X as output token length increased.

AMD MI300X performance

vLLM maintained the lead in both online and offline scenarios as output token length increased.

Time to First Token (TTFT)

NVIDIA H200 performance

TensorRT-LLM maintained the lowest and most consistent TTFT up to concurrency 64.

AMD MI300X performance

vLLM achieved the lowest TTFT at concurrency 128. Below 128, vLLM and SGLang had similar TTFT.

TTFT is dominated by compute-bound prefill, which highlights H200's advantage and aligns with SemiAnalysis’s MI300X vs. H200 TFLOPS benchmark. However, at 128 concurrent requests, MI300X's memory capacity and bandwidth advantages become evident.
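A first-order estimate shows why: prefill cost is roughly 2 FLOPs per active parameter per prompt token, so TTFT scales with the compute a GPU actually sustains rather than with its memory bandwidth. The sustained-TFLOPS figure below is an assumption chosen purely for illustration.

```python
# Rough lower bound on TTFT for one 3200-token prompt.
# Approximation: prefill FLOPs ~ 2 x active parameters x prompt tokens
# (attention/MLA cost and scheduling overhead ignored).
ACTIVE_PARAMS = 37e9          # DeepSeek-R1 activates ~37B parameters per token
PROMPT_TOKENS = 3200
SUSTAINED_TFLOPS = 8 * 1000   # assumed sustained FP8 TFLOPS across 8 GPUs (illustrative)

prefill_flops = 2 * ACTIVE_PARAMS * PROMPT_TOKENS          # ~2.4e14 FLOPs
ttft_floor_ms = prefill_flops / (SUSTAINED_TFLOPS * 1e12) * 1e3
print(f"TTFT floor: ~{ttft_floor_ms:.0f} ms")              # ~30 ms with these numbers
# The floor shrinks in proportion to sustained TFLOPS, which is why achieved
# compute (not paper specs) decides who wins on TTFT at low concurrency.
```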

Time Per Output Token (TPOT)

NVIDIA H200 performance

vLLM maintained the lowest TPOT across all request concurrencies.

AMD MI300X performance

SGLang delivered the lowest TPOT up to concurrency 32. Beyond that, vLLM took the lead.

Given that TPOT is memory-bandwidth-bound, MI300X should show a stronger advantage here once the backends are further optimized.
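The decode-side counterpart of the same estimate (again an illustration using vendor peak bandwidth, not a measurement): at small batch sizes each decode step streams roughly the active FP8 weights once, so the per-token floor is set by aggregate HBM bandwidth, where MI300X leads on paper.

```python
# First-order TPOT floor at small batch sizes: each decode step reads roughly
# the ~37 GB of active FP8 weights once (KV-cache reads ignored here).
ACTIVE_WEIGHTS_GB = 37        # ~37B active parameters x ~1 byte (FP8)

for name, per_gpu_tb_s in [("8x MI300X", 5.3), ("8x H200", 4.8)]:
    aggregate_gb_s = 8 * per_gpu_tb_s * 1000
    tpot_floor_ms = ACTIVE_WEIGHTS_GB / aggregate_gb_s * 1e3
    print(f"{name}: TPOT floor ~{tpot_floor_ms:.2f} ms/token")
# ~0.87 ms (MI300X) vs. ~0.96 ms (H200): the ~10% bandwidth edge, on top of
# the larger KV-cache headroom, is why decode should favor MI300X once the
# kernels are well optimized.
```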

TTFT vs. Output Token Length

NVIDIA H200 performance

  • SGLang demonstrated stable TTFT across increasing output token lengths.
  • vLLM and TensorRT-LLM showed significant increases in TTFT as output token length grew, likely due to KV cache memory pressure.

AMD MI300X performance

Both vLLM and SGLang demonstrated stable TTFT across increasing output token lengths, with vLLM maintaining lower TTFT.

TPOT vs. Output Token Length

NVIDIA H200 performance

SGLang and TensorRT-LLM demonstrated stable TPOT across increasing output token lengths.

vLLM maintained the lowest TPOT up to 3200 tokens but showed a sudden increase at 6400 tokens, likely due to memory pressure.

AMD MI300X performance

Both SGLang and vLLM demonstrated stable TPOT across increasing output token lengths, with vLLM maintaining the lowest TPOT.

Prefix caching

NVIDIA H200 performance

With prefix caching enabled, vLLM outperformed SGLang in online throughput, TTFT, and end-to-end latency. However, vLLM's TPOT increased relative to the run without prefix caching, which warrants further investigation.
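For context, prefix caching is an engine-level switch in both backends; the sketch below shows how it is toggled through their Python APIs (argument names follow the releases we tested and may change, so treat this as a hint rather than a reference).

```python
# Sketch: enabling/disabling prefix caching (shown together for brevity;
# in practice one backend runs at a time).
from vllm import LLM
import sglang as sgl

# vLLM: prefix caching is opt-in.
vllm_engine = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    trust_remote_code=True,
    enable_prefix_caching=True,
)

# SGLang: the RadixAttention prefix cache is on by default; pass
# disable_radix_cache=True (or --disable-radix-cache on the server)
# to turn it off.
sglang_engine = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-R1",
    tp_size=8,
)
```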

Limitations

  1. The offline benchmark results for TensorRT-LLM were obtained using the DeepSeek-R1 model engine built from the deepseek branch. However, the TensorRT-LLM team recommends the PyTorch-based flow for deployment.
  2. The impact of dynamic batching on inference efficiency was not tested.
  3. vLLM's prefix caching support for MI300X is a work in progress and can be tracked here.
  4. All inference backends are still being actively optimized for DeepSeek-R1, so these results reflect performance only at the time of testing. Performance for all backends is expected to improve as the backend teams land further optimizations.

Source code

All source code and findings are available in our GitHub repo.

Acknowledgments

Vultr

Vultr provided access to 8x AMD MI300X GPUs. We are truly thankful for their support.

If you're looking for top-tier bare-metal compute with AMD GPUs, we highly recommend Vultr. Provisioning and accessing their compute via dstack is seamless and straightforward.

Lambda

Lambda provided access to 8x NVIDIA H200 GPUs. We are truly thankful for their support.

Both Vultr and Lambda are natively supported and can be seamlessly integrated with dstack.