DeepSeek V4¶
This example shows how to deploy deepseek-ai/DeepSeek-V4-Pro as a
service using
SGLang and dstack.
Apply a configuration¶
Save the following configuration as deepseek-v4.dstack.yml.
type: service
name: deepseek-v4
image: lmsysorg/sglang:deepseek-v4-blackwell
env:
- HF_TOKEN
- SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
- SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
commands:
- |
sglang serve \
--trust-remote-code \
--model-path deepseek-ai/DeepSeek-V4-Pro \
--tp 8 \
--dp 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.82 \
--cuda-graph-max-bs 64 \
--max-running-requests 256 \
--deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' \
--tool-call-parser deepseekv4 \
--reasoning-parser deepseek-v4 \
--host 0.0.0.0 \
--port 30000
port: 30000
model: deepseek-ai/DeepSeek-V4-Pro
volumes:
- instance_path: /root/.cache
  path: /root/.cache
  optional: true
resources:
gpu: B200:8
shm_size: 32GB
disk: 2TB..
This configuration follows the single-node Blackwell recipe for DeepSeek-V4-Pro,
sized for 8 x NVIDIA B200 GPUs.
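Note how the GPU count lines up with the parallelism flags. A minimal sanity-check sketch, under the assumption that with SGLang's `--enable-dp-attention` the data-parallel attention ranks are carved out of the tensor-parallel group (so the total GPU count equals `tp`, and `dp` must divide `tp`); `required_gpus` is a hypothetical helper, not an SGLang API:

```python
# Sketch: how many GPUs a tp/dp combination needs.
# Assumption: with DP attention enabled, dp ranks reuse the TP group,
# so the world size is tp (not tp * dp), and dp must divide tp.
def required_gpus(tp: int, dp: int, dp_attention: bool) -> int:
    if dp_attention:
        if tp % dp != 0:
            raise ValueError("dp must divide tp when DP attention reuses the TP group")
        return tp
    return tp * dp

# The recipe above: tp 8, dp 8, DP attention on -> 8 GPUs, matching gpu: B200:8.
print(required_gpus(tp=8, dp=8, dp_attention=True))
```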
Export your Hugging Face token and apply the configuration with
dstack apply.
$ export HF_TOKEN=<your-hf-token>
$ dstack apply -f deepseek-v4.dstack.yml
If no gateway is created, the service endpoint will be available at
<dstack server URL>/proxy/services/<project name>/<run name>/.
curl http://127.0.0.1:3000/proxy/services/main/deepseek-v4/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <dstack token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [
{
"role": "user",
"content": "What is 15% of 240? Reply with just the number."
}
],
"temperature": 0,
"max_tokens": 32
}'
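The same request can be scripted. A minimal sketch using only the Python standard library; the server URL, project name, and run name below mirror the curl example and are placeholders for your own values:

```python
import json

# Placeholders — substitute your dstack server URL, project, and run name.
server_url = "http://127.0.0.1:3000"
project, run = "main", "deepseek-v4"

# Endpoint pattern when no gateway is configured:
# <dstack server URL>/proxy/services/<project name>/<run name>/
endpoint = f"{server_url}/proxy/services/{project}/{run}/v1/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [
        {"role": "user", "content": "What is 15% of 240? Reply with just the number."}
    ],
    "temperature": 0,
    "max_tokens": 32,
}

# The actual POST (requires the running service) could use urllib.request:
#   req = urllib.request.Request(
#       endpoint,
#       data=json.dumps(payload).encode(),
#       headers={"Authorization": "Bearer <dstack token>",
#                "Content-Type": "application/json"},
#   )
#   response = json.load(urllib.request.urlopen(req))
print(endpoint)
```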
Reasoning mode¶
To separate the model's reasoning into reasoning_content, keep
--reasoning-parser deepseek-v4 in the server command and send
chat_template_kwargs in the request body.
For raw HTTP requests, chat_template_kwargs and separate_reasoning must be
top-level JSON fields.
curl http://127.0.0.1:3000/proxy/services/main/deepseek-v4/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <dstack token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [
{
"role": "user",
"content": "Solve step by step: If 3x + 5 = 20, what is x?"
}
],
"temperature": 0,
"max_tokens": 256,
"chat_template_kwargs": {
"thinking": true
},
"separate_reasoning": true
}'
This returns both:
- reasoning_content: a separate reasoning trace
- content: the final user-visible answer
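A sketch of pulling the two fields apart on the client side, using only the standard library. The response body here is an illustrative stand-in for the shape described above, not actual model output:

```python
import json

# Illustrative response shape (stand-in values, not real model output).
raw = json.dumps({
    "choices": [{
        "message": {
            "reasoning_content": "3x + 5 = 20, so 3x = 15, so x = 5.",
            "content": "x = 5"
        }
    }]
})

message = json.loads(raw)["choices"][0]["message"]
# With separate_reasoning enabled, the trace and the answer arrive in
# distinct fields instead of one interleaved string.
reasoning = message.get("reasoning_content")  # the reasoning trace
answer = message["content"]                   # the user-visible answer
print(answer)
```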
Deployment notes¶
- The first startup can take several minutes while the model loads and SGLang finishes initialization.
- The optional /root/.cache instance volume helps reuse the model cache on backends that support instance volumes.
What's next?¶
- Read the DeepSeek-V4-Pro model card
- Read the DeepSeek-V4 SGLang cookbook
- Browse the dedicated SGLang and vLLM examples