SGLang¶
This example shows how to deploy Qwen/Qwen3.6-27B using
SGLang and dstack.
For a
DeepSeek-V4-Prodeployment onB200:8, see the DeepSeek V4 model page.
Apply a configuration¶
Here's an example of a service that deploys
Qwen/Qwen3.6-27B using SGLang.
type: service
name: qwen36
image: lmsysorg/sglang:v0.5.10.post1
commands:
- |
sglang serve \
--model-path Qwen/Qwen3.6-27B \
--host 0.0.0.0 \
--port 30000 \
--tp $DSTACK_GPUS_NUM \
--reasoning-parser qwen3 \
--mem-fraction-static 0.8 \
--context-length 262144
port: 30000
model: Qwen/Qwen3.6-27B
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
shm_size: 16GB
gpu: H100:4
type: service
name: qwen36
image: lmsysorg/sglang:v0.5.10-rocm720-mi30x
commands:
- |
sglang serve \
--model-path Qwen/Qwen3.6-27B \
--host 0.0.0.0 \
--port 30000 \
--tp $DSTACK_GPUS_NUM \
--reasoning-parser qwen3 \
--mem-fraction-static 0.8 \
--context-length 262144
port: 30000
model: Qwen/Qwen3.6-27B
volumes:
- instance_path: /root/.cache
path: /root/.cache
optional: true
resources:
cpu: 52..
memory: 896GB..
shm_size: 16GB
disk: 450GB..
gpu: MI300X:4
The AMD example keeps the deployment close to the upstream Qwen and SGLang
guidance: a pinned ROCm image, tensor parallelism across all four GPUs, and the
standard qwen3 reasoning parser without extra ROCm-specific tuning flags.
Save one of the configurations above as service.dstack.yml, then use the
dstack apply command.
$ dstack apply -f service.dstack.yml
If no gateway is created, the service endpoint will be available at <dstack server URL>/proxy/services/<project name>/<run name>/.
curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <user token>' \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3.6-27B",
"messages": [
{
"role": "user",
"content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer with just the dollar amount."
}
],
"separate_reasoning": true,
"max_tokens": 1024
}'
Qwen3.6 uses thinking mode by default. To disable thinking, pass
"chat_template_kwargs": {"enable_thinking": false} in the request body. To
enable tool calling, add --tool-call-parser qwen3_coder to the serve command.
If a gateway is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at
https://qwen36.<gateway domain>/.
Configuration options¶
PD disaggregation¶
To run SGLang with PD disaggregation, use replica groups: one for Shepherd Model Gateway (SMG), one for prefill workers, and one for decode workers.
type: service
name: prefill-decode
image: lmsysorg/sglang:v0.5.10.post1
env:
- HF_TOKEN
- MODEL_ID=zai-org/GLM-4.5-Air-FP8
replicas:
- count: 1
# For now replica group with router must have count: 1
commands:
- pip install smg
- |
smg launch \
--host 0.0.0.0 \
--port 8000 \
--pd-disaggregation \
--prefill-policy cache_aware
resources:
cpu: 4
router:
type: sglang
- count: 1..4
scaling:
metric: rps
target: 3
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
resources:
gpu: H200
- count: 1..8
scaling:
metric: rps
target: 2
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--port 8000
resources:
gpu: H200
port: 8000
model: zai-org/GLM-4.5-Air-FP8
# Custom probe is required for PD disaggregation.
probes:
- type: http
url: /health
interval: 15s
With the
sglangrouter, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
The example below deploys Qwen/Qwen2.5-72B-Instruct on a multi-node cluster with AMD MI300X GPUs:
type: service
name: amd-sglang-pd-service
image: rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260427
privileged: true
env:
- MODEL_ID=Qwen/Qwen2.5-72B-Instruct
- HF_TOKEN
- SGLANG_USE_AITER=0
- SGLANG_ROCM_FUSED_DECODE_MLA=0
- SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
- SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600
- RDMA_DEVICES=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7
- NCCL_IB_DISABLE=1
replicas:
- count: 1
commands:
- pip install smg
- |
smg launch \
--pd-disaggregation \
--host 0.0.0.0 \
--port 30000
resources:
cpu: 4..
router:
type: sglang
- count: 1..2
scaling:
metric: rps
target: 300
commands:
- |
python3 -m sglang.launch_server \
--model $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 30000 \
--tp $DSTACK_GPUS_NUM \
--trust-remote-code \
--disaggregation-ib-device $RDMA_DEVICES \
--disaggregation-bootstrap-port 8998 \
--disable-radix-cache \
--disable-cuda-graph \
--disable-overlap-schedule \
--mem-fraction-static 0.8 \
--max-running-requests 1024
resources:
gpu: MI300X:8
cpu: 96..
memory: 512GB..
- count: 1..4
scaling:
metric: rps
target: 300
commands:
- |
python3 -m sglang.launch_server \
--model $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 30000 \
--tp $DSTACK_GPUS_NUM \
--trust-remote-code \
--disaggregation-ib-device $RDMA_DEVICES \
--disable-radix-cache \
--disable-cuda-graph \
--disable-overlap-schedule \
--decode-attention-backend triton \
--mem-fraction-static 0.8 \
--max-running-requests 1024
resources:
gpu: MI300X:8
cpu: 96..
memory: 512GB..
port: 30000
model: Qwen/Qwen2.5-72B-Instruct
# Custom probe is required for PD disaggregation.
probes:
- type: http
url: /health
interval: 15s
volumes:
- /usr/lib64/libibverbs/libbnxt_re-rdmav34.so:/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so
RoCE library
Mooncake uses the RDMA/RoCE interconnect for KV Cache transer. To use the RDMA/RoCE interconnect on Broadcom bnxt_re devices, Mooncake requires the Broadcom-specific userspace provider library libbnxt_re-rdmav34.so to be available inside the container at /usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so. We make this library available by mounting the host provider library from /usr/lib64/libibverbs/libbnxt_re-rdmav34.so.
Currently, auto-scaling only supports rps as the metric. TTFT and ITL metrics are coming soon.
Cluster
PD disaggregation requires the service to run in a fleet with placement set to cluster, because the replicas require an interconnect between instances.
While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
What's next?¶
- Read about services and gateways
- Browse the Qwen 3.6 SGLang cookbook and the SGLang server arguments reference