TensorRT-LLM

This example shows how to deploy nvidia/Qwen3-235B-A22B-FP8 using TensorRT-LLM and dstack.

Apply a configuration

Here's an example of a service that deploys nvidia/Qwen3-235B-A22B-FP8 using TensorRT-LLM.

type: service
name: qwen235

image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc11

env:
  - HF_HUB_ENABLE_HF_TRANSFER=1

commands:
  - pip install hf_transfer
  - |
    trtllm-serve serve nvidia/Qwen3-235B-A22B-FP8 \
      --host 0.0.0.0 \
      --port 8000 \
      --backend pytorch \
      --tp_size $DSTACK_GPUS_NUM \
      --max_batch_size 32 \
      --max_num_tokens 4096 \
      --kv_cache_free_gpu_memory_fraction 0.75

port: 8000
model: nvidia/Qwen3-235B-A22B-FP8

volumes:
  - instance_path: /root/.cache
    path: /root/.cache
    optional: true

resources:
  cpu: 96..
  memory: 512GB..
  shm_size: 32GB
  disk: 1000GB..
  gpu: H100:8
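
Note that dstack automatically sets DSTACK_GPUS_NUM to the number of GPUs allocated to the run, so the --tp_size passed to trtllm-serve always matches the gpu spec under resources.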

Apply it with dstack apply:

$ dstack apply -f qwen235.dstack.yml
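
dstack apply provisions an instance that satisfies the resources requirements, pulls the image, and starts the service. Once the run is submitted, you can check on it with dstack ps:

$ dstack ps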

Access the endpoint

If no gateway is created, the service endpoint will be available at <dstack server URL>/proxy/services/<project name>/<run name>/.

$ curl http://127.0.0.1:3000/proxy/services/main/qwen235/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "nvidia/Qwen3-235B-A22B-FP8",
      "messages": [
        {
          "role": "user",
          "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?"
        }
      ],
      "chat_template_kwargs": {"enable_thinking": true},
      "max_tokens": 1024,
      "temperature": 0.0
    }'
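
Since trtllm-serve exposes an OpenAI-compatible API, the endpoint can also be queried with the OpenAI Python SDK. Below is a minimal sketch assuming the same proxy URL as in the curl example; replace <dstack token> with your dstack user token:

from openai import OpenAI

# The dstack proxy URL and token from the curl example above
client = OpenAI(
    base_url="http://127.0.0.1:3000/proxy/services/main/qwen235/v1",
    api_key="<dstack token>",
)

response = client.chat.completions.create(
    model="nvidia/Qwen3-235B-A22B-FP8",
    messages=[
        {
            "role": "user",
            "content": "A bat and a ball cost $1.10 total. The bat costs "
            "$1.00 more than the ball. How much does the ball cost?",
        }
    ],
    # Non-standard fields such as chat_template_kwargs are passed via extra_body
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    max_tokens=1024,
    temperature=0.0,
)

print(response.choices[0].message.content)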

When a gateway is configured, the service endpoint will be available at https://qwen235.<gateway domain>/.
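
For example (substituting your actual gateway domain and dstack token):

$ curl https://qwen235.<gateway domain>/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "nvidia/Qwen3-235B-A22B-FP8",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'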

What's next?

  1. Read about services and gateways
  2. Browse the TensorRT-LLM deployment guides and the Qwen3 deployment guide
  3. See the trtllm-serve reference