Deepseek

This example walks you through how to deploy and train Deepseek models with dstack.

We use Deepseek-R1 distilled models and Deepseek-V2-Lite, a 16B model with the same architecture as Deepseek-R1 (671B). Deepseek-V2-Lite retains MLA and DeepSeekMoE but requires less memory, making it ideal for testing and fine-tuning on smaller GPUs.

Prerequisites

Once dstack is installed, clone the repo and run dstack init.

$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init

Deployment

AMD

Here's an example of a service that deploys Deepseek-R1-Distill-Llama-70B using SGLang and vLLM on an AMD MI300X GPU. The configurations below also support Deepseek-V2-Lite.

type: service
name: deepseek-r1-amd

image: lmsysorg/sglang:v0.4.1.post4-rocm620
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
commands:
   - python3 -m sglang.launch_server
     --model-path $MODEL_ID
     --port 8000
     --trust-remote-code

port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: MI300X
  disk: 300GB

type: service
name: deepseek-r1-amd

image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  - MAX_MODEL_LEN=126432
commands:
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --trust-remote-code
port: 8000

model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: MI300X
  disk: 300GB

Note that when serving Deepseek-R1-Distill-Llama-70B with vLLM on a 192GB GPU, the context size must be limited to 126432 tokens to fit into memory.
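
To see where this limit comes from, compare the weight footprint plus the KV cache against the GPU capacity. Below is a rough Python sketch of that arithmetic; the layer count, KV-head count, and head dimension are assumptions based on the Llama-70B architecture (check the model's config.json), not values taken from the configuration above.

# Back-of-the-envelope check of why a 126432-token context fits a 192GB GPU.
# The architecture constants below are assumed Llama-70B (GQA) values.
GPU_MEM_GIB   = 192                        # MI300X
WEIGHTS_GIB   = 140                        # ~70B params x 2 bytes (FP16/BF16)
MAX_MODEL_LEN = 126432                     # value from the vLLM configuration above
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128    # assumed Llama-70B architecture
BYTES_FP16    = 2

# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes per value
kv_per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16   # ~320 KiB
kv_cache_gib = MAX_MODEL_LEN * kv_per_token_bytes / 1024**3          # ~38.6 GiB

# ~178.6 GiB in total; the remainder of the 192GB GPU is left for activations
# and the runtime's own reservations.
print(round(WEIGHTS_GIB + kv_cache_gib, 1))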

Intel Gaudi

Here's an example of a service that deploys Deepseek-R1-Distill-Llama-70B using TGI on Gaudi and vLLM (Gaudi fork) with Intel Gaudi 2.

Neither TGI on Gaudi nor the vLLM Gaudi fork supports Deepseek-V2-Lite; see the open issues in the respective repositories.

type: service

name: tgi

image: ghcr.io/huggingface/tgi-gaudi:2.3.1

auth: false
port: 8000

model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

env:
  - HF_TOKEN
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  - PORT=8000
  - OMPI_MCA_btl_vader_single_copy_mechanism=none
  - TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true
  - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
  - MAX_TOTAL_TOKENS=2048
  - BATCH_BUCKET_SIZE=256
  - PREFILL_BATCH_BUCKET_SIZE=4
  - PAD_SEQUENCE_TO_MULTIPLE_OF=64
  - ENABLE_HPU_GRAPH=true
  - LIMIT_HPU_GRAPH=true
  - USE_FLASH_ATTENTION=true
  - FLASH_ATTENTION_RECOMPUTE=true

commands:
  - text-generation-launcher
      --sharded true
      --num-shard 8
      --max-input-length 1024
      --max-total-tokens 2048
      --max-batch-prefill-tokens 4096
      --max-batch-total-tokens 524288
      --max-waiting-tokens 7
      --waiting-served-ratio 1.2
      --max-concurrent-requests 512

resources:
  gpu: Gaudi2:8

type: service
name: deepseek-r1-gaudi

image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest


env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  - HABANA_VISIBLE_DEVICES=all
  - OMPI_MCA_btl_vader_single_copy_mechanism=none  

commands:
  - git clone https://github.com/HabanaAI/vllm-fork.git
  - cd vllm-fork
  - git checkout habana_main
  - pip install -r requirements-hpu.txt
  - python setup.py develop
  - vllm serve $MODEL_ID
    --tensor-parallel-size 8
    --trust-remote-code
    --download-dir /data

port: 8000

resources:
  gpu: gaudi2:8

NVIDIA

Here's an example of a service that deploys Deepseek-R1-Distill-Llama-8B using SGLang and vLLM with NVIDIA GPUs. Both SGLang and vLLM also support Deepseek-V2-Lite.

type: service
name: deepseek-r1-nvidia

image: lmsysorg/sglang:latest
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
commands:
    - python3 -m sglang.launch_server
      --model-path $MODEL_ID
      --port 8000
      --trust-remote-code

port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B

resources:
  gpu: 24GB

type: service
name: deepseek-r1-nvidia

image: vllm/vllm-openai:latest
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  - MAX_MODEL_LEN=4096
commands:
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN 
port: 8000 
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B

resources:
  gpu: 24GB

Note that to run Deepseek-R1-Distill-Llama-8B with vLLM on a 24GB GPU, the context size must be limited to 4096 tokens to fit into memory.

To run Deepseek-V2-Lite with vLLM, a 40GB GPU is required; with SGLang, an 80GB GPU is required. For more details on SGLang's memory requirements, refer to the related issue in the SGLang repository.

Memory requirements

Approximate memory requirements for loading the model (excluding context and CUDA/ROCm kernel reservations).

Model                         Size   FP16     FP8      INT4
Deepseek-R1                   671B   1.35TB   671GB    336GB
DeepSeek-R1-Distill-Llama     70B    161GB    80.5GB   40GB
DeepSeek-R1-Distill-Qwen      32B    74GB     37GB     18.5GB
DeepSeek-V2-Lite              16B    35GB     17.5GB   8.75GB
DeepSeek-R1-Distill-Qwen      14B    32GB     16GB     8GB
DeepSeek-R1-Distill-Llama     8B     18GB     9GB      4.5GB
DeepSeek-R1-Distill-Qwen      7B     16GB     8GB      4GB

For example, the FP8 version of Deepseek-R1 671B fits on a single node of MI300X with eight 192GB GPUs, a single node of H200 with eight 141GB GPUs, or a single node of Intel Gaudi2 with eight 96GB GPUs.
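The table follows a simple rule of thumb: parameter count × bytes per parameter (2 bytes for FP16, 1 byte for FP8, 0.5 bytes for INT4), with most rows rounded up by roughly 10–15% for headroom. Here is a minimal Python sketch of that rule; the headroom factor is our own assumption, not something taken from the table.

# Rough weight-memory estimate: parameters x bytes per parameter.
# Excludes KV cache, activations, and CUDA/ROCm kernel reservations.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billion: float, precision: str, headroom: float = 1.0) -> float:
    """Approximate memory (GB) to load the weights; headroom > 1.0 adds a safety margin."""
    return params_billion * BYTES_PER_PARAM[precision] * headroom

# DeepSeek-R1 (671B) in FP8 -> ~671GB, which is why it fits on a single node of
# eight MI300X (192GB), H200 (141GB), or Gaudi2 (96GB) accelerators.
print(round(weight_memory_gb(671, "FP8")))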

Applying the configuration

To run a configuration, use the dstack apply command.

$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml

 #  BACKEND  REGION     RESOURCES                         SPOT  PRICE   
 1  runpod   EU-RO-1   24xCPU, 283GB, 1xMI300X (192GB)    no    $2.49  

Submit the run deepseek-r1-amd? [y/n]: y

Provisioning...
---> 100%

Once the service is up, the model will be available via the OpenAI-compatible endpoint at <dstack server URL>/proxy/models/<project name>/.

curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "stream": true,
      "max_tokens": 512
    }'
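
The same endpoint can also be called from any OpenAI-compatible SDK. Here is a minimal sketch using the openai Python package, assuming the dstack server runs at http://127.0.0.1:3000 and the project is named main, as in the curl example above; replace the token placeholder with your dstack token.

# Minimal OpenAI-compatible client call against dstack's model proxy.
from openai import OpenAI

client = OpenAI(
    # <dstack server URL>/proxy/models/<project name>
    base_url="http://127.0.0.1:3000/proxy/models/main",
    api_key="<dstack token>",
)

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    stream=True,
    max_tokens=512,
)

for chunk in stream:
    # Each streamed chunk carries the next piece of the reply in delta.content.
    print(chunk.choices[0].delta.content or "", end="")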

When a gateway is configured, the OpenAI-compatible endpoint is available at https://gateway.<gateway domain>/.

Fine-tuning

AMD

Here are examples of LoRA fine-tuning of Deepseek-V2-Lite and GRPO fine-tuning of DeepSeek-R1-Distill-Qwen-1.5B on an MI300X GPU using Hugging Face's TRL.

type: task
name: trl-train

image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0

env:
  - WANDB_API_KEY
  - WANDB_PROJECT
  - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite
  - ACCELERATE_USE_FSDP=False
commands:
  - git clone https://github.com/huggingface/peft.git
  - pip install trl
  - pip install "numpy<2"
  - pip install peft
  - pip install wandb
  - cd peft/examples/sft
  - python train.py
    --seed 100
    --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite"
    --dataset_name "smangrul/ultrachat-10k-chatml"
    --chat_template_format "chatml"
    --add_special_tokens False
    --append_concat_token False
    --splits "train,test"
    --max_seq_len 512
    --num_train_epochs 1
    --logging_steps 5
    --log_level "info"
    --logging_strategy "steps"
    --eval_strategy "epoch"
    --save_strategy "epoch"
    --hub_private_repo True
    --hub_strategy "every_save"
    --packing True
    --learning_rate 1e-4
    --lr_scheduler_type "cosine"
    --weight_decay 1e-4
    --warmup_ratio 0.0
    --max_grad_norm 1.0
    --output_dir "deepseek-sft-lora"
    --per_device_train_batch_size 8
    --per_device_eval_batch_size 8
    --gradient_accumulation_steps 4
    --gradient_checkpointing True
    --use_reentrant True
    --dataset_text_field "content"
    --use_peft_lora True
    --lora_r 16
    --lora_alpha 16
    --lora_dropout 0.05
    --lora_target_modules "all-linear"

resources:
  gpu: MI300X
  disk: 150GB

type: task
name: trl-train-grpo

image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 

env:
  - WANDB_API_KEY
  - WANDB_PROJECT
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
commands:
  - pip install trl
  - pip install datasets
  # NumPy < 2 is required for the SciPy installation on AMD.
  - pip install "numpy<2"
  - mkdir -p grpo_example
  - cp examples/llms/deepseek/trl/amd/grpo_train.py grpo_example/grpo_train.py
  - cd grpo_example
  - python grpo_train.py
    --model_name_or_path $MODEL_ID
    --dataset_name trl-lib/tldr
    --per_device_train_batch_size 2
    --logging_steps 25
    --output_dir Deepseek-Distill-Qwen-1.5B-GRPO
    --trust_remote_code

resources:
  gpu: MI300X
  disk: 150GB

Note that GRPO fine-tuning of DeepSeek-R1-Distill-Qwen-1.5B consumes up to 135GB of VRAM.

Intel Gaudi

Here is an example of LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-7B on Intel Gaudi 2 GPUs using Hugging Face's Optimum for Intel Gaudi and DeepSpeed. Both also support LoRA fine-tuning of Deepseek-V2-Lite with the same configuration as below.

type: task
name: trl-train

image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0

env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - WANDB_API_KEY
  - WANDB_PROJECT
commands:
  - pip install --upgrade-strategy eager optimum[habana]
  - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
  - git clone https://github.com/huggingface/optimum-habana.git
  - cd optimum-habana/examples/trl
  - pip install -r requirements.txt
  - pip install wandb
  - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py
    --model_name_or_path $MODEL_ID
    --dataset_name "lvwerra/stack-exchange-paired"
    --deepspeed ../language-modeling/llama2_ds_zero3_config.json
    --output_dir="./sft"
    --do_train
    --max_steps=500
    --logging_steps=10
    --save_steps=100
    --per_device_train_batch_size=1
    --per_device_eval_batch_size=1
    --gradient_accumulation_steps=2
    --learning_rate=1e-4
    --lr_scheduler_type="cosine"
    --warmup_steps=100
    --weight_decay=0.05
    --optim="paged_adamw_32bit"
    --lora_target_modules "q_proj" "v_proj"
    --bf16
    --remove_unused_columns=False
    --run_name="sft_deepseek_70"
    --report_to="wandb"
    --use_habana
    --use_lazy_mode

resources:
  gpu: gaudi2:8

NVIDIA

Here are examples of LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-1.5B and QLoRA fine-tuning of DeepSeek-V2-Lite on NVIDIA GPUs using Hugging Face's TRL library.

type: task
name: trl-train

python: "3.10"

env:
  - WANDB_API_KEY
  - WANDB_PROJECT
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
commands:
  - git clone https://github.com/huggingface/trl.git
  - pip install trl
  - pip install peft
  - pip install wandb
  - cd trl/trl/scripts
  - python sft.py
    --model_name_or_path $MODEL_ID
    --dataset_name trl-lib/Capybara
    --learning_rate 2.0e-4
    --num_train_epochs 1
    --packing
    --per_device_train_batch_size 2
    --gradient_accumulation_steps 8
    --gradient_checkpointing
    --logging_steps 25
    --eval_strategy steps
    --eval_steps 100
    --use_peft
    --lora_r 32
    --lora_alpha 16
    --report_to wandb
    --output_dir DeepSeek-R1-Distill-Qwen-1.5B-SFT

resources:
  gpu: 24GB

type: task
name: trl-train-deepseek-v2

python: "3.10"
nvcc: true
env:
  - WANDB_API_KEY
  - WANDB_PROJECT
  - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite
  - ACCELERATE_USE_FSDP=False
commands:
  - git clone https://github.com/huggingface/peft.git
  - pip install trl
  - pip install peft
  - pip install wandb
  - pip install bitsandbytes
  - cd peft/examples/sft
  - python train.py
    --seed 100 
    --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite" 
    --dataset_name "smangrul/ultrachat-10k-chatml" 
    --chat_template_format "chatml" 
    --add_special_tokens False 
    --append_concat_token False 
    --splits "train,test" 
    --max_seq_len 512 
    --num_train_epochs 1 
    --logging_steps 5 
    --log_level "info" 
    --logging_strategy "steps" 
    --eval_strategy "epoch" 
    --save_strategy "epoch" 
    --hub_private_repo True 
    --hub_strategy "every_save" 
    --bf16 True 
    --packing True 
    --learning_rate 1e-4 
    --lr_scheduler_type "cosine" 
    --weight_decay 1e-4 
    --warmup_ratio 0.0 
    --max_grad_norm 1.0 
    --output_dir "mistral-sft-lora" 
    --per_device_train_batch_size 8 
    --per_device_eval_batch_size 8 
    --gradient_accumulation_steps 4 
    --gradient_checkpointing True 
    --use_reentrant True 
    --dataset_text_field "content" 
    --use_peft_lora True 
    --lora_r 16 
    --lora_alpha 16 
    --lora_dropout 0.05 
    --lora_target_modules "all-linear" 
    --use_4bit_quantization True 
    --use_nested_quant True 
    --bnb_4bit_compute_dtype "bfloat16"

resources:
  # Consumes ~25GB of vRAM for QLoRA fine-tuning deepseek-ai/DeepSeek-V2-Lite
  gpu: 48GB

Memory requirements

Model                         Size   Full fine-tuning   LoRA    QLoRA
Deepseek-R1                   671B   10.5TB             1.4TB   442GB
DeepSeek-R1-Distill-Llama     70B    1.09TB             151GB   46GB
DeepSeek-R1-Distill-Qwen      32B    512GB              70GB    21GB
DeepSeek-V2-Lite              16B    256GB              35GB    11GB
DeepSeek-R1-Distill-Qwen      14B    224GB              30GB    9GB
DeepSeek-R1-Distill-Llama     8B     128GB              17GB    5GB
DeepSeek-R1-Distill-Qwen      7B     112GB              15GB    4GB
DeepSeek-R1-Distill-Qwen      1.5B   24GB               3.2GB   1GB

The memory requirements assume low-rank update matrices are 1% of model parameters. In practice, a 7B model with QLoRA needs 7–10GB due to intermediate hidden states.

Fine-tuning type   Calculation
Full fine-tuning   671B × 16 bytes = 10.48TB
LoRA               671B × 2 bytes + 1% of 671B × 16 bytes = 1.41TB
QLoRA (4-bit)      671B × 0.5 bytes + 1% of 671B × 16 bytes = 442GB
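
The same arithmetic can be reproduced in a few lines of Python. This is a sketch under the table's assumptions (16 bytes per parameter for full fine-tuning, frozen base weights plus 16 bytes for the ~1% of parameters trained by LoRA/QLoRA); as noted above, real runs need extra room for activations.

# Fine-tuning memory rule of thumb used in the table above.
LORA_FRACTION = 0.01  # low-rank update matrices assumed to be ~1% of model parameters

def finetune_memory_gb(params_billion: float, method: str) -> float:
    trainable = params_billion * LORA_FRACTION * 16   # trainable params at 16 bytes each
    if method == "full":
        return params_billion * 16                    # weights + gradients + optimizer states
    if method == "lora":
        return params_billion * 2 + trainable         # frozen FP16 base weights
    if method == "qlora":
        return params_billion * 0.5 + trainable       # frozen 4-bit base weights
    raise ValueError(method)

# DeepSeek-R1 671B: ~10736GB full, ~1449GB LoRA, ~443GB QLoRA
for method in ("full", "lora", "qlora"):
    print(method, round(finetune_memory_gb(671, method)))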

Source code

The source code for this example can be found in examples/llms/deepseek.

What's next?

  1. Check dev environments, tasks, services, and protips.