# Intel Gaudi

`dstack` supports running dev environments, tasks, and services on Intel Gaudi GPUs via SSH fleets.
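Before running workloads, the Gaudi machines need to be provisioned as an SSH fleet. Below is a minimal sketch of such a fleet configuration; the user, identity file, and host addresses are placeholders for your own machines and credentials.

```yaml
type: fleet
name: gaudi-fleet

# Placeholder credentials and hosts -- replace with your own Gaudi machines
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.1.10
    - 192.168.1.11
```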
## Deployment
Serving frameworks like vLLM and TGI support Intel Gaudi. Below are examples of services that deploy DeepSeek-R1-Distill-Llama-70B using TGI on Gaudi and vLLM.
**TGI**

```yaml
type: service
name: tgi

image: ghcr.io/huggingface/tgi-gaudi:2.3.1
env:
  - HF_TOKEN
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  - PORT=8000
  - OMPI_MCA_btl_vader_single_copy_mechanism=none
  - TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true
  - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
  - MAX_TOTAL_TOKENS=2048
  - BATCH_BUCKET_SIZE=256
  - PREFILL_BATCH_BUCKET_SIZE=4
  - PAD_SEQUENCE_TO_MULTIPLE_OF=64
  - ENABLE_HPU_GRAPH=true
  - LIMIT_HPU_GRAPH=true
  - USE_FLASH_ATTENTION=true
  - FLASH_ATTENTION_RECOMPUTE=true
commands:
  - text-generation-launcher
    --sharded true
    --num-shard $DSTACK_GPUS_NUM
    --max-input-length 1024
    --max-total-tokens 2048
    --max-batch-prefill-tokens 4096
    --max-batch-total-tokens 524288
    --max-waiting-tokens 7
    --waiting-served-ratio 1.2
    --max-concurrent-requests 512
port: 8000

model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: gaudi2:8

# Uncomment to cache downloaded models
#volumes:
#  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
```
**vLLM**

```yaml
type: service
name: deepseek-r1-gaudi

image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  - HABANA_VISIBLE_DEVICES=all
  - OMPI_MCA_btl_vader_single_copy_mechanism=none
commands:
  - git clone https://github.com/HabanaAI/vllm-fork.git
  - cd vllm-fork
  - git checkout habana_main
  - pip install -r requirements-hpu.txt
  - python setup.py develop
  - vllm serve $MODEL_ID
    --tensor-parallel-size 8
    --trust-remote-code
    --download-dir /data
port: 8000

model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: gaudi2:8

# Uncomment to cache downloaded models
#volumes:
#  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
```
## Fine-tuning
Below is an example of LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-7B using Optimum for Intel Gaudi and DeepSpeed with the lvwerra/stack-exchange-paired dataset.
```yaml
type: task
name: trl-train

image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - WANDB_API_KEY
  - WANDB_PROJECT
commands:
  - pip install --upgrade-strategy eager optimum[habana]
  - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
  - git clone https://github.com/huggingface/optimum-habana.git
  - cd optimum-habana/examples/trl
  - pip install -r requirements.txt
  - pip install wandb
  - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size $DSTACK_GPUS_NUM --use_deepspeed sft.py
    --model_name_or_path $MODEL_ID
    --dataset_name "lvwerra/stack-exchange-paired"
    --deepspeed ../language-modeling/llama2_ds_zero3_config.json
    --output_dir="./sft"
    --do_train
    --max_steps=500
    --logging_steps=10
    --save_steps=100
    --per_device_train_batch_size=1
    --per_device_eval_batch_size=1
    --gradient_accumulation_steps=2
    --learning_rate=1e-4
    --lr_scheduler_type="cosine"
    --warmup_steps=100
    --weight_decay=0.05
    --optim="paged_adamw_32bit"
    --lora_target_modules "q_proj" "v_proj"
    --bf16
    --remove_unused_columns=False
    --run_name="sft_deepseek_70"
    --report_to="wandb"
    --use_habana
    --use_lazy_mode

resources:
  gpu: gaudi2:8
```
To fine-tune DeepSeek-R1-Distill-Llama-70B with eight Gaudi 2 accelerators, you can partially offload parameters to CPU memory via the DeepSpeed configuration file. For more details, refer to parameter offloading.
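As a rough illustration, a ZeRO-3 DeepSpeed configuration with parameter offloading to CPU could look like the sketch below. The values shown are assumptions, not the configuration shipped with the example; adapt them to your model and to the settings in `llama2_ds_zero3_config.json`.

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "contiguous_gradients": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```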
## Applying a configuration
Once the configuration is ready, run `dstack apply -f <configuration file>`.
```shell
$ dstack apply -f examples/deployment/vllm/.dstack.yml

 #  BACKEND  REGION  RESOURCES                       SPOT  PRICE
 1  ssh      remote  152xCPU, 1007GB, 8xGaudi2:96GB  yes   $0     idle

Submit a new run? [y/n]: y

Provisioning...
---> 100%
```
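Once the service is up, you can query the model through its OpenAI-compatible endpoint. Below is a minimal sketch assuming the service is published behind a dstack gateway; `example.com` stands in for your gateway domain, and you may also need to pass your dstack token in an `Authorization` header if authorization is enabled.

```shell
curl https://deepseek-r1-gaudi.example.com/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 128
  }'
```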
## Source code
The source code of this example can be found in `examples/llms/deepseek/tgi/intel`, `examples/llms/deepseek/vllm/intel`, and `examples/llms/deepseek/trl/intel`.
## What's next?
- Check dev environments, tasks, and services.
- See also Intel Gaudi Documentation, vLLM Inference with Gaudi, and Optimum for Gaudi examples.