Supporting Intel Gaudi AI accelerators

At dstack, our goal is to make AI container orchestration simpler and fully vendor-agnostic. That’s why we support not just leading cloud providers and on-prem environments but also a wide range of accelerators.

With our latest release, we’re adding support for Intel Gaudi AI accelerators and launching a new partnership with Intel.

About Intel Gaudi

Intel Gaudi is a series of AI accelerators developed by Habana Labs, an Intel company. Gaudi is built for high-performance AI training and inference, offering high throughput and efficiency. Its scalable design pairs a large number of compute cores with ample memory bandwidth, enabling strong performance per watt.

Here's a brief spec for Gaudi 2 and Gaudi 3:

                  Gaudi 2     Gaudi 3
MME Units         2           8
TPC Units         24          64
HBM Capacity      96 GB       128 GB
HBM Bandwidth     2.46 TB/s   3.7 TB/s
Networking        600 GB/s    1200 GB/s
FP8 Performance   865 TFLOPS  1835 TFLOPS
BF16 Performance  432 TFLOPS  1835 TFLOPS

In the latest release, dstack now supports the orchestration of containers across on-prem machines equipped with Intel Gaudi accelerators.

Create a fleet

To manage container workloads on on-prem machines with Intel Gaudi accelerators, start by configuring an SSH fleet. Here’s an example configuration for your fleet:

type: fleet
name: my-gaudi2-fleet
ssh_config:
  hosts:
    - hostname: 100.83.163.67
      user: sdp
      identity_file: ~/.ssh/id_rsa
      blocks: auto
    - hostname: 100.83.163.68
      user: sdp
      identity_file: ~/.ssh/id_rsa
      blocks: auto
  proxy_jump:
    hostname: 146.152.186.135
    user: guest
    identity_file: ~/.ssh/intel_id_rsa

To provision the fleet, run the dstack apply command:

$ dstack apply -f examples/misc/fleets/gaudi.dstack.yml

Provisioning...
---> 100%

 FLEET            INSTANCE  BACKEND  GPU                        STATUS  CREATED 
 my-gaudi2-fleet  0         ssh      152xCPU, 1007GB, 8xGaudi2  idle    3 mins ago
                                     (96GB), 388.0GB (disk)     
                  1         ssh      152xCPU, 1007GB, 8xGaudi2  idle    3 mins ago
                                     (96GB), 388.0GB (disk)     
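
If you want to re-check the fleet later, the dstack fleet command lists your fleets and their instances:

$ dstack fleet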

Apply a configuration

With your fleet provisioned, you can now run dev environments, tasks, and services on it.
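
For instance, here’s a minimal sketch of a dev environment configuration targeting this fleet; the name is illustrative, and the image is the same Habana PyTorch image used in the task example below:

type: dev-environment
name: gaudi-dev

image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
ide: vscode

resources:
  gpu: gaudi2:8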

Below is an example of a task configuration for fine-tuning the DeepSeek-R1-Distill-Qwen-7B model using Optimum for Intel Gaudi and DeepSpeed with the lvwerra/stack-exchange-paired dataset:

type: task
name: trl-train

image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - WANDB_API_KEY
  - WANDB_PROJECT
commands:
   - pip install --upgrade-strategy eager optimum[habana]
   - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
   - git clone https://github.com/huggingface/optimum-habana.git
   - cd optimum-habana/examples/trl
   - pip install -r requirements.txt
   - pip install wandb
   - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size $DSTACK_GPUS_NUM --use_deepspeed sft.py
       --model_name_or_path $MODEL_ID
       --dataset_name "lvwerra/stack-exchange-paired"
       --deepspeed ../language-modeling/llama2_ds_zero3_config.json
       --output_dir="./sft"
       --do_train
       --max_steps=500
       --logging_steps=10
       --save_steps=100
       --per_device_train_batch_size=1
       --per_device_eval_batch_size=1
       --gradient_accumulation_steps=2
       --learning_rate=1e-4
       --lr_scheduler_type="cosine"
       --warmup_steps=100
       --weight_decay=0.05
       --optim="paged_adamw_32bit"
       --lora_target_modules "q_proj" "v_proj"
       --bf16
       --remove_unused_columns=False
       --run_name="sft_deepseek_70"
       --report_to="wandb"
       --use_habana
       --use_lazy_mode

resources:
  gpu: gaudi2:8

Submit the task using the dstack apply command:

$ dstack apply -f examples/fine-tuning/trl/intel/.dstack.yml -R

dstack will automatically create containers according to the run configuration and execute them across the fleet.
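
If you detach from the run, you can follow its logs or stop it by name (trl-train, as set in the configuration above):

$ dstack logs trl-train

$ dstack stop trl-train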

Explore our examples to learn how to train and deploy large models on Intel Gaudi AI accelerators.
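
As a rough illustration of deployment, below is a minimal service configuration sketch that serves a model with TGI on Gaudi; the image tag, launcher arguments, and port are assumptions rather than a tested recipe:

type: service
name: tgi-gaudi

# Assumption: check the tgi-gaudi releases for a current image tag
image: ghcr.io/huggingface/tgi-gaudi:2.0.6
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
commands:
  - text-generation-launcher --model-id $MODEL_ID --port 8000
port: 8000

resources:
  gpu: gaudi2:1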

Intel Tiber AI Cloud

At dstack, we’re grateful to be part of the Intel Liftoff program, which gave us access to Intel Gaudi AI accelerators via Intel Tiber AI Cloud. If you’d like to try Gaudi accelerators in the cloud, you can sign up there as well.

Native integration with Intel Tiber AI Cloud is also coming soon to dstack.

What's next?

  1. Refer to Quickstart
  2. Check dev environments, tasks, services, and fleets
  3. Join Discord