Supporting Intel Gaudi AI accelerators

At dstack, our goal is to make AI container orchestration simpler and fully vendor-agnostic. That’s why we support not just leading cloud providers and on-prem environments but also a wide range of accelerators.

With our latest release, we’re adding support for Intel Gaudi AI accelerators and launching a new partnership with Intel.

About Intel Gaudi

Intel Gaudi is a series of AI accelerators developed by Habana Labs, an Intel company. Gaudi is built for high-performance AI training and inference, offering high throughput and efficiency. Its scalable design pairs a large number of compute cores with ample memory bandwidth, enabling strong performance per watt.

Here's a brief spec for Gaudi 2 and Gaudi 3:

                  Gaudi 2     Gaudi 3
MME Units         2           8
TPC Units         24          64
HBM Capacity      96 GB       128 GB
HBM Bandwidth     2.46 TB/s   3.7 TB/s
Networking        600 GB/s    1200 GB/s
FP8 Performance   865 TFLOPS  1835 TFLOPS
BF16 Performance  432 TFLOPS  1835 TFLOPS

In the latest release, dstack now supports the orchestration of containers across on-prem machines equipped with Intel Gaudi accelerators.

Create a fleet

To manage container workloads on on-prem machines with Intel Gaudi accelerators, start by configuring an SSH fleet. Here’s an example configuration for your fleet:

type: fleet
name: my-gaudi2-fleet
ssh_config:
  hosts:
    - hostname: 100.83.163.67
      user: sdp
      identity_file: ~/.ssh/id_rsa
      blocks: auto
    - hostname: 100.83.163.68
      user: sdp
      identity_file: ~/.ssh/id_rsa
      blocks: auto
  proxy_jump:
    hostname: 146.152.186.135
    user: guest
    identity_file: ~/.ssh/intel_id_rsa

To provision the fleet, run the dstack apply command:

$ dstack apply -f examples/misc/fleets/gaudi.dstack.yml

Provisioning...
---> 100%

 FLEET            INSTANCE  BACKEND  GPU                        STATUS  CREATED 
 my-gaudi2-fleet  0         ssh      152xCPU, 1007GB, 8xGaudi2  idle    3 mins ago
                                     (96GB), 388.0GB (disk)     
                  1         ssh      152xCPU, 1007GB, 8xGaudi2  idle    3 mins ago
                                     (96GB), 388.0GB (disk)     
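
If you want to re-check the fleet later, the dstack fleet command lists your fleets and their instances:

$ dstack fleet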

Apply a configuration

With your fleet provisioned, you can now run dev environments, tasks, and services on it.
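
For instance, here’s a minimal sketch of a dev environment configuration targeting this fleet; the name is illustrative, and the image is the same Habana PyTorch image used in the task example below:

type: dev-environment
name: gaudi-dev

image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
ide: vscode

resources:
  gpu: gaudi2:8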

Below is an example of a task configuration for fine-tuning the DeepSeek-R1-Distill-Qwen-7B model using Optimum for Intel Gaudi and DeepSpeed with the lvwerra/stack-exchange-paired dataset:

type: task
name: trl-train

image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - WANDB_API_KEY
  - WANDB_PROJECT
commands:
   - pip install --upgrade-strategy eager optimum[habana]
   - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
   - git clone https://github.com/huggingface/optimum-habana.git
   - cd optimum-habana/examples/trl
   - pip install -r requirements.txt
   - pip install wandb
   - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size $DSTACK_GPUS_NUM --use_deepspeed sft.py
       --model_name_or_path $MODEL_ID
       --dataset_name "lvwerra/stack-exchange-paired"
       --deepspeed ../language-modeling/llama2_ds_zero3_config.json
       --output_dir="./sft"
       --do_train
       --max_steps=500
       --logging_steps=10
       --save_steps=100
       --per_device_train_batch_size=1
       --per_device_eval_batch_size=1
       --gradient_accumulation_steps=2
       --learning_rate=1e-4
       --lr_scheduler_type="cosine"
       --warmup_steps=100
       --weight_decay=0.05
       --optim="paged_adamw_32bit"
       --lora_target_modules "q_proj" "v_proj"
       --bf16
       --remove_unused_columns=False
       --run_name="sft_deepseek_70"
       --report_to="wandb"
       --use_habana
       --use_lazy_mode

resources:
  gpu: gaudi2:8

Submit the task using the dstack apply command:

$ dstack apply -f examples/fine-tuning/trl/intel/.dstack.yml -R

dstack will automatically create containers according to the run configuration and execute them across the fleet.
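
If you detach from the run, you can follow its logs or stop it by name (trl-train, as set in the configuration above):

$ dstack logs trl-train

$ dstack stop trl-train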

Explore our examples to learn how to train and deploy large models on Intel Gaudi AI accelerators.
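
As a rough illustration of deployment, below is a minimal service configuration sketch that serves a model with TGI on Gaudi; the image tag, launcher arguments, and port are assumptions rather than a tested recipe:

type: service
name: tgi-gaudi

# Assumption: check the tgi-gaudi releases for a current image tag
image: ghcr.io/huggingface/tgi-gaudi:2.0.6
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
commands:
  - text-generation-launcher --model-id $MODEL_ID --port 8000
port: 8000

resources:
  gpu: gaudi2:1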

Intel Tiber AI Cloud

At dstack, we’re grateful to be part of the Intel Liftoff program, which gave us access to Intel Gaudi AI accelerators via Intel Tiber AI Cloud. If you’d like to try Gaudi accelerators in the cloud, you can sign up there as well.

Native integration with Intel Tiber AI Cloud is also coming soon to dstack.

What's next?

  1. Refer to Quickstart
  2. Check dev environments, tasks, services, and fleets
  3. Join Discord