Using SSH fleets with TensorWave's private AMD cloud¶
Since last month, when we introduced support for private clouds and data centers, it has become easier to use dstack to orchestrate AI containers with any AI cloud vendor, whether they provide on-demand compute or reserved clusters. In this tutorial, we'll walk you through how dstack can be used with TensorWave via SSH fleets.
TensorWave is a cloud provider specializing in large-scale AMD GPU clusters for both training and inference.
Before following this tutorial, ensure you have access to a cluster. You’ll see the cluster and its nodes in your TensorWave dashboard.
Creating a fleet¶
Prerequisites
Once dstack is installed, create a project repo folder and run dstack init.
$ mkdir tensorwave-demo && cd tensorwave-demo
$ dstack init
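If you don't have a dstack server running yet, here's a minimal setup sketch based on the pip installation path from the dstack docs (run this before dstack init; other installation options exist):

$ pip install "dstack[all]" -U
$ dstack server

Starting the server prints its URL and an admin token, which you'll need to configure the CLI.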
Now, define an SSH fleet configuration by listing the IP addresses of each node in the cluster, along with the SSH user and SSH key configured for each host.
type: fleet
name: my-tensorwave-fleet
placement: cluster
ssh_config:
  user: dstack
  identity_file: ~/.ssh/id_rsa
  hosts:
    - hostname: 64.139.222.107
      blocks: auto
    - hostname: 64.139.222.108
      blocks: auto
You can set blocks to auto if you want to run concurrent workloads on the same instance. Otherwise, you can omit this property.
Once the configuration is ready, apply it using dstack apply:
$ dstack apply -f fleet.dstack.yml
Provisioning...
---> 100%
 FLEET                INSTANCE  RESOURCES         STATUS    CREATED
 my-tensorwave-fleet  0         8xMI300X (192GB)  0/8 busy  3 mins ago
                      1         8xMI300X (192GB)  0/8 busy  3 mins ago
dstack will automatically connect to each host, detect the hardware, install dependencies, and make them ready for workloads.
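You can list the fleet and its instances at any point with the dstack fleet CLI command:

$ dstack fleet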
Running workloads¶
Once the fleet is created, you can use dstack to run workloads.
Dev environments¶
A dev environment lets you access an instance through your desktop IDE.
type: dev-environment
name: vscode
image: rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.4.0
ide: vscode
resources:
  gpu: MI300X:8
Apply the configuration via dstack apply:
$ dstack apply -f .dstack.yml
Submit the run `vscode`? [y/n]: y
Launching `vscode`...
---> 100%
To open in VS Code Desktop, use this link:
vscode://vscode-remote/ssh-remote+vscode/workflow
Open the link to access the dev environment using your desktop IDE.
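Once connected, it may be worth sanity-checking that all eight GPUs are visible inside the container. A quick check, assuming the ROCm PyTorch image above (PyTorch's ROCm builds expose AMD GPUs through the torch.cuda API):

$ rocm-smi
$ python3 -c "import torch; print(torch.cuda.device_count())"

The second command should print 8 if all GPUs are available.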
Tasks¶
A task allows you to schedule a job or run a web app. Tasks can be distributed and support port forwarding.
Below is a distributed training task configuration:
type: task
name: train-distrib
nodes: 2
image: rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.4.0
commands:
  - pip install torch
  - export NCCL_IB_GID_INDEX=3
  - export NCCL_NET_GDR_LEVEL=0
  - torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$DSTACK_NODE_RANK --master_port=29600 --master_addr=$DSTACK_MASTER_NODE_IP test/tensorwave/multinode.py 5000 50
resources:
  gpu: MI300X:8
Run the configuration via dstack apply:
$ dstack apply -f train.dstack.yml
Submit the run `train-distrib`? [y/n]: y
Provisioning `train-distrib`...
---> 100%
dstack automatically runs the container on each node, passing system environment variables that you can use with torchrun, accelerate, or other distributed frameworks.
Services¶
A service allows you to deploy a model or any web app as a scalable and secure endpoint.
Create the following configuration file inside the repo:
type: service
name: deepseek-r1-sglang
image: rocm/sglang-staging:20250212
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1
  - HSA_NO_SCRATCH_RECLAIM=1
commands:
  - python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --tp 8 --trust-remote-code
port: 8000
model: deepseek-ai/DeepSeek-R1
resources:
  gpu: MI300X:8
volumes:
  - /root/.cache/huggingface:/root/.cache/huggingface
Run the configuration via dstack apply:
$ dstack apply -f deepseek.dstack.yml
Submit the run `deepseek-r1-sglang`? [y/n]: y
Provisioning `deepseek-r1-sglang`...
---> 100%
Service is published at:
http://localhost:3000/proxy/services/main/deepseek-r1-sglang/
Model deepseek-ai/DeepSeek-R1 is published at:
http://localhost:3000/proxy/models/main/
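Since SGLang serves an OpenAI-compatible API, you can query the published model through the gateway. A sketch, assuming the model endpoint above follows dstack's OpenAI-compatible chat completions route and that your dstack user token is passed as a bearer token:

$ curl http://localhost:3000/proxy/models/main/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'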
See it in action¶
Want to see how it works? Check out the video below:
What's next?
- See SSH fleets
- Read about dev environments, tasks, and services
- Join Discord