If you've configured the gcp backend in dstack, you can run dev environments, tasks, and services on TPUs.
Choose a TPU instance by specifying the TPU version and the number of cores (e.g. v5litepod-8) in the gpu property under resources,
or request TPUs by specifying tpu as vendor (see examples).
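For instance, the resources section of a run configuration might look like this (a minimal sketch; the commented-out variant shows the vendor-based form):

```yaml
resources:
  # Request a specific single-host TPU by version and core count
  gpu: v5litepod-8

  # Alternatively, request any TPU by vendor:
  # gpu:
  #   vendor: tpu
```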
Below are a few examples of using TPUs for deployment and fine-tuning.
Multi-host TPUs
Currently, dstack supports only single-host TPUs, which means that
the maximum supported number of cores is 8 (e.g. v2-8, v3-8, v5litepod-8, v5p-8, v6e-8).
Multi-host TPU support is on the roadmap.
TPU storage
By default, each TPU VM contains a 100GB boot disk and its size cannot be changed.
If you need more storage, attach additional disks using Volumes.
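As a rough sketch (the volume name, region, size, and mount path below are hypothetical; see the Volumes documentation for the full spec), a GCP volume is defined in its own configuration and then attached to a run:

```yaml
# volume.dstack.yml — a network volume in the gcp backend (hypothetical name and region)
type: volume
name: tpu-data-volume
backend: gcp
region: us-central1
size: 200GB
```

```yaml
# In the run configuration, mount the volume at a path of your choice
volumes:
  - name: tpu-data-volume
    path: /data
```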
Many serving frameworks, including vLLM and TGI, have TPU support.
Here's an example of a service that deploys Llama 3.1 8B using Optimum TPU and vLLM.
```yaml
type: service
name: llama31-service-optimum-tpu

image: dstackai/optimum-tpu:llama31
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_TOTAL_TOKENS=4096
  - MAX_BATCH_PREFILL_TOKENS=4095
commands:
  - text-generation-launcher --port 8000
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

resources:
  gpu: v5litepod-4
```

examples/deployment/tgi/tpu/.dstack.yml
Note that for Optimum TPU, MAX_INPUT_TOKEN is set to 4095 by default, so MAX_BATCH_PREFILL_TOKENS must also be set to 4095.
Docker image
The official Docker image huggingface/optimum-tpu:latest doesn't support Llama 3.1 8B.
We've created a custom image with the fix: dstackai/optimum-tpu:llama31.
Once the pull request is merged, the official Docker image can be used.
Note that int8 quantization still requires the same amount of memory, because the weights are first moved to the TPU in bfloat16 and only then converted to int8. See the pull request for more details.
Once the configuration is ready, run dstack apply -f <configuration file>, and dstack will automatically provision the
cloud resources and run the configuration.
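For example, using the service configuration shown above:

```shell
$ dstack apply -f examples/deployment/tgi/tpu/.dstack.yml
```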