Exporting GPU, cost, and other metrics to Prometheus
Why Prometheus
Effective AI infrastructure management requires full visibility into compute performance and costs. AI researchers need detailed insights into container- and GPU-level performance, while managers rely on cost metrics to track resource usage across projects.
While dstack provides key metrics through its UI and the `dstack metrics` CLI command, teams often need more granular data and prefer to use their own monitoring tools. To support this, we've introduced a new endpoint that exports all collected metrics, covering both fleets and runs, directly to Prometheus in real time.
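To pull these metrics into an existing Prometheus setup, you point a scrape job at the dstack server. The sketch below is a minimal example only; the target host, port, metrics path, and scrape interval are assumptions and should be adjusted to match your deployment.

```yaml
# Minimal Prometheus scrape configuration (a sketch; host, port, path,
# and interval are assumptions, adjust them to your dstack server setup).
scrape_configs:
  - job_name: dstack
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["dstack-server:3000"]
```

Once Prometheus scrapes the endpoint, the fleet and run metrics become available for dashboards and alerting in whatever tooling your team already uses, such as Grafana.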