Metrics¶
Prometheus¶
When enabled, dstack
is able to collect various metrics from fleets and runs and export them
to Prometheus.
Setup¶
To enable collecting and exporting metrics to Prometheus,
set the DSTACK_ENABLE_PROMETHEUS_METRICS
environment variable, and point Prometheus to collect metrics
from the <dstack server URL>/metrics
endpoint.
NVIDIA DCGM
NVIDIA DCGM metrics are automatically collected for AWS, Azure, GCP, and OCI backends, as well as for SSH fleets.
To ensure NVIDIA DCGM metrics are collected from SSH fleets, ensure the datacenter-gpu-manager-4-core
,
datacenter-gpu-manager-4-proprietary
, and datacenter-gpu-manager-exporter
packages are installed on the hosts.
Fleets¶
Fleet metrics include metrics for each instance within a fleet. This includes information such as the instance's running time, price, GPU name, and more.
Name | Type | Description | Examples |
---|---|---|---|
dstack_instance_duration_seconds_total |
counter | Total instance runtime in seconds | 1123763.22 |
dstack_instance_price_dollars_per_hour |
gauge | Instance price, USD/hour | 16.0 |
dstack_instance_gpu_count |
gauge | Instance GPU count | 4.0 , 0.0 |
Name | Type | Description | Examples |
---|---|---|---|
dstack_project_name |
string | Project name | main |
dstack_fleet_name |
string? | Fleet name | my-fleet |
dstack_fleet_id |
string? | Fleet ID | 51e837bf-fae9-4a37-ac9c-85c005606c22 |
dstack_instance_name |
string | Instance name | my-fleet-0 |
dstack_instance_id |
string | Instance ID | 8c28c52c-2f94-4a19-8c06-12f1dfee4dd2 |
dstack_instance_type |
string? | Instance type | g4dn.xlarge |
dstack_backend |
string? | Backend | aws , runpod |
dstack_gpu |
string? | GPU name | H100 |
Runs¶
Run metrics include run counters for each user in each project.
Name | Type | Description | Examples |
---|---|---|---|
dstack_run_count_total |
counter | Total runs count | 537 |
dstack_run_count_terminated_total |
counter | Terminated runs count | 118 |
dstack_run_count_failed_total |
counter | Failed runs count | 27 |
dstack_run_count_done_total |
counter | Done runs count | 218 |
Name | Type | Description | Examples |
---|---|---|---|
dstack_project_name |
string | Project name | main |
dstack_user_name |
string | User name | alice |
Run jobs¶
Run job metrics include metrics for each job within a run. This includes information such as job runtime, price, GPU name, DCGM metrics, and more.
Name | Type | Description | Examples |
---|---|---|---|
dstack_job_duration_seconds_total |
counter | Total job runtime in seconds | 520.37 |
dstack_job_price_dollars_per_hour |
gauge | Job instance price, USD/hour | 8.0 |
dstack_job_gpu_count |
gauge | Job GPU count | 2.0 , 0.0 |
dstack_job_cpu_count |
gauge | Job CPU count | 32.0 |
dstack_job_cpu_time_seconds_total |
counter | Total CPU time consumed by the job, seconds | 11.727975 |
dstack_job_memory_total_bytes |
gauge | Total memory allocated for the job, bytes | 4009754624.0 |
dstack_job_memory_usage_bytes |
gauge | Memory used by the job (including cache), bytes | 339017728.0 |
dstack_job_memory_working_set_bytes |
gauge | Memory used by the job (not including cache), bytes | 147251200.0 |
DCGM_FI_DEV_GPU_UTIL |
gauge | GPU utilization (in %) | |
DCGM_FI_DEV_MEM_COPY_UTIL |
gauge | Memory utilization (in %) | |
DCGM_FI_DEV_ENC_UTIL |
gauge | Encoder utilization (in %) | |
DCGM_FI_DEV_DEC_UTIL |
gauge | Decoder utilization (in %) | |
DCGM_FI_DEV_FB_FREE |
gauge | Framebuffer memory free (in MiB) | |
DCGM_FI_DEV_FB_USED |
gauge | Framebuffer memory used (in MiB) | |
DCGM_FI_PROF_GR_ENGINE_ACTIVE |
gauge | The ratio of cycles during which a graphics engine or compute engine remains active | |
DCGM_FI_PROF_SM_ACTIVE |
gauge | The ratio of cycles an SM has at least 1 warp assigned | |
DCGM_FI_PROF_SM_OCCUPANCY |
gauge | The ratio of number of warps resident on an SM | |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE |
gauge | Ratio of cycles the tensor (HMMA) pipe is active | |
DCGM_FI_PROF_PIPE_FP64_ACTIVE |
gauge | Ratio of cycles the fp64 pipes are active | |
DCGM_FI_PROF_PIPE_FP32_ACTIVE |
gauge | Ratio of cycles the fp32 pipes are active | |
DCGM_FI_PROF_PIPE_FP16_ACTIVE |
gauge | Ratio of cycles the fp16 pipes are active | |
DCGM_FI_PROF_PIPE_INT_ACTIVE |
gauge | Ratio of cycles the integer pipe is active | |
DCGM_FI_PROF_DRAM_ACTIVE |
gauge | Ratio of cycles the device memory interface is active sending or receiving data | |
DCGM_FI_PROF_PCIE_TX_BYTES |
counter | The number of bytes of active PCIe tx (transmit) data including both header and payload | |
DCGM_FI_PROF_PCIE_RX_BYTES |
counter | The number of bytes of active PCIe rx (read) data including both header and payload | |
DCGM_FI_DEV_SM_CLOCK |
gauge | SM clock frequency (in MHz) | |
DCGM_FI_DEV_MEM_CLOCK |
gauge | Memory clock frequency (in MHz) | |
DCGM_FI_DEV_MEMORY_TEMP |
gauge | Memory temperature (in C) | |
DCGM_FI_DEV_GPU_TEMP |
gauge | GPU temperature (in C) | |
DCGM_FI_DEV_POWER_USAGE |
gauge | Power draw (in W) | |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION |
counter | Total energy consumption since boot (in mJ) | |
DCGM_FI_DEV_PCIE_REPLAY_COUNTER |
counter | Total number of PCIe retries | |
DCGM_FI_DEV_XID_ERRORS |
gauge | Value of the last XID error encountered | |
DCGM_FI_DEV_POWER_VIOLATION |
counter | Throttling duration due to power constraints (in us) | |
DCGM_FI_DEV_THERMAL_VIOLATION |
counter | Throttling duration due to thermal constraints (in us) | |
DCGM_FI_DEV_SYNC_BOOST_VIOLATION |
counter | Throttling duration due to sync-boost constraints (in us) | |
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION |
counter | Throttling duration due to board limit constraints (in us) | |
DCGM_FI_DEV_LOW_UTIL_VIOLATION |
counter | Throttling duration due to low utilization (in us) | |
DCGM_FI_DEV_RELIABILITY_VIOLATION |
counter | Throttling duration due to reliability constraints (in us) | |
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL |
counter | Total number of single-bit volatile ECC errors | |
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL |
counter | Total number of double-bit volatile ECC errors | |
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL |
counter | Total number of single-bit persistent ECC errors | |
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL |
counter | Total number of double-bit persistent ECC errors | |
DCGM_FI_DEV_RETIRED_SBE |
counter | Total number of retired pages due to single-bit errors | |
DCGM_FI_DEV_RETIRED_DBE |
counter | Total number of retired pages due to double-bit errors | |
DCGM_FI_DEV_RETIRED_PENDING |
counter | Total number of pages pending retirement | |
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS |
counter | Number of remapped rows for uncorrectable errors | |
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS |
counter | Number of remapped rows for correctable errors | |
DCGM_FI_DEV_ROW_REMAP_FAILURE |
gauge | Whether remapping of rows has failed | |
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL |
counter | Total number of NVLink flow-control CRC errors | |
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL |
counter | Total number of NVLink data CRC errors | |
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL |
counter | Total number of NVLink retries | |
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL |
counter | Total number of NVLink recovery errors | |
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL |
counter | Total number of NVLink bandwidth counters for all lanes | |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 |
counter | The number of bytes of active NVLink rx or tx data including both header and payload | |
DCGM_FI_PROF_NVLINK_RX_BYTES |
counter | The number of bytes of active PCIe rx (read) data including both header and payload | |
DCGM_FI_PROF_NVLINK_TX_BYTES |
counter | The number of bytes of active NvLink tx (transmit) data including both header and payload |
Label | Type | Examples | |
---|---|---|---|
dstack_project_name |
string | Project name | main |
dstack_user_name |
string | User name | alice |
dstack_run_name |
string | Run name | nccl-tests |
dstack_run_id |
string | Run ID | 51e837bf-fae9-4a37-ac9c-85c005606c22 |
dstack_job_name |
string | Job name | nccl-tests-0-0 |
dstack_job_id |
string | Job ID | 8c28c52c-2f94-4a19-8c06-12f1dfee4dd2 |
dstack_job_num |
integer | Job number | 0 |
dstack_replica_num |
integer | Replica number | 0 |
dstack_run_type |
string | Run configuration type | task , dev-environment |
dstack_backend |
string | Backend | aws , runpod |
dstack_gpu |
string? | GPU name | H100 |