all-smi provides comprehensive hardware metrics in Prometheus format through its API mode. This document details all available metrics across different hardware platforms.
## Starting API Mode

```shell
# Start API server on TCP port
all-smi api --port 9090

# Custom update interval (default: 3 seconds)
all-smi api --port 9090 --interval 5

# Include process information
all-smi api --port 9090 --processes
```
Metrics are available at `http://localhost:9090/metrics`.
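Each line served at that endpoint follows the Prometheus text exposition format: a metric name, a label set in braces, and a sample value. As a quick sketch (the label values below are illustrative, not real output), the value can be pulled from such a line with standard shell tools:

```shell
# An illustrative exposition-format line (label values are made up)
sample='all_smi_gpu_utilization{gpu_index="0", gpu_name="NVIDIA H100"} 87.5'

# The sample value is the last whitespace-separated field
echo "$sample" | awk '{print $NF}'    # prints: 87.5
```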
### Unix Domain Socket Support (Unix Only)

For local IPC scenarios, API mode supports Unix Domain Sockets:

```shell
# Use default socket path
all-smi api --socket
# Linux: /var/run/all-smi.sock (or /tmp/all-smi.sock)
# macOS: /tmp/all-smi.sock

# Use custom socket path
all-smi api --socket /custom/path/all-smi.sock

# TCP and Unix socket simultaneously
all-smi api --port 9090 --socket

# Unix socket only (disable TCP)
all-smi api --port 0 --socket
```
Security: Socket permissions are set to 0600 (owner-only access).
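One way to sanity-check the owner-only mode is with `stat`. The sketch below simulates the check on a plain file so it runs without a live server (the path is illustrative), and notes how a client would query the real socket with curl:

```shell
# Simulate the 0600 mode all-smi sets on its socket
# (a plain file stands in for the socket so this runs standalone)
sock=/tmp/all-smi-demo.sock
touch "$sock"
chmod 0600 "$sock"
stat -c '%a' "$sock"    # prints: 600 (GNU stat; use `stat -f '%Lp'` on macOS)

# Against a running server, fetch metrics over the socket:
#   curl --unix-socket /var/run/all-smi.sock http://localhost/metrics
```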
## Available Metrics

### GPU Metrics (All Platforms)

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | GPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | GPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | GPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | GPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | GPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | GPU frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | GPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
### Unified AI Acceleration Library Labels

The `all_smi_gpu_info` metric includes standardized labels for AI acceleration libraries across all GPU/accelerator platforms. These unified labels allow platform-agnostic queries and dashboards:

| Label | Description | Example Values |
|---|---|---|
| `lib_name` | Name of the AI acceleration library | CUDA, ROCm, Metal |
| `lib_version` | Version of the AI acceleration library | 13.0, 7.0.2, Metal 3 |
#### Platform-Specific Library Mappings

| Platform | `lib_name` | `lib_version` Source | Platform-Specific Label |
|---|---|---|---|
| NVIDIA GPU | CUDA | CUDA version | `cuda_version` |
| AMD GPU | ROCm | ROCm version | `rocm_version` |
| NVIDIA Jetson | CUDA | CUDA version | `cuda_version` |
| Apple Silicon | Metal | Metal version | N/A |
Note: Platform-specific labels (e.g., cuda_version, rocm_version) are maintained for backward compatibility with existing queries and dashboards.
#### Example PromQL Queries

```promql
# Count devices by AI library type
count by (lib_name) (all_smi_gpu_info)

# Get all CUDA devices with version 12 or higher
all_smi_gpu_info{lib_name="CUDA", lib_version=~"1[2-9].*|[2-9][0-9].*"}

# Alert on outdated ROCm versions (< 7.0)
all_smi_gpu_info{lib_name="ROCm", lib_version!~"[7-9].*"} == 1

# Cross-platform library distribution
sum by (lib_name, lib_version) (all_smi_gpu_info)

# Find all devices using Metal (Apple Silicon)
all_smi_gpu_info{lib_name="Metal"}

# Monitor library version consistency across cluster
count by (lib_name, lib_version) (all_smi_gpu_info) > 1
```
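These queries assume a Prometheus server is already scraping the all-smi endpoint. A minimal scrape configuration might look like the following sketch (the job name, target, and interval are placeholders to adapt):

```yaml
scrape_configs:
  - job_name: "all-smi"                  # placeholder job name
    scrape_interval: 15s                 # keep at or above the all-smi --interval setting
    static_configs:
      - targets: ["localhost:9090"]      # host:port passed to `all-smi api --port 9090`
```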
### NVIDIA GPU Specific Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_pcie_gen_current` | Current PCIe generation | - | `gpu_index`, `gpu_name` |
| `all_smi_gpu_pcie_width_current` | Current PCIe link width | - | `gpu_index`, `gpu_name` |
| `all_smi_gpu_performance_state` | GPU performance state (P0=0, P1=1, etc.) | - | `gpu_index`, `gpu_name` |
| `all_smi_gpu_clock_graphics_max_mhz` | Maximum graphics clock | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_clock_memory_max_mhz` | Maximum memory clock | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_limit_current_watts` | Current power limit | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_limit_max_watts` | Maximum power limit | watts | `gpu_index`, `gpu_name` |
### NVIDIA Jetson Specific Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_dla_utilization` | DLA (Deep Learning Accelerator) utilization | percent | `gpu_index`, `gpu_name` |
### AMD GPU Specific Metrics

AMD GPUs (Radeon and Instinct series) provide comprehensive monitoring through ROCm and the DRM subsystem:

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_fan_speed_rpm` | GPU fan speed | RPM | `gpu_index`, `gpu_name` |
| `all_smi_amd_rocm_version` | AMD ROCm version installed | info | `instance`, `version` |
| `all_smi_gpu_memory_gtt_bytes` | GTT (GPU Translation Table) memory usage | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_vram_bytes` | VRAM (Video RAM) usage | bytes | `gpu_index`, `gpu_name` |
**Additional Details Available** (in `all_smi_gpu_info` labels):

- **Driver Version**: AMDGPU kernel driver version (e.g., "30.10.1")
- **ROCm Version**: ROCm software stack version (e.g., "7.0.2")
- **PCIe Information**: Current link generation and width, plus maximum GPU/system link capabilities
- **VBIOS**: Version and date information
- **Power Management**: Current, minimum, and maximum power cap values
- **ASIC Information**: Device ID, revision ID, ASIC name
- **Memory Clock**: Current memory clock frequency

**Process Tracking:**

- AMD GPU process detection uses fdinfo from `/proc/<pid>/fdinfo/` for accurate memory tracking
- Tracks both VRAM and GTT memory usage per process
- Available with the `--processes` flag in API mode

**Platform Requirements:**

- Requires ROCm drivers and the `libamdgpu_top` library
- Requires sudo access to `/dev/dri` devices, or a user in the `video`/`render` groups
- Only available in glibc builds (not musl static builds)
### Apple Silicon GPU Specific Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_ane_utilization` | ANE utilization | mW | `gpu_index`, `gpu_name` |
| `all_smi_ane_power_watts` | ANE power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_thermal_pressure_info` | Thermal pressure level | info | `gpu_index`, `gpu_name`, `level` |
Note: For Apple Silicon (M1/M2/M3/M4), `all_smi_gpu_temperature_celsius` is not available; the thermal pressure level is provided instead.
Note: Tenstorrent NPUs use the same basic metric names as GPUs for compatibility with existing monitoring infrastructure. Additional Tenstorrent-specific metrics provide detailed hardware monitoring capabilities.
### Rebellions NPU Metrics

#### Basic NPU Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | NPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
#### Rebellions-Specific Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_rebellions_device_info` | Device model and variant information | info | `npu`, `instance`, `uuid`, `index`, `model`, `variant` |
| `all_smi_rebellions_firmware_info` | NPU firmware version | info | `npu`, `instance`, `uuid`, `index`, `firmware_version` |
| `all_smi_rebellions_kmd_info` | Kernel Mode Driver version | info | `npu`, `instance`, `uuid`, `index`, `kmd_version` |
| `all_smi_rebellions_device_status` | Device operational status | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_performance_state` | NPU performance state (P0-P15) | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_pcie_generation` | PCIe generation (Gen4) | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_pcie_width` | PCIe link width (x16) | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_memory_bandwidth_gbps` | Memory bandwidth capacity | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_rebellions_compute_tops` | Compute capacity in TOPS | gauge | `npu`, `instance`, `uuid`, `index` |
Note: Rebellions NPUs support ATOM, ATOM+, and ATOM Max variants with varying compute and memory capabilities. All variants use PCIe Gen4 x16 interface.
### Furiosa NPU Metrics

#### Basic NPU Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | NPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
#### Furiosa-Specific Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_furiosa_device_info` | Device architecture and model info | info | `npu`, `instance`, `uuid`, `index`, `architecture`, `model` |
| `all_smi_furiosa_firmware_info` | NPU firmware version | info | `npu`, `instance`, `uuid`, `index`, `firmware_version` |
| `all_smi_furiosa_pert_info` | PERT (runtime) version | info | `npu`, `instance`, `uuid`, `index`, `pert_version` |
| `all_smi_furiosa_liveness_status` | Device liveness status | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_core_count` | Number of cores in NPU | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_core_status` | Core availability status | gauge | `npu`, `instance`, `uuid`, `index`, `core` |
| `all_smi_furiosa_pe_utilization` | Processing Element utilization | percent | `npu`, `instance`, `uuid`, `index`, `core` |
| `all_smi_furiosa_core_frequency_mhz` | Per-core frequency | MHz | `npu`, `instance`, `uuid`, `index`, `core` |
| `all_smi_furiosa_power_governor_info` | Power governor mode | info | `npu`, `instance`, `uuid`, `index`, `governor` |
| `all_smi_furiosa_error_count` | Cumulative error count | counter | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_pcie_generation` | PCIe generation | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_pcie_width` | PCIe link width | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_furiosa_memory_bandwidth_utilization` | Memory bandwidth utilization | percent | `npu`, `instance`, `uuid`, `index` |
Note: Furiosa NPUs use the RNGD architecture with 8 cores per NPU. Each core contains multiple Processing Elements (PEs) that handle neural network computations. The power governor supports OnDemand mode for dynamic power management.
### Intel Gaudi NPU Metrics

#### Basic NPU Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | NPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | NPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | NPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | NPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | NPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | NPU clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | NPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
#### Intel Gaudi-Specific Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gaudi_device_info` | Device model and information | info | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_internal_name_info` | Internal device name (e.g., HL-325L) | info | `npu`, `instance`, `uuid`, `index`, `internal_name` |
| `all_smi_gaudi_driver_info` | Habana driver version | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_gaudi_aip_utilization_percent` | AIP (AI Processor) utilization | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_memory_used_bytes` | HBM memory used | bytes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_memory_total_bytes` | HBM total memory | bytes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_memory_utilization_percent` | HBM memory utilization percentage | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_power_draw_watts` | Current power consumption | watts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_power_max_watts` | Maximum power limit | watts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_power_utilization_percent` | Power utilization percentage | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_gaudi_temperature_celsius` | AIP temperature | celsius | `npu`, `instance`, `uuid`, `index` |
Note: Intel Gaudi NPUs (Gaudi 1/2/3) are monitored via the hl-smi command-line tool running as a background process. Device names are automatically mapped from internal identifiers (e.g., HL-325L) to human-friendly names (e.g., Intel Gaudi 3 PCIe LP). The tool supports various form factors including PCIe, OAM, UBB, and HLS variants.
### Google TPU Metrics

#### Basic NPU Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_gpu_utilization` | TPU utilization percentage | percent | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_used_bytes` | TPU memory used | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_memory_total_bytes` | TPU memory total | bytes | `gpu_index`, `gpu_name` |
| `all_smi_gpu_temperature_celsius` | TPU temperature | celsius | `gpu_index`, `gpu_name` |
| `all_smi_gpu_power_consumption_watts` | TPU power consumption | watts | `gpu_index`, `gpu_name` |
| `all_smi_gpu_frequency_mhz` | TPU clock frequency | MHz | `gpu_index`, `gpu_name` |
| `all_smi_gpu_info` | TPU device information | info | `gpu_index`, `gpu_name`, `driver_version` |
#### TPU-Specific Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_tpu_utilization_percent` | TPU duty cycle utilization | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_memory_used_bytes` | TPU HBM memory used | bytes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_memory_total_bytes` | TPU HBM memory total | bytes | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_memory_utilization_percent` | TPU HBM memory utilization percentage | percent | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_chip_version_info` | TPU chip version information | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_tpu_accelerator_type_info` | TPU accelerator type information | info | `npu`, `instance`, `uuid`, `index`, `type` |
| `all_smi_tpu_core_count` | Number of TPU cores | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_tensorcore_count` | Number of TensorCores per chip | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_memory_type_info` | TPU memory type (HBM2/HBM2e/HBM3e) | info | `npu`, `instance`, `uuid`, `index`, `type` |
| `all_smi_tpu_runtime_version_info` | TPU runtime/library version | info | `npu`, `instance`, `uuid`, `index`, `version` |
| `all_smi_tpu_power_max_watts` | TPU maximum power limit | watts | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_queue_size` | Number of pending HLO programs | gauge | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_mean_microseconds` | HLO execution timing (mean) | µs | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_p50_microseconds` | HLO execution timing (P50) | µs | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_p90_microseconds` | HLO execution timing (P90) | µs | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_p95_microseconds` | HLO execution timing (P95) | µs | `npu`, `instance`, `uuid`, `index` |
| `all_smi_tpu_hlo_exec_p999_microseconds` | HLO execution timing (P99.9) | µs | `npu`, `instance`, `uuid`, `index` |
Note: Google Cloud TPUs (v2-v7/Ironwood) are monitored via the tpu-info command-line tool running in streaming mode. Metrics include duty cycle utilization, HBM memory tracking, and chip configuration details.
### CPU Metrics (All Platforms)

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_cpu_utilization` | CPU utilization percentage | percent | - |
| `all_smi_cpu_socket_count` | Number of CPU sockets | count | - |
| `all_smi_cpu_core_count` | Total number of CPU cores | count | - |
| `all_smi_cpu_thread_count` | Total number of CPU threads | count | - |
| `all_smi_cpu_frequency_mhz` | CPU frequency | MHz | - |
| `all_smi_cpu_temperature_celsius` | CPU temperature | celsius | - |
| `all_smi_cpu_power_consumption_watts` | CPU power consumption | watts | - |
| `all_smi_cpu_socket_utilization` | Per-socket CPU utilization | percent | `socket` |
### Apple Silicon CPU Specific Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_cpu_p_core_count` | Number of performance cores | count | - |
| `all_smi_cpu_e_core_count` | Number of efficiency cores | count | - |
| `all_smi_cpu_gpu_core_count` | Number of integrated GPU cores | count | - |
| `all_smi_cpu_p_core_utilization` | P-core utilization percentage | percent | - |
| `all_smi_cpu_e_core_utilization` | E-core utilization percentage | percent | - |
| `all_smi_cpu_p_cluster_frequency_mhz` | P-cluster frequency | MHz | - |
| `all_smi_cpu_e_cluster_frequency_mhz` | E-cluster frequency | MHz | - |
### Memory Metrics (All Platforms)

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_memory_total_bytes` | Total system memory | bytes | - |
| `all_smi_memory_used_bytes` | Used system memory | bytes | - |
| `all_smi_memory_available_bytes` | Available system memory | bytes | - |
| `all_smi_memory_free_bytes` | Free system memory | bytes | - |
| `all_smi_memory_utilization` | Memory utilization percentage | percent | - |
| `all_smi_swap_total_bytes` | Total swap space | bytes | - |
| `all_smi_swap_used_bytes` | Used swap space | bytes | - |
| `all_smi_swap_free_bytes` | Free swap space | bytes | - |
### Linux-Specific Memory Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_memory_buffers_bytes` | Memory used for buffers | bytes | - |
| `all_smi_memory_cached_bytes` | Memory used for cache | bytes | - |
### Storage Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_disk_total_bytes` | Total disk space | bytes | `mount_point` |
| `all_smi_disk_available_bytes` | Available disk space | bytes | `mount_point` |
Note: Storage metrics exclude Docker bind mounts and are filtered to show only relevant filesystems.
### Chassis/Node-Level Metrics

Chassis metrics provide visibility into system-wide power consumption, thermal conditions, and cooling status at the node level. These metrics aggregate information from CPU, GPU, ANE, and BMC sensors.

#### Common Chassis Metrics (All Platforms)

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_chassis_power_watts` | Total chassis power consumption (CPU+GPU+ANE) | watts | `hostname`, `instance` |
#### Apple Silicon Chassis Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_chassis_thermal_pressure_info` | Thermal pressure level | info | `hostname`, `instance`, `level` |
| `all_smi_chassis_cpu_power_watts` | CPU power consumption | watts | `hostname`, `instance` |
| `all_smi_chassis_gpu_power_watts` | GPU power consumption | watts | `hostname`, `instance` |
| `all_smi_chassis_ane_power_watts` | ANE (Apple Neural Engine) power | watts | `hostname`, `instance` |
#### Server Chassis Metrics (BMC-enabled Systems)

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_chassis_inlet_temperature_celsius` | Chassis inlet temperature | celsius | `hostname`, `instance` |
| `all_smi_chassis_outlet_temperature_celsius` | Chassis outlet temperature | celsius | `hostname`, `instance` |
| `all_smi_chassis_fan_speed_rpm` | Fan speed | RPM | `hostname`, `instance`, `fan_id`, `fan_name` |
Note: Chassis metrics provide a unified view of node-level power consumption and thermal conditions, useful for cluster-wide capacity planning and power monitoring.
### Runtime Environment Metrics

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `all_smi_runtime_environment` | Current runtime environment (container or VM) | gauge | `hostname`, `environment` |
| `all_smi_container_runtime_info` | Container runtime environment information | gauge | `hostname`, `runtime`, `container_id` |
| `all_smi_kubernetes_pod_info` | Kubernetes pod information (K8s only) | gauge | `hostname`, `pod_name`, `namespace` |
| `all_smi_virtualization_info` | Virtualization environment information | gauge | `hostname`, `vm_type`, `hypervisor` |
Runtime environment metrics are detected at startup and provide information about the execution context:
*Apple Silicon (M1/M2/M3/M4) GPU metrics do not include temperature (thermal pressure provided instead)
**Apple Silicon (M1/M2/M3/M4) provides enhanced P-core/E-core metrics and cluster frequencies
***Tenstorrent provides extensive hardware monitoring including multiple temperature sensors, health counters, and status registers
****Tenstorrent NPUs do not expose per-process GPU usage information
*****Rebellions NPUs do not expose per-process GPU usage information
******Furiosa NPUs do not expose per-process GPU usage information
*******Intel Gaudi NPUs do not expose per-process GPU usage information via hl-smi
********Google Cloud TPUs do not expose per-process GPU usage information via tpu-info
## Example Prometheus Queries

### Basic Monitoring

```promql
# Average GPU utilization across all GPUs
avg(all_smi_gpu_utilization)

# Memory usage percentage per GPU
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) * 100

# GPUs running above 80°C
all_smi_gpu_temperature_celsius > 80
```
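A threshold query such as the temperature check above can be promoted to a Prometheus alerting rule. This is an illustrative sketch (the group name, alert name, and 5m hold period are arbitrary choices, not part of all-smi):

```yaml
groups:
  - name: all-smi-gpu-alerts             # illustrative rule group name
    rules:
      - alert: GpuTemperatureHigh        # illustrative alert name
        expr: all_smi_gpu_temperature_celsius > 80
        for: 5m                          # only fire after 5 minutes above threshold
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu_index }} ({{ $labels.gpu_name }}) above 80°C"
```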
### Power Monitoring

```promql
# Total power consumption across all GPUs
sum(all_smi_gpu_power_consumption_watts)

# Power efficiency (utilization per watt)
all_smi_gpu_utilization / all_smi_gpu_power_consumption_watts
```
### AMD GPU Specific

```promql
# AMD GPUs with high fan speed (potential cooling issues)
all_smi_gpu_fan_speed_rpm > 3000

# VRAM utilization percentage
(all_smi_gpu_memory_vram_bytes / all_smi_gpu_memory_total_bytes) * 100

# AMD GPUs approaching power cap
all_smi_gpu_power_consumption_watts / all_smi_amd_power_cap_watts > 0.9

# Total GPU memory usage (VRAM + GTT)
all_smi_gpu_memory_vram_bytes + all_smi_gpu_memory_gtt_bytes

# AMD GPU thermal efficiency (utilization per degree)
all_smi_gpu_utilization / all_smi_gpu_temperature_celsius
```
### Apple Silicon Specific

```promql
# P-core vs E-core utilization comparison
all_smi_cpu_p_core_utilization - all_smi_cpu_e_core_utilization

# ANE power consumption in watts
all_smi_ane_power_watts
```
### Tenstorrent NPU Specific

```promql
# NPUs with high temperature on any sensor
max by (instance) ({
  __name__=~"all_smi_tenstorrent_.*_temperature_celsius",
  instance=~"tt.*"
}) > 80

# Power efficiency by board type
all_smi_gpu_utilization / on(instance) group_left(board_type)
  (all_smi_tenstorrent_board_info * 0 + all_smi_gpu_power_consumption_watts)

# Throttling detection
all_smi_tenstorrent_throttler > 0

# Health monitoring - ARC processors not incrementing
rate(all_smi_tenstorrent_arc0_health[5m]) == 0
```
### Rebellions NPU Specific

```promql
# NPUs in a reduced performance state
all_smi_rebellions_performance_state > 0

# Devices with non-operational status
all_smi_rebellions_device_status != 1

# Power efficiency (TOPS per watt)
all_smi_rebellions_compute_tops / all_smi_gpu_power_consumption_watts

# Memory saturation check (> 90% of capacity used)
(all_smi_gpu_memory_used_bytes / all_smi_gpu_memory_total_bytes) > 0.9
```
### Furiosa NPU Specific

```promql
# NPUs with unavailable cores
all_smi_furiosa_core_status == 0

# Average PE utilization across all cores
avg by (instance) (all_smi_furiosa_pe_utilization)

# NPUs with high error rates
rate(all_smi_furiosa_error_count[5m]) > 0.1

# Power governor not in OnDemand mode
all_smi_furiosa_power_governor_info{governor!="OnDemand"}

# Memory bandwidth bottleneck detection
all_smi_furiosa_memory_bandwidth_utilization > 80
```
### Intel Gaudi NPU Specific

```promql
# NPUs with high AIP utilization
all_smi_gaudi_aip_utilization_percent > 80

# HBM memory utilization across cluster
avg by (instance) (all_smi_gaudi_memory_utilization_percent)

# NPUs approaching power limit
all_smi_gaudi_power_draw_watts / all_smi_gaudi_power_max_watts > 0.9

# Power efficiency (AIP utilization per watt)
all_smi_gaudi_aip_utilization_percent / all_smi_gaudi_power_draw_watts

# NPUs running hot (temperature > 70°C)
all_smi_gaudi_temperature_celsius > 70

# Total HBM memory usage across all Gaudi NPUs
sum(all_smi_gaudi_memory_used_bytes)

# Gaudi NPUs by device variant
count by (internal_name) (all_smi_gaudi_internal_name_info)

# Driver version consistency check
count by (version) (all_smi_gaudi_driver_info) > 1
```
### Google TPU Specific

```promql
# TPU utilization across all chips
avg(all_smi_tpu_utilization_percent)

# HBM memory utilization percentage
all_smi_tpu_memory_utilization_percent

# Count TPUs by accelerator type
count by (type) (all_smi_tpu_accelerator_type_info)

# Monitor HLO queue size
all_smi_tpu_hlo_queue_size > 5

# Alert on high HLO execution latency (> 1s at P90)
all_smi_tpu_hlo_exec_p90_microseconds > 1000000
```
### Process Monitoring

```promql
# Top 5 GPU memory consumers
topk(5, all_smi_gpu_process_memory_bytes)

# Processes using more than 1 GB of GPU memory
all_smi_gpu_process_memory_bytes > 1073741824
```
### Chassis/Node-Level Monitoring

```promql
# Total power consumption across all nodes
sum(all_smi_chassis_power_watts)

# Nodes with high power consumption (> 3000 W)
all_smi_chassis_power_watts > 3000

# Power breakdown by component (Apple Silicon)
sum by (hostname) (all_smi_chassis_cpu_power_watts)
sum by (hostname) (all_smi_chassis_gpu_power_watts)
sum by (hostname) (all_smi_chassis_ane_power_watts)

# Nodes with non-nominal thermal pressure
all_smi_chassis_thermal_pressure_info{level!="Nominal"}

# Average chassis power per node
avg(all_smi_chassis_power_watts)

# Nodes with high inlet temperature
all_smi_chassis_inlet_temperature_celsius > 35

# Delta between inlet and outlet temperature (thermal dissipation)
all_smi_chassis_outlet_temperature_celsius - all_smi_chassis_inlet_temperature_celsius

# Fan speed monitoring
avg by (hostname) (all_smi_chassis_fan_speed_rpm)
```
### Runtime Environment Monitoring

```promql
# All containers running in Kubernetes
all_smi_container_runtime_info{runtime="Kubernetes"}

# All instances running in AWS EC2
all_smi_virtualization_info{vm_type="AWS EC2"}

# Containers running in Backend.AI
all_smi_runtime_environment{environment="Backend.AI"}

# Group GPU utilization by runtime environment (join on hostname, then aggregate)
sum by (environment) (
  all_smi_gpu_utilization
  * on(hostname) group_left(environment) all_smi_runtime_environment
)
```