AMD Device Metrics Exporter enables real-time collection of telemetry data in Prometheus format from AMD GPUs in HPC and AI environments. It provides comprehensive metrics including temperature, utilization, memory usage, power consumption, and PCIe bandwidth.
- Prometheus-compatible metrics endpoint
- Rich GPU telemetry data including:
  - Temperature monitoring
  - Utilization metrics
  - Memory usage statistics
  - Power consumption data
  - PCIe bandwidth metrics
  - Performance metrics
- Kubernetes integration via Helm chart
- Slurm integration support
- Configurable service ports
- Container-based deployment
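
Because the endpoint is Prometheus-compatible, any client that understands the Prometheus text exposition format can read it. The following is a minimal sketch of such a client, not part of this project. The metric names in `SAMPLE` (`example_gpu_temperature_celsius`, `example_gpu_power_watts`) and the `gpu_id` label are hypothetical placeholders chosen for illustration; consult the documentation for the metric names the exporter actually emits.

```python
import re
from urllib.request import urlopen

# One sample line looks like: metric_name{label="value",...} 12.5
# The label block is optional; anything after the value (a timestamp) is ignored.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def scrape(url):
    """Fetch a metrics endpoint over HTTP and parse the response body."""
    with urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical payload: these metric and label names are illustrative only.
SAMPLE = """\
# HELP example_gpu_temperature_celsius Hypothetical metric for illustration.
# TYPE example_gpu_temperature_celsius gauge
example_gpu_temperature_celsius{gpu_id="0"} 41.0
example_gpu_temperature_celsius{gpu_id="1"} 43.5
example_gpu_power_watts{gpu_id="0"} 212
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

To read from a running exporter, call `scrape()` with the URL of its metrics endpoint. The host, the port you configured, and the path (`/metrics` is the usual Prometheus convention, assumed here) depend on your deployment.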
For detailed documentation including installation guides, configuration options, and metric descriptions, see the documentation.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.