GCM (GPU Cluster Monitoring) is a set of tools for at-scale monitoring of HPC (High-Performance Computing) clusters. It powers Meta FAIR (Fundamental AI Research) AI workloads across hundreds of thousands of GPUs at Meta.
GCM is a monorepo with the following components:
- Monitoring: Collects cluster statistics from the Slurm workload scheduler, providing visibility into job performance and resource utilization.
- Health Checks: Verifies the proper functioning of hardware, software, network, storage, and services throughout the job lifecycle.
- Telemetry Processor / GPU Metrics: Enhances OpenTelemetry data by correlating telemetry with Slurm metadata, enabling attribution of metrics (e.g., GPU utilization) to specific jobs and users.
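To make the attribution idea behind the Telemetry Processor concrete, here is a minimal Python sketch of enriching a GPU metric sample with Slurm job metadata. It is an illustration only, not GCM's actual code or API: the `SlurmJob` and `MetricSample` types, the `enrich` function, the `host`/`gpu` attribute keys, and the shape of the `allocations` lookup table are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SlurmJob:
    """Hypothetical subset of Slurm job metadata used for attribution."""
    job_id: str
    user: str


@dataclass
class MetricSample:
    """Hypothetical telemetry sample: a metric name, a value, and attributes."""
    name: str
    value: float
    attributes: dict = field(default_factory=dict)


def enrich(sample, allocations):
    """Return a copy of `sample` annotated with the job that owns its GPU.

    `allocations` maps (host, gpu_index) to a SlurmJob. This lookup table is
    an assumption of the sketch; how such a mapping is built is not shown.
    Samples with no matching allocation are returned unchanged.
    """
    key = (sample.attributes.get("host"), sample.attributes.get("gpu"))
    job = allocations.get(key)
    if job is None:
        return sample
    # Build a new attribute dict so the original sample is not mutated.
    enriched = dict(sample.attributes, job_id=job.job_id, user=job.user)
    return MetricSample(sample.name, sample.value, enriched)


# Example: GPU 0 on node001 is allocated to job 12345, owned by alice.
allocations = {("node001", 0): SlurmJob(job_id="12345", user="alice")}
sample = MetricSample("gpu_utilization", 0.87, {"host": "node001", "gpu": 0})
enriched = enrich(sample, allocations)
print(enriched.attributes)
```

Once a sample carries `job_id` and `user` attributes, downstream dashboards can group GPU utilization by job or by user rather than only by host.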
For more information, check our documentation.
Each component has its own README with detailed guides.
Possible areas for expansion:
- Integration with more GPU types (AMD, Intel, Custom Accelerators)
- Support for additional schedulers beyond Slurm
- Additional Slurm-related monitoring
- Support for new exporters
- Support for Slurm REST API querying
- Support for new health checks
- Distribution via Docker Images and Helm Charts
Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please read the full text so that you can understand what actions will and will not be tolerated.
GPU Cluster Monitoring is actively maintained by Lucca Bertoncini, Caleb Ho, Apostolos Kokolis, Liao Hu, Thanh Nguyen, and Billy Campoli, with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): Jörg Doku, Vivian Peng, Parth Malani, Kalyan Saladi, Shubho Sengupta, Leo Huang, Robert Vincent, Max Wang, Sujit Verma, Teng Li, James Taylor, Xiaodong Ma, Chris Henry, Jakob Johnson, Kareem Sakher, Abinesh Ramakrishnan, Nabib Ahmed, Yong Li, Junjie Qian, David Watson, Guanyu Wu, Jaromir Latal, Samuel Doud, Yidi Wu, Xinyuan Zhang, Neha Saxena.
Feel free to contribute and add your name!
Each GCM component has its own license.
/gcm is licensed under the MIT License.
/shelper is licensed under the MIT License.
/slurmprocessor is licensed under the Apache 2.0 License.
Remaining files are licensed under the MIT License.