GCM is a set of tools used to do at-scale monitoring for HPC (High-Performance Computing) clusters, it powers Meta FAIR (Fundamental AI Research) AI workloads across hundreds of thousands of GPUs at ...