We're sporadically unable to destroy MIG instances; we've seen this happen for both compute instances (CIs) and GPU instances (GIs). The destroy operation fails with an error saying the device is in use, but nvidia-smi -q shows no running processes. We rely on DCGM for metrics, and once a GPU is in this state, restarting DCGM is the only way we've found to destroy the MIG instance. That makes me think DCGM is still holding an open handle to the MIG instance.
I don't have a reproducer for this at the moment, but I'm working on putting one together.
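
In case it helps narrow down the open-handle theory, here is a rough sketch of a check that could be run while a GPU is stuck in this state. It walks /proc/<pid>/fd and reports which processes have NVIDIA device nodes open. The function names are made up for illustration, and the /dev/nvidia prefix is an assumption about where the relevant nodes live (including /dev/nvidia-caps/ for MIG capabilities). It needs to run as root to see other users' file descriptors, and an open device node only suggests, rather than proves, that a process is what is blocking the destroy.

```python
import os
from collections import defaultdict


def find_nvidia_handle_holders(proc_root="/proc", prefix="/dev/nvidia"):
    """Return {pid: sorted list of open paths starting with `prefix`}.

    Hypothetical helper: scans the per-process fd symlinks under proc_root.
    """
    holders = defaultdict(set)
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return {}  # proc_root is missing or unreadable
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while we were scanning
            if target.startswith(prefix):
                holders[int(entry)].add(target)
    return {pid: sorted(paths) for pid, paths in holders.items()}


def process_name(pid, proc_root="/proc"):
    """Return the short command name of a PID, or "?" if unavailable."""
    try:
        with open(os.path.join(proc_root, str(pid), "comm")) as f:
            return f.read().strip()
    except OSError:
        return "?"


if __name__ == "__main__":
    # Print one line per process that holds an NVIDIA device node open.
    for pid, paths in sorted(find_nvidia_handle_holders().items()):
        print(pid, process_name(pid), ", ".join(paths))
```

If the DCGM host engine (nv-hostengine) is the only process listed while the destroy is failing, that would be consistent with DCGM holding the handle.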
DCGM version: 4.4.2
NVIDIA driver: 580.95.05
GPU: H200
OS: RHEL 8.10 (kernel: 4.18.0-553.89.1.el8_10.x86_64)
Thanks in advance.