bug: persistent/leaked file handles to MIG devices prevent MIG destroys #276

@bergentruckung

Description

We're sporadically unable to destroy MIG instances (we've seen this happen for both CIs and GIs). The destroy operation fails, reporting that the device is in use, even though nvidia-smi -q shows no running processes. We rely on DCGM for metrics. When the GPU is in this state, restarting DCGM is the only way to destroy the MIG instance, which makes me think DCGM is still holding an open handle to the MIG device.
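To help confirm which process is holding the handle, here is a minimal diagnostic sketch. It walks /proc/<pid>/fd and reports every process with an open file descriptor whose target starts with a given prefix. The function name find_open_handles and the default prefix /dev/nvidia are assumptions made for illustration (the prefix is meant to cover nodes such as /dev/nvidia0 and /dev/nvidia-caps/*); this is not a DCGM or driver API. It needs to run as root to see other users' processes, and a process that holds only an open descriptor may not show up in the nvidia-smi process list.

```python
import os


def find_open_handles(prefix="/dev/nvidia", proc_root="/proc"):
    """Map each PID to the sorted device paths it holds open under `prefix`.

    Both parameters are illustrative defaults; adjust `prefix` to match the
    device nodes on your system.
    """
    holders = {}
    for entry in os.listdir(proc_root):
        # Only numeric directory names under /proc correspond to processes.
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            # Process exited, or we lack permission to inspect it.
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                # The descriptor was closed between listdir and readlink.
                continue
            if target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return {pid: sorted(targets) for pid, targets in holders.items()}


if __name__ == "__main__":
    for pid, targets in sorted(find_open_handles().items()):
        print(pid, ", ".join(targets))
```

Running this right after a failed destroy, and again after restarting DCGM, should show whether the DCGM host engine's PID is the one holding the device node open.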

I don't have a reproducer at the moment, but I'm working on one.

DCGM version: 4.4.2
NVIDIA driver: 580.95.05
GPU: H200
OS: RHEL 8.10 (kernel: 4.18.0-553.89.1.el8_10.x86_64)

Thanks in advance.
