Why would dcgmi not report Max GPU Memory Used (bytes) only on L4 GPUs when introspecting through process stats? The value shows as 0 on L4, while it is reported successfully on A10 and T4 GPUs in an otherwise identical configuration.
Successfully retrieved process info for PID: 71081. Process ran on 1 GPUs.
+------------------------------------------------------------------------------+
| GPU ID: 2 |
+====================================+=========================================+
|----- Execution Stats ------------+-----------------------------------------|
| Start Time * | Fri Dec 12 00:11:25 2025 |
| End Time * | Still Running |
| Total Execution Time (sec) * | Still Running |
| No. of Conflicting Processes * | 0 |
+----- Performance Stats ----------+-----------------------------------------+
| Energy Consumed (Joules) | 27154 |
| Max GPU Memory Used (bytes) * | 0 |
| SM Clock (MHz) | Avg: 1602, Max: 2040, Min: 1440 |
| Memory Clock (MHz) | Avg: 6251, Max: 6251, Min: 6251 |
| SM Utilization (%) | Avg: 91, Max: 100, Min: 70 |
| Memory Utilization (%) | Avg: 23, Max: 100, Min: 0 |
| PCIe Rx Bandwidth (megabytes) | Avg: N/A, Max: N/A, Min: N/A |
| PCIe Tx Bandwidth (megabytes) | Avg: N/A, Max: N/A, Min: N/A |
+----- Event Stats ----------------+-----------------------------------------+
| Double Bit ECC Errors | 0 |
| PCIe Replay Warnings | 0 |
| Critical XID Errors | 0 |
+----- Slowdown Stats -------------+-----------------------------------------+
| Due to - Power (%) | 3.02787e-06 |
| - Thermal (%) | 0 |
| - Reliability (%) | 0 |
| - Board Limit (%) | 0 |
| - Low Utilization (%) | 0 |
| - Sync Boost (%) | 0 |
+----- Process Utilization --------+-----------------------------------------+
| PID | 71081 |
| Avg SM Utilization (%) | 90 |
| Avg Memory Utilization (%) | 32 |
+----- Overall Health -------------+-----------------------------------------+
| Overall Health | Healthy |
+------------------------------------+-----------------------------------------+
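To run the same check across the A10, T4, and L4 nodes, the text output above can be parsed for the one field in question. This is a minimal sketch that only assumes the `| label | value |` row layout shown in the output above. `parse_dcgmi_stats` is a hypothetical helper name, not part of DCGM, and the sample rows are copied from the output above.

```python
import re

def parse_dcgmi_stats(text):
    """Return {label: value} for each '| label | value |' row of dcgmi text output."""
    stats = {}
    for line in text.splitlines():
        # Data rows have exactly this shape: | Max GPU Memory Used (bytes) * | 0 |
        # Border and section-header rows do not match and are skipped.
        m = re.match(r"^\|\s*(.+?)\s*\|\s*(.+?)\s*\|$", line.strip())
        if not m:
            continue
        # Drop the trailing '*' marker that some labels carry.
        label = m.group(1).rstrip("*").strip()
        stats[label] = m.group(2)
    return stats

# Sample rows copied from the process stats output above.
sample = """\
| Energy Consumed (Joules) | 27154 |
| Max GPU Memory Used (bytes) * | 0 |
| SM Utilization (%) | Avg: 91, Max: 100, Min: 70 |
"""

stats = parse_dcgmi_stats(sample)
max_mem = int(stats["Max GPU Memory Used (bytes)"])
print("max memory missing" if max_mem == 0 else f"max memory: {max_mem} bytes")
```

Feeding it the captured output from each node type makes it easy to confirm that only the L4 nodes in the Kubernetes environment return 0 for this field.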
I originally thought this was because it was using HMM; however, I was able to test this locally on an L4 GPU with Ubuntu 22.04 (6.8 kernel), and it works there.
It does not work in our Kubernetes environment, which uses AL2023 (6.1 kernel) with the same driver version.
I have not been able to find any logs or errors that indicate what is different or missing.
Both environments are on the same driver version:
Fri Dec 12 00:24:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:38:00.0 Off | 0 |
| N/A 35C P8 16W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:3A:00.0 Off | 0 |
| N/A 33C P8 16W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:3C:00.0 Off | 0 |
| N/A 75C P0 67W / 72W | 1732MiB / 23034MiB | 90% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:3E:00.0 Off | 0 |
| N/A 32C P8 16W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 2 N/A N/A 71081 C /app/cuda_test_workload 1724MiB |
+-----------------------------------------------------------------------------------------+