No GPU Max Memory Bytes reported on L4 GPU #268

@sidewinder12s

Description

Why would dcgmi not report GPU Max Memory Bytes only on L4 GPUs when introspecting through process stats? It reports this successfully on A10 and T4 GPUs in otherwise the exact same config.

Successfully retrieved process info for PID: 71081. Process ran on 1 GPUs.
+------------------------------------------------------------------------------+
| GPU ID: 2                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                     *   | Fri Dec 12 00:11:25 2025                |
| End Time                       *   | Still Running                           |
| Total Execution Time (sec)     *   | Still Running                           |
| No. of Conflicting Processes   *   | 0                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 27154                                   |
| Max GPU Memory Used (bytes)    *   | 0                                       |
| SM Clock (MHz)                     | Avg: 1602, Max: 2040, Min: 1440         |
| Memory Clock (MHz)                 | Avg: 6251, Max: 6251, Min: 6251         |
| SM Utilization (%)                 | Avg: 91, Max: 100, Min: 70              |
| Memory Utilization (%)             | Avg: 23, Max: 100, Min: 0               |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 3.02787e-06                             |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | 0                                       |
|        - Board Limit (%)           | 0                                       |
|        - Low Utilization (%)       | 0                                       |
|        - Sync Boost (%)            | 0                                       |
+-----  Process Utilization  --------+-----------------------------------------+
| PID                                | 71081                                   |
|     Avg SM Utilization (%)         | 90                                      |
|     Avg Memory Utilization (%)     | 32                                      |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+
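For reference, this is roughly how I collect the process stats above (a sketch; the PID is from the run shown, and I am assuming the default GPU group since no -g is passed):

# Enable process watches before the workload starts (default GPU group assumed)
dcgmi stats -e
# ... run the workload and note its PID ...
# View per-process stats for the workload
dcgmi stats --pid 71081 -v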

I originally thought this was because it was using HMM; however, I was able to test this locally on an L4 GPU with Ubuntu 22.04 (6.8 kernel) and it is working there.

It is not working in our Kubernetes environment, which is using AL2023 (6.1 kernel) and the same driver version.

I've not been able to find any logs or errors that might indicate what is different/missing.
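One thing I still plan to compare between the two hosts, on the assumption that the per-process max memory figure is derived from the driver's process accounting stats, is whether accounting mode is enabled in both places:

# Query accounting mode and buffer size per GPU
nvidia-smi --query-gpu=accounting.mode,accounting.buffer_size --format=csv
# Fuller per-GPU accounting details
nvidia-smi -q -d ACCOUNTING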

Both are on the same driver version:

Fri Dec 12 00:24:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:38:00.0 Off |                    0 |
| N/A   35C    P8             16W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      On  |   00000000:3A:00.0 Off |                    0 |
| N/A   33C    P8             16W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      On  |   00000000:3C:00.0 Off |                    0 |
| N/A   75C    P0             67W /   72W |    1732MiB /  23034MiB |     90%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      On  |   00000000:3E:00.0 Off |                    0 |
| N/A   32C    P8             16W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    2   N/A  N/A           71081      C   /app/cuda_test_workload                1724MiB |
+-----------------------------------------------------------------------------------------+
