Open
Description
Seeing a reproducible nvbandwidth failure on DCGM 4.5.0 when running runlevel 3 (and likely 4) on H100x8 NVSwitch nodes.
Symptom:
- sudo dcgmi diag -r 3 -> nvbandwidth FAIL with:
  "The memory copy utilization for the GPU: X is greater than 10%. This may affect the results of the nvbandwidth test."
- Same node, same conditions: dcgmi diag -r nvbandwidth -> PASS
This happens on idle hosts (no jobs, no background burn-in agents, and nvidia-smi shows no processes). The GPU index that triggers the error varies between runs (GPU 2, GPU 3, GPU 5, etc.).
The debug log suggests an ordering/cooldown problem:
- The Diagnostic plugin (GpuBurnWorker) finishes, then nvbandwidth starts almost immediately.
Example:
09:54:30.303 Test nvbandwidth start
09:54:30.346 ERROR [[nvbandwidth]] MCUTIL > 10% for GPU 2 (NVBandwidthPlugin.cpp:275)
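To put a number on "almost immediately", here is a small sketch that computes the gaps from the timestamps in the debug log. The helper name `gap_ms` is mine, not anything from DCGM; the three timestamps are copied from the log in this report (last GpuBurnWorker computeTime line, "Test nvbandwidth start", and the MCUTIL error).

```python
from datetime import datetime


def gap_ms(start: str, end: str) -> float:
    """Milliseconds between two HH:MM:SS.mmm timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() * 1000.0


if __name__ == "__main__":
    # "Test nvbandwidth start" -> MCUTIL error
    print(f"{gap_ms('09:54:30.303', '09:54:30.346'):.0f} ms")  # 43 ms
    # last GpuBurnWorker computeTime line -> MCUTIL error
    print(f"{gap_ms('09:54:29.586', '09:54:30.346'):.0f} ms")  # 760 ms
```

So the utilization check fires roughly 43 ms after nvbandwidth starts and under a second after the last burn worker logged its final compute time.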
Extra Notes:
- DCGM 4.4.2 passes on the same node where 4.5.0 fails.
Questions:
- Is MCUTIL>10% supposed to hard-fail nvbandwidth in runlevel runs (message says "may affect")?
- If not, can nvbandwidth retry/wait for MCUTIL to settle, or can runlevel sequencing include a cooldown between Diagnostic and nvbandwidth?
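For reference, this is roughly the kind of "wait for MCUTIL to settle" behavior I have in mind, written as an external wrapper sketch rather than plugin code (I can't see the 4.5.0 plugin source). Assumptions: the function names are hypothetical, the 10% threshold is taken from the error message, and I'm assuming nvidia-smi's `utilization.memory` query field tracks the same memory copy utilization that the plugin checks.

```python
import subprocess
import time
from typing import Callable, Dict

MCUTIL_THRESHOLD = 10  # percent, from the error message in this report


def parse_mem_util(csv_text: str) -> Dict[int, int]:
    """Parse 'index, utilization.memory' CSV rows into {gpu_index: percent}."""
    result: Dict[int, int] = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        idx, util = (field.strip() for field in line.split(","))
        if util.isdigit():  # skip values such as "[N/A]"
            result[int(idx)] = int(util)
    return result


def sample_mem_util() -> Dict[int, int]:
    """Query per-GPU memory utilization through nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.memory",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_mem_util(out)


def wait_until_settled(
    sample: Callable[[], Dict[int, int]],
    threshold: int = MCUTIL_THRESHOLD,
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every GPU is at or below threshold; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if all(util <= threshold for util in sample().values()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_settled(sample_mem_util)` before a standalone `dcgmi diag -r nvbandwidth` would approximate a cooldown from outside DCGM. It does not help inside a runlevel 3 run, where the sequencing is internal, which is why I'm asking about a retry or cooldown in the plugin itself.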
Note:
- 4.5.0 binaries are available via NVIDIA CUDA Ubuntu repo (datacenter-gpu-manager 1:4.5.0-1), but I don't see a corresponding v4.5.0 tag/source snapshot in this GitHub repo yet, so I can’t check the plugin logic changes. I’m just reporting the observed behavior.
Debug snippets below.
2026-01-16 09:51:36.803 DEBUG [3279516:3280237] [[diagnostic]] computeTime: 35.27699112892151, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:37.076 DEBUG [3279516:3279516] [[diagnostic]] Thread 140545372317248 had !m_hasExited [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:232] [DcgmThread::Wait]
2026-01-16 09:51:37.092 DEBUG [3279516:3280408] [[diagnostic]] computeTime: 9.4824960231781, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:37.717 DEBUG [3279516:3280365] [[diagnostic]] computeTime: 19.259401082992554, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:37.962 DEBUG [3279516:3280408] [[diagnostic]] computeTime: 10.3525869846344, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:38.129 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542872622656 had !m_hasExited [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:232] [DcgmThread::Wait]
2026-01-16 09:51:38.488 DEBUG [3279516:3280324] [[diagnostic]] computeTime: 28.506460905075073, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:38.608 DEBUG [3279516:3280365] [[diagnostic]] computeTime: 20.1501681804657, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:39.182 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542864229952 had !m_hasExited [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:232] [DcgmThread::Wait]
2026-01-16 09:51:39.361 DEBUG [3279516:3280324] [[diagnostic]] computeTime: 29.379581928253174, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:39.370 DEBUG [3279516:3280376] [[diagnostic]] computeTime: 16.359700918197632, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:39.589 DEBUG [3279516:3280237] [[diagnostic]] computeTime: 38.063048124313354, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:40.226 DEBUG [3279516:3280331] [[diagnostic]] computeTime: 25.997331142425537, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:40.235 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542855837248 had !m_hasExited [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:232] [DcgmThread::Wait]
2026-01-16 09:51:40.455 DEBUG [3279516:3280237] [[diagnostic]] computeTime: 38.92840814590454, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:41.195 DEBUG [3279516:3280318] [[diagnostic]] computeTime: 35.44190001487732, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:41.288 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542637766208 had !m_hasExited [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:232] [DcgmThread::Wait]
2026-01-16 09:51:42.196 DEBUG [3279516:3280376] [[diagnostic]] computeTime: 19.185595989227295, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:42.265 DEBUG [3279516:3279965] [[diagnostic]] computeTime: 45.01917600631714, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:42.341 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542629373504 had !m_hasExited [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:232] [DcgmThread::Wait]
2026-01-16 09:51:43.068 DEBUG [3279516:3280331] [[diagnostic]] computeTime: 28.839499950408936, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:51:43.078 DEBUG [3279516:3280376] [[diagnostic]] computeTime: 20.06779384613037, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:50:02.419 INFO [3279516:3279516] Created thread named "" ID 1908102720 DcgmThread ptr 0x0x141733f0 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:119] [DcgmThread::Start]
2026-01-16 09:50:02.419 DEBUG [3279516:3279541] Thread handle 1908102720 running [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:305] [DcgmThread::RunInternal]
2026-01-16 09:50:02.419 DEBUG [3279516:3279541] HangDetectMonitor: Running [/builds/dcgm/dcgm/common/HangDetectMonitor.h:294] [HangDetectMonitor::run]
2026-01-16 09:50:02.419 DEBUG [3279516:3279541] HangDetectMonitor: Successfully enqueued periodic task check [/builds/dcgm/dcgm/common/HangDetectMonitor.h:311] [HangDetectMonitor::run]
2026-01-16 09:50:02.423 DEBUG [3279516:3279516] Test CUDA Main Library start [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:214] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.423 DEBUG [3279516:3279516] Set global parameter do_test -> libraries_cuda. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:50:02.423 DEBUG [3279516:3279516] Test CUDA Main Library had result 0. Configless is true [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:222] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.423 DEBUG [3279516:3279516] Test Denylist start [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:214] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.423 DEBUG [3279516:3279516] Set global parameter do_test -> denylist. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Test Denylist had result 0. Configless is true [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:222] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Test Environmental Variables start [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:214] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Set global parameter do_test -> env_variables. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable NSIGHT_CUDA_DEBUGGER not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable CUDA_INJECTION32_PATH not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable CUDA_INJECTION64_PATH not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable CUDA_AUTO_BOOST not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable CUDA_ENABLE_COREDUMP_ON_EXCEPTION not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable CUDA_COREDUMP_FILE not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable CUDA_DEVICE_WAITS_ON_EXCEPTION not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable CUDA_PROFILE not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable COMPUTE_PROFILE not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Env Variable OPENCL_PROFILE not found (GOOD) [/builds/dcgm/dcgm/nvvs/src/Software.cpp:951] [Software::checkForBadEnvVaribles]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Test Environmental Variables had result 0. Configless is true [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:222] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Test Fabric Manager start [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:214] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.427 DEBUG [3279516:3279516] Set global parameter do_test -> fabric_manager. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:50:02.429 DEBUG [3279516:3279516] Fabric manager successfully started for GPU 0 [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1003] [Software::checkFabricManager]
2026-01-16 09:50:02.430 DEBUG [3279516:3279516] Fabric manager successfully started for GPU 1 [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1003] [Software::checkFabricManager]
2026-01-16 09:50:02.430 DEBUG [3279516:3279516] Fabric manager successfully started for GPU 2 [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1003] [Software::checkFabricManager]
2026-01-16 09:50:02.431 DEBUG [3279516:3279516] Fabric manager successfully started for GPU 3 [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1003] [Software::checkFabricManager]
2026-01-16 09:50:02.431 DEBUG [3279516:3279516] Fabric manager successfully started for GPU 4 [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1003] [Software::checkFabricManager]
2026-01-16 09:50:02.432 DEBUG [3279516:3279516] Fabric manager successfully started for GPU 5 [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1003] [Software::checkFabricManager]
2026-01-16 09:50:02.432 DEBUG [3279516:3279516] Fabric manager successfully started for GPU 6 [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1003] [Software::checkFabricManager]
2026-01-16 09:50:02.437 DEBUG [3279516:3279516] Fabric manager successfully started for GPU 7 [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1003] [Software::checkFabricManager]
2026-01-16 09:50:02.437 DEBUG [3279516:3279516] Test Fabric Manager had result 0. Configless is true [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:222] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.437 DEBUG [3279516:3279516] Test Graphics Processes start [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:214] [SoftwarePluginFramework::Run]
2026-01-16 09:50:02.437 DEBUG [3279516:3279516] Set global parameter do_test -> graphics_processes. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:50:02.437 WARN [3279516:3279516] Error getting the graphics pids for GPU 0. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:598] [Software::checkForGraphicsProcesses]
2026-01-16 09:50:02.437 INFO [3279516:3279516] Test software: Error getting the graphics pids for GPU 0. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/PluginTest.cpp:221] [PluginTest::AddInfo]
2026-01-16 09:50:02.437 WARN [3279516:3279516] Error getting the graphics pids for GPU 1. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:598] [Software::checkForGraphicsProcesses]
2026-01-16 09:50:02.437 INFO [3279516:3279516] Test software: Error getting the graphics pids for GPU 1. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/PluginTest.cpp:221] [PluginTest::AddInfo]
2026-01-16 09:50:02.438 WARN [3279516:3279516] Error getting the graphics pids for GPU 2. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:598] [Software::checkForGraphicsProcesses]
2026-01-16 09:50:02.438 INFO [3279516:3279516] Test software: Error getting the graphics pids for GPU 2. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/PluginTest.cpp:221] [PluginTest::AddInfo]
2026-01-16 09:50:02.438 WARN [3279516:3279516] Error getting the graphics pids for GPU 3. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:598] [Software::checkForGraphicsProcesses]
2026-01-16 09:50:02.438 INFO [3279516:3279516] Test software: Error getting the graphics pids for GPU 3. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/PluginTest.cpp:221] [PluginTest::AddInfo]
2026-01-16 09:50:02.438 WARN [3279516:3279516] Error getting the graphics pids for GPU 4. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:598] [Software::checkForGraphicsProcesses]
2026-01-16 09:50:02.438 INFO [3279516:3279516] Test software: Error getting the graphics pids for GPU 4. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/PluginTest.cpp:221] [PluginTest::AddInfo]
2026-01-16 09:50:02.438 WARN [3279516:3279516] Error getting the graphics pids for GPU 5. Status = -32 skipping check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:598] [Software::checkForGraphicsProcesses]
2026-01-16 09:50:03.007 WARN [3279516:3279516] gpuId 5 returned status 0, value 9223372036854775792 for DCGM_FI_DEV_MEMORY_UNREPAIRABLE_FLAG. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:1049] [Software::checkUnrepairableMemory]
2026-01-16 09:50:03.007 DEBUG [3279516:3279516] Set global parameter do_test -> libraries_nvml. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:50:03.007 DEBUG [3279516:3279516] Test NVML Library had result 0. Configless is true [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:222] [SoftwarePluginFramework::Run]
2026-01-16 09:50:03.007 DEBUG [3279516:3279516] Test Page Retirement/Row Remap start [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:214] [SoftwarePluginFramework::Run]
2026-01-16 09:50:03.007 DEBUG [3279516:3279516] Set global parameter do_test -> page_retirement. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:50:03.008 WARN [3279516:3279516] gpuId 0 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_PENDING. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:688] [Software::checkPageRetirement]
2026-01-16 09:50:03.008 WARN [3279516:3279516] gpuId 0 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_DBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:729] [Software::checkPageRetirement]
2026-01-16 09:50:03.008 WARN [3279516:3279516] gpuId 0 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_SBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:750] [Software::checkPageRetirement]
2026-01-16 09:50:03.009 WARN [3279516:3279516] gpuId 1 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_PENDING. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:688] [Software::checkPageRetirement]
2026-01-16 09:50:03.009 WARN [3279516:3279516] gpuId 1 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_DBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:729] [Software::checkPageRetirement]
2026-01-16 09:50:03.009 WARN [3279516:3279516] gpuId 1 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_SBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:750] [Software::checkPageRetirement]
2026-01-16 09:50:03.010 WARN [3279516:3279516] gpuId 2 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_PENDING. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:688] [Software::checkPageRetirement]
2026-01-16 09:50:03.010 WARN [3279516:3279516] gpuId 2 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_DBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:729] [Software::checkPageRetirement]
2026-01-16 09:50:03.010 WARN [3279516:3279516] gpuId 2 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_SBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:750] [Software::checkPageRetirement]
2026-01-16 09:50:03.010 WARN [3279516:3279516] gpuId 3 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_PENDING. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:688] [Software::checkPageRetirement]
2026-01-16 09:50:03.011 WARN [3279516:3279516] gpuId 3 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_DBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:729] [Software::checkPageRetirement]
2026-01-16 09:50:03.011 WARN [3279516:3279516] gpuId 3 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_SBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:750] [Software::checkPageRetirement]
2026-01-16 09:50:03.011 WARN [3279516:3279516] gpuId 4 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_PENDING. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:688] [Software::checkPageRetirement]
2026-01-16 09:50:03.012 WARN [3279516:3279516] gpuId 4 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_DBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:729] [Software::checkPageRetirement]
2026-01-16 09:50:03.012 WARN [3279516:3279516] gpuId 4 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_SBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:750] [Software::checkPageRetirement]
2026-01-16 09:50:03.012 WARN [3279516:3279516] gpuId 5 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_PENDING. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:688] [Software::checkPageRetirement]
2026-01-16 09:50:03.012 WARN [3279516:3279516] gpuId 5 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_DBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:729] [Software::checkPageRetirement]
2026-01-16 09:50:03.013 WARN [3279516:3279516] gpuId 5 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_SBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:750] [Software::checkPageRetirement]
2026-01-16 09:50:03.013 WARN [3279516:3279516] gpuId 6 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_PENDING. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:688] [Software::checkPageRetirement]
2026-01-16 09:50:03.013 WARN [3279516:3279516] gpuId 6 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_DBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:729] [Software::checkPageRetirement]
2026-01-16 09:50:03.014 WARN [3279516:3279516] gpuId 6 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_SBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:750] [Software::checkPageRetirement]
2026-01-16 09:50:03.014 WARN [3279516:3279516] gpuId 7 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_PENDING. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:688] [Software::checkPageRetirement]
2026-01-16 09:50:03.014 WARN [3279516:3279516] gpuId 7 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_DBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:729] [Software::checkPageRetirement]
2026-01-16 09:50:03.014 WARN [3279516:3279516] gpuId 7 returned status 0, value 9223372036854775794for DCGM_FI_DEV_RETIRED_SBE. Skipping this check. [/builds/dcgm/dcgm/nvvs/src/Software.cpp:750] [Software::checkPageRetirement]
2026-01-16 09:50:03.017 DEBUG [3279516:3279516] Test Page Retirement/Row Remap had result 0. Configless is true [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:222] [SoftwarePluginFramework::Run]
2026-01-16 09:50:03.017 DEBUG [3279516:3279516] Test Permissions and OS-related Blocks start [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:214] [SoftwarePluginFramework::Run]
2026-01-16 09:50:03.017 DEBUG [3279516:3279516] Set global parameter do_test -> permissions. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:50:03.017 DEBUG [3279516:3279516] Test Permissions and OS-related Blocks had result 0. Configless is true [/builds/dcgm/dcgm/nvvs/src/SoftwarePluginFramework.cpp:222] [SoftwarePluginFramework::Run]
2026-01-16 09:54:28.730 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542629373504 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:28.730 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542620980800 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:29.586 DEBUG [3279516:3280408] [[diagnostic]] computeTime: 181.9764530658722, InitBufferTime: 0 [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:1421] [GpuBurnWorker::run]
2026-01-16 09:54:29.586 DEBUG [3279516:3280408] [[diagnostic]] Thread id 671082048 stopped [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:314] [DcgmThread::RunInternal]
2026-01-16 09:54:29.587 DEBUG [3279516:3279516] [[diagnostic]] Thread 140540590941760 had m_alreadyJoined 0 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:29.587 DEBUG [3279516:3279516] [[diagnostic]] Done waiting for threads. should_stop=0, failedEarly=false [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:637] [GpuBurnPlugin::RunTest]
2026-01-16 09:54:29.587 DEBUG [3279516:3279516] [[diagnostic]] Thread 140545372317248 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:29.669 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542872622656 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:29.752 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542864229952 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:29.835 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542855837248 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:29.917 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542637766208 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:30.000 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542629373504 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:30.084 DEBUG [3279516:3279516] [[diagnostic]] Thread 140542620980800 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:30.167 DEBUG [3279516:3279516] [[diagnostic]] Thread 140540590941760 had m_alreadyJoined 1 [/builds/dcgm/dcgm/common/DcgmThread/DcgmThread.cpp:236] [DcgmThread::Wait]
2026-01-16 09:54:30.249 DEBUG [3279516:3279516] [[diagnostic]] Go::RunTest: result=true, should_stop=false [/builds/dcgm/dcgm/nvvs/plugin_src/diagnostic/DiagnosticPlugin.cpp:411] [GpuBurnPlugin::Go]
2026-01-16 09:54:30.249 DEBUG [3279516:3279516] Unregistering task 3279516/3279516 [/builds/dcgm/dcgm/common/HangDetectMonitor.cpp:416] [HangDetectMonitor::RemoveMonitoredTask]
2026-01-16 09:54:30.249 DEBUG [3279516:3279516] Deleted fingerprint for task 3279516 of pid 3279516 [/builds/dcgm/dcgm/common/FingerprintStore.cpp:124] [FingerprintStore::Delete]
2026-01-16 09:54:30.265 DEBUG [3279516:3279516] Called; errors = 0, info = 24, results = 8 [/builds/dcgm/dcgm/nvvs/src/PluginLibTest.cpp:228] [PluginLibTest::PopulateEntityResults]
2026-01-16 09:54:30.265 DEBUG [3279516:3279516] Plugin returned unknown type of aux data. Expected JSON_VALUE_AUX_DATA_TYPE (1), got 0 [/builds/dcgm/dcgm/nvvs/src/PluginLibTest.cpp:280] [PluginLibTest::PopulateEntityResults]
2026-01-16 09:54:30.267 INFO [3279516:3279516] Checking for common errors [/builds/dcgm/dcgm/nvvs/src/PluginCoreFunctionality.cpp:157] [PluginCoreFunctionality::PluginEnded]
2026-01-16 09:54:30.303 DEBUG [3279516:3279516] Test diagnostic had result 0. Configless is true [/builds/dcgm/dcgm/nvvs/src/TestFramework.cpp:957] [TestFramework::GoList]
2026-01-16 09:54:30.303 DEBUG [3279516:3279516] Writing diag info for test diagnostic [/builds/dcgm/dcgm/nvvs/src/TestFramework.cpp:855] [TestFramework::WriteDiagStatusToChannel]
2026-01-16 09:54:30.303 DEBUG [3279516:3279516] Test nvbandwidth start [/builds/dcgm/dcgm/nvvs/src/TestFramework.cpp:926] [TestFramework::GoList]
2026-01-16 09:54:30.344 DEBUG [3279516:3279516] Unable to retrieve fingerprint for task 3279516 of pid 3279516 [/builds/dcgm/dcgm/common/FingerprintStore.cpp:92] [FingerprintStore::Retrieve]
2026-01-16 09:54:30.344 DEBUG [3279516:3279516] Set global parameter is_allowed -> True. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:54:30.344 DEBUG [3279516:3279516] Set global parameter logfile_type -> 0.000000. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:54:30.346 ERROR [3279516:3279516] [[nvbandwidth]] The memory copy utilization for the GPU: 2 is greater than 10%. This may affect the results of the nvbandwidth test. [/builds/dcgm/dcgm/nvvs/plugin_src/nvbandwidth/NVBandwidthPlugin.cpp:275] [DcgmNs::Nvvs::Plugins::NVBandwidth::NVBandwidthPlugin::Go]
2026-01-16 09:54:30.346 WARN [3279516:3279516] Test nvbandwidth: The memory copy utilization for the GPU: 2 is greater than 10%. This may affect the results of the nvbandwidth test. (grpId:1, entityId:2) [/builds/dcgm/dcgm/nvvs/src/PluginTest.cpp:179] [PluginTest::AddError]
2026-01-16 09:54:30.346 DEBUG [3279516:3279516] Unregistering task 3279516/3279516 [/builds/dcgm/dcgm/common/HangDetectMonitor.cpp:416] [HangDetectMonitor::RemoveMonitoredTask]
2026-01-16 09:54:30.346 DEBUG [3279516:3279516] Deleted fingerprint for task 3279516 of pid 3279516 [/builds/dcgm/dcgm/common/FingerprintStore.cpp:124] [FingerprintStore::Delete]
2026-01-16 09:54:30.354 DEBUG [3279516:3279516] Called; errors = 1, info = 0, results = 8 [/builds/dcgm/dcgm/nvvs/src/PluginLibTest.cpp:228] [PluginLibTest::PopulateEntityResults]
2026-01-16 09:54:30.354 DEBUG [3279516:3279516] Plugin returned unknown type of aux data. Expected JSON_VALUE_AUX_DATA_TYPE (1), got 0 [/builds/dcgm/dcgm/nvvs/src/PluginLibTest.cpp:280] [PluginLibTest::PopulateEntityResults]
2026-01-16 09:54:30.355 INFO [3279516:3279516] Checking for common errors [/builds/dcgm/dcgm/nvvs/src/PluginCoreFunctionality.cpp:157] [PluginCoreFunctionality::PluginEnded]
2026-01-16 09:54:30.534 DEBUG [3279516:3279516] Test nvbandwidth had result 2. Configless is true [/builds/dcgm/dcgm/nvvs/src/TestFramework.cpp:957] [TestFramework::GoList]
2026-01-16 09:54:30.534 DEBUG [3279516:3279516] Writing diag info for test nvbandwidth [/builds/dcgm/dcgm/nvvs/src/TestFramework.cpp:855] [TestFramework::WriteDiagStatusToChannel]
2026-01-16 09:54:30.535 DEBUG [3279516:3279516] Test pcie start [/builds/dcgm/dcgm/nvvs/src/TestFramework.cpp:926] [TestFramework::GoList]
2026-01-16 09:54:31.116 DEBUG [3279516:3279516] Unable to retrieve fingerprint for task 3279516 of pid 3279516 [/builds/dcgm/dcgm/common/FingerprintStore.cpp:92] [FingerprintStore::Retrieve]
2026-01-16 09:54:31.130 DEBUG [3279516:3279516] [[pcie]] Using copy sizes for GPUs prior to Blackwell [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Pcie.cpp:792] [BusGrind::SetCopySizes]
2026-01-16 09:54:31.130 DEBUG [3279516:3279516] Set global parameter is_allowed -> True. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:54:31.130 DEBUG [3279516:3279516] Set global parameter logfile_type -> 0.000000. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:54:31.130 DEBUG [3279516:3279516] Set global parameter target_stress -> 29250.000000. st 0 [/builds/dcgm/dcgm/nvvs/src/TestParameters.cpp:284] [TestParameters::SetString]
2026-01-16 09:54:31.136 DEBUG [3279516:3279516] [[pcie]] Testing GPUs 0 and 1 for p2p issues. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:235] [Brokenp2p::RunTest]
2026-01-16 09:54:31.511 DEBUG [3279516:3279516] [[pcie]] Memory device 0 and p2p writer device 1 passed. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:261] [Brokenp2p::RunTest]
2026-01-16 09:54:31.511 DEBUG [3279516:3279516] [[pcie]] Testing GPUs 0 and 2 for p2p issues. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:235] [Brokenp2p::RunTest]
2026-01-16 09:54:32.018 DEBUG [3279516:3279516] [[pcie]] Memory device 0 and p2p writer device 2 passed. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:261] [Brokenp2p::RunTest]
2026-01-16 09:54:32.018 DEBUG [3279516:3279516] [[pcie]] Testing GPUs 0 and 3 for p2p issues. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:235] [Brokenp2p::RunTest]
2026-01-16 09:54:32.516 DEBUG [3279516:3279516] [[pcie]] Memory device 0 and p2p writer device 3 passed. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:261] [Brokenp2p::RunTest]
2026-01-16 09:54:32.516 DEBUG [3279516:3279516] [[pcie]] Testing GPUs 0 and 4 for p2p issues. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:235] [Brokenp2p::RunTest]
2026-01-16 09:54:33.015 DEBUG [3279516:3279516] [[pcie]] Memory device 0 and p2p writer device 4 passed. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:261] [Brokenp2p::RunTest]
2026-01-16 09:54:33.015 DEBUG [3279516:3279516] [[pcie]] Testing GPUs 0 and 5 for p2p issues. [/builds/dcgm/dcgm/nvvs/plugin_src/pcie/Brokenp2p.cpp:235] [Brokenp2p::RunTest]