fix: preserve per-device rows for multi-GPU PID attribution #123
Open
wwoodsTM wants to merge 1 commit into lablup:main
Conversation
Two bugs caused incorrect process reporting on multi-GPU systems:
1. NVML collection (nvidia.rs): A PID appearing as both a compute and
graphics process, or on multiple devices, was mishandled. The
!gpu_pids.contains() guard on graphics processes silently dropped
entries when the PID was already seen on any device as a compute
process. Additionally, using a flat Vec allowed the same (PID,
device) pair reported by both compute and graphics lists to be
double-counted. Fix: collect into a HashMap<(pid, device_uuid)> and
merge overlapping entries with max() to avoid inflation.
2. Shared merge logic (process_list.rs): merge_gpu_processes() keyed
only by PID, collapsing a process spanning multiple GPUs into a
single row and losing per-device memory and utilization attribution.
Fix: re-key by (PID, device_uuid) to preserve one row per process
per device.
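The map-and-merge described in point 1 can be sketched as follows. This is a minimal illustration, not the crate's actual code: the struct name, field names, and the `merge_entry` helper are all hypothetical stand-ins for whatever `merge_nvml_process_entry()` operates on.

```rust
use std::collections::HashMap;

// Hypothetical per-process GPU sample; field names are illustrative,
// not the crate's real types.
#[derive(Debug, Clone)]
struct GpuProcEntry {
    used_memory_bytes: u64,
    utilization_pct: u32,
}

// Merge an observed entry into a map keyed by (pid, device_uuid).
// If the same (pid, device) pair shows up in both the compute and
// graphics process lists, take the max of each field rather than
// summing, so the row is not double-counted; distinct devices for
// the same PID remain separate rows.
fn merge_entry(
    map: &mut HashMap<(u32, String), GpuProcEntry>,
    pid: u32,
    device_uuid: &str,
    entry: GpuProcEntry,
) {
    map.entry((pid, device_uuid.to_string()))
        .and_modify(|e| {
            e.used_memory_bytes = e.used_memory_bytes.max(entry.used_memory_bytes);
            e.utilization_pct = e.utilization_pct.max(entry.utilization_pct);
        })
        .or_insert(entry);
}
```

With this shape, a PID seen on two devices yields two map entries, while compute and graphics reports for the same (PID, device) collapse into one entry via `max()`.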
Changes:
- nvidia.rs:
- Replace Vec with HashMap<(pid, device_uuid)> in
get_gpu_processes_nvml()
- Add merge_nvml_process_entry() to coalesce overlapping
compute/graphics rows per (pid, device) using max()
- Remove incorrect !gpu_pids.contains() guard that dropped valid
multi-device graphics entries
- Add unit tests for NVML-level merge behavior
- process_list.rs:
- Re-key merge_gpu_processes() by (PID, device_uuid)
- Change signature from &mut Vec to Vec -> Vec
- Add unit tests for multi-GPU, dedup, non-GPU, and orphan cases
- nvidia_jetson.rs, tenstorrent.rs, local_collector.rs:
- Update call sites for new signature
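The re-keying and the Vec -> Vec signature change in merge_gpu_processes() can be sketched like this. The row type and its fields are invented for illustration; the real function merges vendor GPU data into the shared process list rather than operating on a single flat struct.

```rust
use std::collections::HashMap;

// Illustrative row type; the crate's real structs differ.
#[derive(Debug, Clone)]
struct ProcRow {
    pid: u32,
    device_uuid: Option<String>, // None for processes with no GPU usage
    gpu_mem_bytes: u64,
}

// Consume the input rows and return one row per (pid, device_uuid),
// instead of collapsing a process spanning multiple GPUs into a
// single PID-keyed row. Duplicate (pid, device) rows are deduped
// with max() so nothing is double-counted.
fn merge_gpu_processes(rows: Vec<ProcRow>) -> Vec<ProcRow> {
    let mut merged: HashMap<(u32, Option<String>), ProcRow> = HashMap::new();
    for row in rows {
        merged
            .entry((row.pid, row.device_uuid.clone()))
            .and_modify(|r| r.gpu_mem_bytes = r.gpu_mem_bytes.max(row.gpu_mem_bytes))
            .or_insert(row);
    }
    merged.into_values().collect()
}
```

Taking ownership of the Vec and returning a new one (instead of mutating through &mut Vec) makes the re-keying a pure transformation, which is also easier to unit-test.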
First off, I really appreciate your project's GPU-agnostic approach, and thank you for making it open source! I am submitting this PR because I noticed some issues with per-process reporting on multi-GPU setups, and it attempts to address them.
Summary
- NVML collection now merges overlapping compute/graphics rows per (PID, device) using max() to avoid double-counting
- merge_gpu_processes() keyed only by PID, losing per-device memory and utilization attribution; it is now re-keyed by (PID, device_uuid) to preserve one row per process per device
Tests
- cargo test — all 46 tests pass (including 6 new unit tests)
- cargo fmt --check — clean
- cargo clippy -- -D warnings — clean
More info
In my original usage and testing with 2 GPUs, I noticed that processes like llama.cpp's llama-server, which use both of my GPUs, would show up as only one entry tied to a single GPU. I saw the same behavior both in the terminal "local" GUI and through the API's metrics endpoint. I verified the correct split by running nvidia-smi manually and by checking llama-server's own output; both clearly reported VRAM usage split across both GPUs for the single llama-server PID.
Anyway, I believe I traced the issues described above to two separate bugs: one in the NVML collection path (though not in the nvidia-smi fallback), where apparently only one VRAM data point is allowed to map to a PID and anything else gets dropped, and another in the merging of the info for the UI. As far as I can tell, the NVML mapping issue does not exist in any of the other vendor-specific collectors.