
fix: preserve per-device rows for multi-GPU PID attribution#123

Open
wwoodsTM wants to merge 1 commit into lablup:main from wwoodsTM:bugfix/multi-gpu-processes

Conversation

@wwoodsTM

First off, I really appreciate your project's GPU-agnostic approach, and thank you for making it open source! I am submitting this PR because I noticed some issues with per-process reporting on multi-GPU setups, and this attempts to address them.

Summary

  • A single PID spanning multiple GPUs was collapsed into one row because merge_gpu_processes() keyed only by PID, losing per-device memory and utilization attribution
  • Re-key by (PID, device_uuid) to preserve one row per process per device
  • Deduplicate overlapping NVML compute/graphics rows for the same (PID, device) using max() to avoid double-counting

Tests

  • cargo test — all 46 tests pass (including 6 new unit tests)
  • cargo fmt --check — clean
  • cargo clippy -- -D warnings — clean
  • Verified correct multi-GPU attribution via both API and GUI with llama-server spanning two GPUs

More info

In my original usage/testing with 2 GPUs, I noticed that processes like llama.cpp's llama-server that utilize both of my GPUs would show up as only one entry tied to a single GPU. I saw the same behavior both in the terminal "local" GUI and through the API's metrics endpoint. I verified the correct split by manually running nvidia-smi and by checking llama-server's own output; both clearly reported VRAM usage split across both GPUs for the single llama-server PID.

Anyway, I believe I traced the issues described above to two separate bugs. The first is in the NVML collection path (the nvidia-smi fallback is unaffected), where only one VRAM data point is allowed to map to a given PID and anything else gets dropped. The second is in the merging of process info for the UI. From what I can tell, the NVML mapping issue does not exist in any of the other vendor-specific code.

Two bugs caused incorrect process reporting on multi-GPU systems:

1. NVML collection (nvidia.rs): A PID appearing as both a compute and
   graphics process, or on multiple devices, was mishandled. The
   !gpu_pids.contains() guard on graphics processes silently dropped
   entries when the PID was already seen on any device as a compute
   process. Additionally, using a flat Vec allowed the same (PID,
   device) pair reported by both compute and graphics lists to be
   double-counted. Fix: collect into a HashMap<(pid, device_uuid)> and
   merge overlapping entries with max() to avoid inflation.

2. Shared merge logic (process_list.rs): merge_gpu_processes() keyed
   only by PID, collapsing a process spanning multiple GPUs into a
   single row and losing per-device memory and utilization attribution.
   Fix: re-key by (PID, device_uuid) to preserve one row per process
   per device.
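To illustrate the NVML-side fix (bug 1), here is a minimal, hedged sketch. The struct and function names below are stand-ins, not the crate's actual types: rows from the compute and graphics lists are collected into a map keyed by (pid, device_uuid), and overlapping metrics for the same pair are merged with max() rather than summed, so the same process reported by both lists is not double-counted while separate devices keep separate rows.

```rust
use std::collections::HashMap;

// Illustrative stand-in for an NVML process row; the crate's real
// struct and field names differ.
#[derive(Debug, Clone)]
struct NvmlRow {
    pid: u32,
    device_uuid: String,
    used_memory: u64,
    utilization: u32,
}

/// Coalesce a row into a map keyed by (pid, device_uuid). When the
/// compute and graphics lists both report the same pair, keep the
/// max() of each metric instead of summing (which would double-count).
fn merge_nvml_row(map: &mut HashMap<(u32, String), NvmlRow>, row: NvmlRow) {
    let (mem, util) = (row.used_memory, row.utilization);
    map.entry((row.pid, row.device_uuid.clone()))
        .and_modify(|e| {
            e.used_memory = e.used_memory.max(mem);
            e.utilization = e.utilization.max(util);
        })
        .or_insert(row);
}

fn main() {
    let mut map = HashMap::new();
    // Same PID on device A via both the compute and graphics lists:
    merge_nvml_row(&mut map, NvmlRow { pid: 42, device_uuid: "GPU-A".into(), used_memory: 1024, utilization: 30 });
    merge_nvml_row(&mut map, NvmlRow { pid: 42, device_uuid: "GPU-A".into(), used_memory: 1024, utilization: 55 });
    // Same PID on device B stays a separate row:
    merge_nvml_row(&mut map, NvmlRow { pid: 42, device_uuid: "GPU-B".into(), used_memory: 2048, utilization: 10 });

    assert_eq!(map.len(), 2);
    assert_eq!(map[&(42, "GPU-A".to_string())].used_memory, 1024); // max(), not sum
    assert_eq!(map[&(42, "GPU-A".to_string())].utilization, 55);
    println!("rows: {}", map.len()); // prints "rows: 2"
}
```

With a flat Vec, the third call above would either be dropped (the old `!gpu_pids.contains()` guard) or appended as a duplicate; the keyed map makes both outcomes impossible by construction.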

Changes:
- nvidia.rs:
  - Replace Vec with HashMap<(pid, device_uuid)> in
    get_gpu_processes_nvml()
  - Add merge_nvml_process_entry() to coalesce overlapping
    compute/graphics rows per (pid, device) using max()
  - Remove incorrect !gpu_pids.contains() guard that dropped valid
    multi-device graphics entries
  - Add unit tests for NVML-level merge behavior
- process_list.rs:
  - Re-key merge_gpu_processes() by (PID, device_uuid)
  - Change signature from &mut Vec to Vec -> Vec
  - Add unit tests for multi-GPU, dedup, non-GPU, and orphan cases
- nvidia_jetson.rs, tenstorrent.rs, local_collector.rs:
  - Update call sites for new signature
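The process_list.rs change (bug 2) can be sketched roughly as follows, again with hypothetical simplified types rather than the crate's real ones: merge_gpu_processes() consumes and returns a Vec per the new signature, and keys rows by (PID, device_uuid) instead of PID alone, so a process spanning two GPUs keeps one row per device while exact duplicates still collapse.

```rust
use std::collections::HashMap;

// Hypothetical simplified process row; the real type has more fields.
#[derive(Debug, Clone)]
struct ProcRow {
    pid: u32,
    device_uuid: Option<String>, // None for non-GPU processes
    gpu_memory: u64,
}

/// Keyed by (pid, device_uuid) instead of pid alone, so a multi-GPU
/// process keeps one row per device; duplicate (pid, device) rows are
/// merged with max(). Takes Vec and returns Vec, per the new signature.
fn merge_gpu_processes(rows: Vec<ProcRow>) -> Vec<ProcRow> {
    let mut map: HashMap<(u32, Option<String>), ProcRow> = HashMap::new();
    for row in rows {
        let mem = row.gpu_memory;
        map.entry((row.pid, row.device_uuid.clone()))
            .and_modify(|e| e.gpu_memory = e.gpu_memory.max(mem))
            .or_insert(row);
    }
    map.into_values().collect()
}

fn main() {
    let rows = vec![
        ProcRow { pid: 7, device_uuid: Some("GPU-A".into()), gpu_memory: 4096 },
        ProcRow { pid: 7, device_uuid: Some("GPU-B".into()), gpu_memory: 6144 },
        ProcRow { pid: 7, device_uuid: Some("GPU-A".into()), gpu_memory: 4096 }, // duplicate pair
        ProcRow { pid: 9, device_uuid: None, gpu_memory: 0 }, // non-GPU process
    ];
    let merged = merge_gpu_processes(rows);
    assert_eq!(merged.len(), 3); // pid 7 on two devices, plus pid 9
    println!("{} rows after merge", merged.len()); // prints "3 rows after merge"
}
```

Under the old pid-only keying, the three pid-7 rows above would have collapsed into one, which is exactly the llama-server symptom described in "More info".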
@cla-assistant

cla-assistant bot commented Feb 17, 2026

CLA assistant check
All committers have signed the CLA.
