fix: preserve per-device rows for multi-GPU PID attribution #123
Open
wwoodsTM wants to merge 1 commit into lablup:main
Conversation
Two bugs caused incorrect process reporting on multi-GPU systems:
1. NVML collection (nvidia.rs): A PID appearing as both a compute and
graphics process, or on multiple devices, was mishandled. The
!gpu_pids.contains() guard on graphics processes silently dropped
entries when the PID was already seen on any device as a compute
process. Additionally, using a flat Vec allowed the same (PID,
device) pair reported by both compute and graphics lists to be
double-counted. Fix: collect into a HashMap<(pid, device_uuid)> and
merge overlapping entries with max() to avoid inflation.
2. Shared merge logic (process_list.rs): merge_gpu_processes() keyed
only by PID, collapsing a process spanning multiple GPUs into a
single row and losing per-device memory and utilization attribution.
Fix: re-key by (PID, device_uuid) to preserve one row per process
per device.
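The map-and-merge described in point 1 can be sketched as follows. This is a minimal illustration, not the crate's actual code: the struct name, field names, and the `merge_entry` helper are all hypothetical stand-ins for whatever `merge_nvml_process_entry()` operates on.

```rust
use std::collections::HashMap;

// Hypothetical per-process GPU sample; field names are illustrative,
// not the crate's real types.
#[derive(Debug, Clone)]
struct GpuProcEntry {
    used_memory_bytes: u64,
    utilization_pct: u32,
}

// Merge an observed entry into a map keyed by (pid, device_uuid).
// If the same (pid, device) pair shows up in both the compute and
// graphics process lists, take the max of each field rather than
// summing, so the row is not double-counted; distinct devices for
// the same PID remain separate rows.
fn merge_entry(
    map: &mut HashMap<(u32, String), GpuProcEntry>,
    pid: u32,
    device_uuid: &str,
    entry: GpuProcEntry,
) {
    map.entry((pid, device_uuid.to_string()))
        .and_modify(|e| {
            e.used_memory_bytes = e.used_memory_bytes.max(entry.used_memory_bytes);
            e.utilization_pct = e.utilization_pct.max(entry.utilization_pct);
        })
        .or_insert(entry);
}
```

With this shape, a PID seen on two devices yields two map entries, while compute and graphics reports for the same (PID, device) collapse into one entry via `max()`.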
Changes:
- nvidia.rs:
- Replace Vec with HashMap<(pid, device_uuid)> in
get_gpu_processes_nvml()
- Add merge_nvml_process_entry() to coalesce overlapping
compute/graphics rows per (pid, device) using max()
- Remove incorrect !gpu_pids.contains() guard that dropped valid
multi-device graphics entries
- Add unit tests for NVML-level merge behavior
- process_list.rs:
- Re-key merge_gpu_processes() by (PID, device_uuid)
- Change signature from &mut Vec to Vec -> Vec
- Add unit tests for multi-GPU, dedup, non-GPU, and orphan cases
- nvidia_jetson.rs, tenstorrent.rs, local_collector.rs:
- Update call sites for new signature
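The re-keying and the Vec -> Vec signature change in merge_gpu_processes() can be sketched like this. The row type and its fields are invented for illustration; the real function merges vendor GPU data into the shared process list rather than operating on a single flat struct.

```rust
use std::collections::HashMap;

// Illustrative row type; the crate's real structs differ.
#[derive(Debug, Clone)]
struct ProcRow {
    pid: u32,
    device_uuid: Option<String>, // None for processes with no GPU usage
    gpu_mem_bytes: u64,
}

// Consume the input rows and return one row per (pid, device_uuid),
// instead of collapsing a process spanning multiple GPUs into a
// single PID-keyed row. Duplicate (pid, device) rows are deduped
// with max() so nothing is double-counted.
fn merge_gpu_processes(rows: Vec<ProcRow>) -> Vec<ProcRow> {
    let mut merged: HashMap<(u32, Option<String>), ProcRow> = HashMap::new();
    for row in rows {
        merged
            .entry((row.pid, row.device_uuid.clone()))
            .and_modify(|r| r.gpu_mem_bytes = r.gpu_mem_bytes.max(row.gpu_mem_bytes))
            .or_insert(row);
    }
    merged.into_values().collect()
}
```

Taking ownership of the Vec and returning a new one (instead of mutating through &mut Vec) makes the re-keying a pure transformation, which is also easier to unit-test.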
First off, I really appreciate your project's GPU-agnostic approach, and thank you for making it open source! I am submitting this PR because I noticed some issues with per-process reporting on multi-GPU setups, and it attempts to address them.
Summary
- NVML collection now merges overlapping compute/graphics rows per (PID, device) using max() to avoid double-counting
- merge_gpu_processes() keyed only by PID, losing per-device memory and utilization attribution; it is now re-keyed by (PID, device_uuid) to preserve one row per process per device
Tests
- cargo test — all 46 tests pass (including 6 new unit tests)
- cargo fmt --check — clean
- cargo clippy -- -D warnings — clean
More info
In my original usage and testing with 2 GPUs, I noticed that processes like llama.cpp's llama-server, which use both of my GPUs, would show up as only one entry tied to a single GPU. I saw the same behavior both in the terminal "local" GUI and through the API's metrics endpoint. I verified the correct split by running nvidia-smi manually and by checking llama-server's own output; both clearly reported VRAM usage split across both GPUs for the single llama-server PID.
Anyway, I believe I traced the issues described above to two separate bugs: one in the NVML collection path (though not in the nvidia-smi fallback), where apparently only one VRAM data point is allowed to map to a PID and anything else gets dropped, and another in the merging of the info for the UI. As far as I can tell, the NVML mapping issue does not exist in any of the other vendor-specific collectors.