Linux tool for observing CUDA Unified Memory behavior at runtime.
Focus: page fault migration, host↔device data movement (HtoD / DtoH), and memory pressure signals.
Runs controlled memory access patterns and collects runtime signals using CUDA, CUPTI, and NVML.
Exposes:
- GPU page faults
- Host ↔ Device migration (bytes_htod / bytes_dtoh)
- Fault distribution (steady vs burst)
- Migration overhead (MAF, BPF)
- Pressure and stability signals
- Residency vs migration behavior
Unified Memory simplifies development, but obscures key runtime behaviors:
- when data is actually moving
- how often faults occur
- whether memory is stable or thrashing
- why systems slow down or become unresponsive under load
This tool makes those behaviors observable.
One observed failure mode on unified memory platforms is systems becoming unresponsive under memory pressure instead of raising a CUDA OOM error.
This tool measures the memory conditions and signal patterns that precede that state.
The analyzer runs three passes:
1. Fresh process (no GPU context)
   - First-touch page faults (CPU → GPU)
   - Measures the actual migration path
2. Prefetch to GPU
   - Measures steady-state resident access
   - Confirms residency stability
3. Sustained access at the highest safe memory ratio
   - Detects thrashing, oscillation, and instability under load
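The first two passes can be sketched roughly as follows. This is a simplified illustration, not the analyzer's actual code: the real tool wraps these accesses in CUPTI/NVML instrumentation, and the third pass repeats the sustained access at increasing memory ratios. The `int`-device overload of `cudaMemPrefetchAsync` is the CUDA 12-style signature.

```cuda
#include <cuda_runtime.h>

__global__ void touch(float* p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;   // GPU access; unmigrated pages fault HtoD
}

int main() {
    const size_t n = 64ull << 20;            // 64 Mi floats = 256 MiB
    float* p = nullptr;
    cudaMallocManaged(&p, n * sizeof(float));

    // Pass 1: first touch on the CPU, then let the GPU fault pages over.
    for (size_t i = 0; i < n; ++i) p[i] = 0.0f;
    touch<<<(n + 255) / 256, 256>>>(p, n);
    cudaDeviceSynchronize();

    // Pass 2: explicit prefetch, then measure steady resident access.
    int dev = 0;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(p, n * sizeof(float), dev);
    touch<<<(n + 255) / 256, 256>>>(p, n);
    cudaDeviceSynchronize();

    cudaFree(p);
    return 0;
}
```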
Verdict is derived from combined signals, not a single metric.
MAF (migration amplification factor) = (htod_bytes + dtoh_bytes) / total_pass_bytes
- ~1.0 → resident behavior
- 2–4 → moderate migration overhead
- >4 → heavy migration amplification
Note: On PCIe systems, elevated MAF alone does not indicate a failure condition.
High MAF values should be interpreted alongside fault rate, burst behavior, and residency signals.
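As a rough illustration, the MAF computation and the thresholds above might look like this. The function names are assumptions for the example, not the tool's actual API.

```cpp
#include <cassert>
#include <string>

// Migration Amplification Factor (MAF): total migrated bytes relative to
// the bytes a pass actually touched.
double maf(double htod_bytes, double dtoh_bytes, double total_pass_bytes) {
    return (htod_bytes + dtoh_bytes) / total_pass_bytes;
}

// Thresholds mirror the ranges listed above.
std::string classify_maf(double m) {
    if (m < 2.0)  return "resident";             // ~1.0: data stays on-device
    if (m <= 4.0) return "moderate_migration";   // 2-4: pages move, then settle
    return "heavy_amplification";                // >4: pages ping-pong per pass
}
```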
bpf_htod = htod_bytes / gpu_page_faults
bpf_total = (htod_bytes + dtoh_bytes) / gpu_page_faults
BPF (bytes per fault) describes migration efficiency: how many bytes each GPU page fault actually moves.
fault_pressure_index = fault_rate_per_sec * fault_burst_ratio
Used to detect sustained pressure and burst behavior.
direction_ratio = total_htod / total_dtoh
Helps distinguish forward migration (CPU → GPU) from eviction / fallback (GPU → CPU).
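The derived signals above can be sketched together as plain arithmetic over the raw counters. Struct and function names are invented for this example; only the formulas come from the definitions above.

```cpp
#include <cassert>
#include <cstdint>

// Derived Unified Memory signals, computed exactly as defined above.
struct UmSignals {
    double bpf_htod = 0.0;              // HtoD bytes per GPU page fault
    double bpf_total = 0.0;             // total migrated bytes per fault
    double fault_pressure_index = 0.0;  // sustained-pressure indicator
    double direction_ratio = 0.0;       // >1 -> forward migration dominates
};

UmSignals derive(uint64_t htod_bytes, uint64_t dtoh_bytes,
                 uint64_t gpu_page_faults,
                 double fault_rate_per_sec, double fault_burst_ratio) {
    UmSignals s;
    if (gpu_page_faults > 0) {
        s.bpf_htod  = double(htod_bytes) / double(gpu_page_faults);
        s.bpf_total = double(htod_bytes + dtoh_bytes) / double(gpu_page_faults);
    }
    s.fault_pressure_index = fault_rate_per_sec * fault_burst_ratio;
    if (dtoh_bytes > 0)
        s.direction_ratio = double(htod_bytes) / double(dtoh_bytes);
    return s;
}
```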
- Per-pass summaries
- Ratio scaling behavior
- Final verdict
`runs/um_YYYYMMDD_HHMMSS_GPU0_<uuid>/run.json`
Includes:
- gpu_page_faults
- bytes_htod / bytes_dtoh
- maf
- fault_rate_per_sec
- fault_burst_ratio
- pressure_score
- transport (PCIe / NVLink / UMA)
- platform_caps
- stability indicators
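A hypothetical abbreviated `run.json` excerpt, with illustrative values only (field names taken from the list above; the real file contains additional fields such as `platform_caps` and stability indicators):

```json
{
  "gpu_page_faults": 18432,
  "bytes_htod": 268435456,
  "bytes_dtoh": 33554432,
  "maf": 1.12,
  "fault_rate_per_sec": 2150.4,
  "fault_burst_ratio": 1.3,
  "pressure_score": 0.18,
  "transport": "PCIe"
}
```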
| Verdict | Meaning |
|---|---|
| HEALTHY | Stable, resident behavior |
| HEALTHY_LIMITED | Stable but memory constrained |
| MIGRATION_PRESSURE | Elevated migration under load |
| UM_THRASHING | Active instability boundary |
| DEGRADED | Hardware-level issue |
| CRITICAL | Fatal / unrecoverable state |
Platform paradigms are detected at runtime; nothing is hardcoded.
| Paradigm | Meaning |
|---|---|
| FULL_EXPLICIT | Discrete GPU, PCIe migration |
| FULL_HARDWARE_COHERENT | Unified DRAM systems (DGX Spark GB10, Grace Blackwell) |
| FULL_SOFTWARE_COHERENT | OS-managed coherence |
Transport awareness:
- PCIe → migration cost dominated by interconnect bandwidth and latency
- NVLink → lower latency and higher bandwidth relative to PCIe
- UMA → shared memory pool; pressure replaces explicit migration
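One plausible way to detect the paradigm and transport characteristics at runtime is through CUDA device attributes. This is an assumption about how such detection could work, not necessarily this tool's exact logic.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Rough paradigm classification from managed-memory device attributes.
const char* detect_paradigm(int dev) {
    int concurrent = 0, pageable = 0, host_tables = 0;
    cudaDeviceGetAttribute(&concurrent,
                           cudaDevAttrConcurrentManagedAccess, dev);
    cudaDeviceGetAttribute(&pageable,
                           cudaDevAttrPageableMemoryAccess, dev);
    cudaDeviceGetAttribute(&host_tables,
                           cudaDevAttrPageableMemoryAccessUsesHostPageTables, dev);

    if (pageable && host_tables)
        return "FULL_HARDWARE_COHERENT";  // CPU/GPU share page tables (UMA)
    if (pageable)
        return "FULL_SOFTWARE_COHERENT";  // OS/driver-managed coherence
    (void)concurrent;                     // still useful for capability checks
    return "FULL_EXPLICIT";               // discrete, fault-driven migration
}

int main() {
    printf("%s\n", detect_paradigm(0));
    return 0;
}
```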
Requirements:
- Linux (x86_64 / aarch64)
- CUDA 12.x or 13.x
- NVML (`libnvidia-ml`)
- CUPTI (`libcupti`)
- C++17
Compile:
```sh
nvcc -O2 -std=c++17 \
  -I/usr/local/cuda/include \
  um_analyzer.cu \
  -o um_analyzer \
  -lcudart -lcupti -lnvidia-ml
```

Run:

```sh
./um_analyzer
```

Options:
--device N test a specific GPU index
--all-devices test all GPUs
--list-devices list available GPUs
--cupti-debug dump raw CUPTI UM records to stderr
--cupti-debug writes every CUPTI unified memory activity record to stderr —
counter kind, value, timestamps. Useful for understanding what signals are
present on a given platform, particularly on unified memory architectures
where behavior differs from discrete PCIe systems.
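The counters behind these records are enabled through the standard CUPTI unified-memory counter API. A minimal sketch of that setup follows; buffer-callback registration and error checking are omitted for brevity, and the set of counters is an assumption based on the signals this tool reports.

```cuda
#include <cupti.h>
#include <cstring>

// Enable the Unified Memory counters the tool reads: HtoD/DtoH transfer
// bytes and GPU page faults, scoped to a single device in this process.
void enable_um_counters(unsigned device_id) {
    CUpti_ActivityUnifiedMemoryCounterKind kinds[3] = {
        CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_BYTES_TRANSFER_HTOD,
        CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_BYTES_TRANSFER_DTOH,
        CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_GPU_PAGE_FAULT,
    };

    CUpti_ActivityUnifiedMemoryCounterConfig cfg[3];
    memset(cfg, 0, sizeof(cfg));
    for (int i = 0; i < 3; ++i) {
        cfg[i].scope =
            CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_SCOPE_PROCESS_SINGLE_DEVICE;
        cfg[i].kind = kinds[i];
        cfg[i].deviceId = device_id;
        cfg[i].enable = 1;
    }

    cuptiActivityConfigureUnifiedMemoryCounter(cfg, 3);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER);
}
```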
Core functionality is implemented using the CUDA, CUPTI, and NVML APIs.
Behavior varies across architectures, drivers, and system configurations. The tool is actively exercised on different systems to understand how Unified Memory signals behave under real workloads, particularly on newer platforms where migration and memory semantics differ from discrete PCIe systems.
Findings are used to investigate real failure modes and identify their underlying causes:
- unexpected slowdowns
- memory pressure instability
- systems becoming unresponsive instead of reporting CUDA OOM
The goal is to make these behaviors observable and explainable as results are collected across architectures and configurations.
| Platform | Paradigm | Status |
|---|---|---|
| Discrete GPU, PCIe | FULL_EXPLICIT | Validated (multiple generations) |
| DGX Spark GB10, coherent UMA | FULL_HARDWARE_COHERENT | Detection implemented, runtime validation in progress |
spark-gpu-throttle-check
GPU power and thermal diagnostics for DGX Spark systems. Helps identify power
delivery limits (such as USB PD constraints) that may reduce clocks under
sustained load.
cuda-unified-memory-analyzer fork
Community fork by adi-sonusflow. Patches, CUPTI signal interpretation, and
execution model changes from this fork directly influenced the debugging approach
and runtime signal analysis used in this project.
parallelArchitect
Human-directed GPU engineering with AI assistance.
MIT