Problem Description
amd-smi metric --mem reports USED_VRAM that does not update when memory is allocated or freed. The value remains constant regardless of GPU memory operations. In contrast, hipMemGetInfo() correctly reports real-time memory usage.
Impact
- torch.cuda.device_memory_used() returns inaccurate data on ROCm because it relies on amdsmi_get_gpu_vram_usage().
- Any other memory monitoring tool that uses amd-smi to track real-time GPU memory usage is affected as well.
Operating System
Linux (kernel 6.8.0)
CPU
AMD EPYC 9575F 64-Core Processor
GPU
AMD Instinct MI355X (gfx950)
ROCm Version
7.0.0
ROCm Component
amdsmi
Steps to Reproduce
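import subprocess
import torch

def get_amd_smi_vram_mb():
    # Read USED_VRAM (in MB) for GPU 0 from the amd-smi CLI output.
    result = subprocess.run(['amd-smi', 'metric', '--gpu', '0', '--mem'],
                            capture_output=True, text=True)
    for line in result.stdout.split('\n'):
        # Skip the USED_VISIBLE_VRAM line; only USED_VRAM is wanted.
        if 'USED_VRAM' in line and 'VISIBLE' not in line:
            return int(line.split(':')[1].strip().replace(' MB', ''))
    return -1  # USED_VRAM line not found

def get_hip_mem_used_mb():
    # Used memory (in MB) as reported by hipMemGetInfo via PyTorch.
    free, total = torch.cuda.mem_get_info(0)
    return (total - free) // (1024 * 1024)
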
# Initial state
amd_initial = get_amd_smi_vram_mb()
hip_initial = get_hip_mem_used_mb()
print(f"Initial: amd-smi={amd_initial} MB, hipMemGetInfo={hip_initial} MB")
# Allocate 2GB
tensor = torch.zeros(2_000_000_000, dtype=torch.uint8, device='cuda:0')
torch.cuda.synchronize()
amd_after = get_amd_smi_vram_mb()
hip_after = get_hip_mem_used_mb()
print(f"After 2GB alloc: amd-smi={amd_after} MB (Δ={amd_after-amd_initial}), hipMemGetInfo={hip_after} MB (Δ={hip_after-hip_initial})")
# Free
del tensor
torch.cuda.empty_cache()
amd_free = get_amd_smi_vram_mb()
hip_free = get_hip_mem_used_mb()
print(f"After free: amd-smi={amd_free} MB, hipMemGetInfo={hip_free} MB")
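The parsing step in the script above uses substring checks, which would raise an exception if the line format differed slightly. A slightly more defensive variant is sketched below. It assumes the same line shape the script already relies on (`USED_VRAM: <number> MB`). The function name `parse_used_vram_mb` and the sample text are hypothetical and for illustration only; the sample is not captured amd-smi output.

```python
import re
from typing import Optional

# Assumed line shape, inferred from the parsing in the reproduction script:
#   USED_VRAM: 283 MB
# Anchoring at the start of the line means USED_VISIBLE_VRAM is not matched.
_USED_VRAM_RE = re.compile(r"^\s*USED_VRAM\s*:\s*(\d+)\s*MB\s*$")


def parse_used_vram_mb(text: str) -> Optional[int]:
    """Return the USED_VRAM value in MB, or None if no matching line exists."""
    for line in text.splitlines():
        match = _USED_VRAM_RE.match(line)
        if match:
            return int(match.group(1))
    return None


# Illustrative input only (hypothetical, not real amd-smi output).
sample = "    USED_VISIBLE_VRAM: 150 MB\n    USED_VRAM: 283 MB\n"
used_mb = parse_used_vram_mb(sample)
missing = parse_used_vram_mb("no memory fields here")
```

Returning `None` rather than `-1` makes a parse failure harder to confuse with a real reading when deltas are computed later.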
Expected Output
- Initial: amd-smi=300 MB, hipMemGetInfo=634 MB
- After 2GB alloc: amd-smi=2348 MB (Δ=2048), hipMemGetInfo=2692 MB (Δ=2058)
- After free: amd-smi=300 MB, hipMemGetInfo=784 MB
Actual Output
- Initial: amd-smi=283 MB, hipMemGetInfo=634 MB
- After 2GB alloc: amd-smi=284 MB (Δ=1), hipMemGetInfo=2692 MB (Δ=2058)
- After free: amd-smi=284 MB, hipMemGetInfo=784 MB
amd-smi shows Δ=1 MB, while hipMemGetInfo correctly shows Δ=2058 MB.
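The discrepancy can also be expressed as a simple check, which may be useful for turning the reproduction into a regression test. The sketch below is a hypothetical helper (`tracks_allocation` and the 10% tolerance are assumptions, not part of amd-smi or PyTorch) applied to the numbers reported above.

```python
def tracks_allocation(reported_delta_mb: int,
                      reference_delta_mb: int,
                      tolerance: float = 0.1) -> bool:
    """Return True if the reported delta is within `tolerance` (as a fraction)
    of the reference delta. The 10% default is an arbitrary assumption."""
    if reference_delta_mb == 0:
        return reported_delta_mb == 0
    diff = abs(reported_delta_mb - reference_delta_mb)
    return diff <= tolerance * abs(reference_delta_mb)


# Deltas taken from the outputs listed in this report.
hip_delta = 2692 - 634          # hipMemGetInfo: 2058 MB
actual_amd_delta = 284 - 283    # amd-smi, actual: 1 MB
expected_amd_delta = 2348 - 300 # amd-smi, expected: 2048 MB

actual_ok = tracks_allocation(actual_amd_delta, hip_delta)
expected_ok = tracks_allocation(expected_amd_delta, hip_delta)
```

With these values, the actual amd-smi delta fails the check while the expected delta passes it.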
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response