[Issue]: amd-smi USED_VRAM does not reflect real-time memory allocations/deallocations #175

@zyzshishui

Problem Description

amd-smi metric --mem reports a USED_VRAM value that does not update when memory is allocated or freed; it stays essentially constant regardless of GPU memory operations. In contrast, hipMemGetInfo() reports real-time memory usage correctly.

Impact

  • On ROCm, torch.cuda.device_memory_used() relies on amdsmi_get_gpu_vram_usage() and therefore returns inaccurate data.
  • Any other memory-monitoring tool that uses amd-smi to track real-time GPU memory usage is affected in the same way.
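To separate a library-level problem from a CLI formatting problem, the same counter can be read through the amdsmi Python bindings instead of the amd-smi command. This is a minimal sketch, not a confirmed reproduction: it assumes the `amdsmi` package shipped with ROCm is importable and that `amdsmi_get_gpu_vram_usage()` returns a dict with a `vram_used` entry in MB. The helper name `read_vram_used_mb` is hypothetical, and it returns `None` when the bindings or the GPU are unavailable.

```python
def read_vram_used_mb(gpu_index=0):
    """Return used VRAM in MB from the amdsmi bindings, or None if unavailable."""
    try:
        import amdsmi  # ROCm Python bindings; absent on hosts without ROCm
        amdsmi.amdsmi_init()
    except Exception:
        return None
    try:
        handles = amdsmi.amdsmi_get_processor_handles()
        if gpu_index >= len(handles):
            return None
        # Assumed return shape: {'vram_total': <MB>, 'vram_used': <MB>}
        usage = amdsmi.amdsmi_get_gpu_vram_usage(handles[gpu_index])
        return int(usage["vram_used"])
    except Exception:
        return None
    finally:
        try:
            amdsmi.amdsmi_shut_down()
        except Exception:
            pass
```

Calling `read_vram_used_mb(0)` before and after an allocation would show whether the library value moves independently of the CLI output.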

Operating System

Linux (kernel 6.8.0)

CPU

AMD EPYC 9575F 64-Core Processor

GPU

AMD Instinct MI355X (gfx950)

ROCm Version

7.0.0

ROCm Component

amdsmi

Steps to Reproduce

import subprocess
import torch

def get_amd_smi_vram_mb():
    """Return USED_VRAM (MB) for GPU 0 as printed by amd-smi, or -1 if not found."""
    result = subprocess.run(['amd-smi', 'metric', '--gpu', '0', '--mem'],
                            capture_output=True, text=True)
    for line in result.stdout.split('\n'):
        # Take the USED_VRAM line and ignore any *_VISIBLE_* fields
        if 'USED_VRAM' in line and 'VISIBLE' not in line:
            return int(line.split(':')[1].strip().replace(' MB', ''))
    return -1

def get_hip_mem_used_mb():
    free, total = torch.cuda.mem_get_info(0)
    return (total - free) // (1024 * 1024)

# Initial state
amd_initial = get_amd_smi_vram_mb()
hip_initial = get_hip_mem_used_mb()
print(f"Initial:  amd-smi={amd_initial} MB, hipMemGetInfo={hip_initial} MB")

# Allocate 2 GB (2,000,000,000 bytes, about 1907 MiB)
tensor = torch.zeros(2_000_000_000, dtype=torch.uint8, device='cuda:0')
torch.cuda.synchronize()
amd_after = get_amd_smi_vram_mb()
hip_after = get_hip_mem_used_mb()
print(f"After 2GB alloc: amd-smi={amd_after} MB (Δ={amd_after-amd_initial}), hipMemGetInfo={hip_after} MB (Δ={hip_after-hip_initial})")

# Free
del tensor
torch.cuda.empty_cache()
amd_free = get_amd_smi_vram_mb()
hip_free = get_hip_mem_used_mb()
print(f"After free: amd-smi={amd_free} MB, hipMemGetInfo={hip_free} MB")
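The parsing step in the script above can be checked without a GPU by running it against captured text. The sketch below is an illustration only: `SAMPLE_OUTPUT` is a hypothetical excerpt whose layout is inferred from what the script's parser expects (a `USED_VRAM: <n> MB` line alongside `*_VISIBLE_*` fields), its values other than 283 are made up, and `parse_used_vram_mb` is a hypothetical helper name.

```python
import re

# Hypothetical excerpt modeled on the fields the repro script parses;
# values are illustrative and were not captured from real hardware.
SAMPLE_OUTPUT = """\
GPU: 0
    MEM_USAGE:
        TOTAL_VRAM: 1000 MB
        USED_VRAM: 283 MB
        FREE_VRAM: 717 MB
        TOTAL_VISIBLE_VRAM: 1000 MB
        USED_VISIBLE_VRAM: 283 MB
        FREE_VISIBLE_VRAM: 717 MB
"""

# Anchor on the exact key so USED_VISIBLE_VRAM can never match.
_USED_VRAM_RE = re.compile(r"^[ \t]*USED_VRAM:[ \t]*(\d+)[ \t]*MB[ \t]*$", re.MULTILINE)


def parse_used_vram_mb(text):
    """Return the USED_VRAM value in MB, or None if the field is missing."""
    match = _USED_VRAM_RE.search(text)
    return int(match.group(1)) if match else None


print(parse_used_vram_mb(SAMPLE_OUTPUT))
```

Returning `None` instead of -1 keeps a parse failure from being mistaken for a reading when deltas are computed.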

Expected Output

  • Initial: amd-smi=300 MB, hipMemGetInfo=634 MB
  • After 2GB alloc: amd-smi=2348 MB (Δ=2048), hipMemGetInfo=2692 MB (Δ=2058)
  • After free: amd-smi=300 MB, hipMemGetInfo=784 MB

Actual Output

  • Initial: amd-smi=283 MB, hipMemGetInfo=634 MB
  • After 2GB alloc: amd-smi=284 MB (Δ=1), hipMemGetInfo=2692 MB (Δ=2058)
  • After free: amd-smi=284 MB, hipMemGetInfo=784 MB

amd-smi shows Δ=1 MB, while hipMemGetInfo correctly shows Δ=2058 MB.
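The comparison can be turned into a pass/fail check. The sketch below plugs in the readings reported above and asks whether each counter grew by at least most of the allocated size; the helper name `tracks_allocation` and the 90% threshold are assumptions chosen for illustration.

```python
MIB = 1024 * 1024


def tracks_allocation(before_mb, after_mb, allocated_bytes, min_fraction=0.9):
    """True if the counter grew by at least min_fraction of the allocation."""
    expected_mb = allocated_bytes / MIB
    return (after_mb - before_mb) >= min_fraction * expected_mb


ALLOCATED_BYTES = 2_000_000_000  # size of the tensor in the repro script

# Readings taken from the "Actual Output" section of this issue.
amd_smi_ok = tracks_allocation(283, 284, ALLOCATED_BYTES)
hip_ok = tracks_allocation(634, 2692, ALLOCATED_BYTES)

print(f"amd-smi tracks allocation: {amd_smi_ok}")
print(f"hipMemGetInfo tracks allocation: {hip_ok}")
```

With the reported numbers, the amd-smi check fails and the hipMemGetInfo check passes, which is the discrepancy this issue describes.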

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
