fix(ui): enable rocm-smi support by correcting flags and parsing #580

Soddentrough · 2025-12-07T09:23:54Z

ai-toolkit runs on systems with AMD GPUs but displays an error about 'nvidia-smi' in the dashboard when doing so.

This patch removes the hard-coded dependency on 'nvidia-smi', allowing ai-toolkit to operate with either 'nvidia-smi' or 'rocm-smi'. It checks for 'nvidia-smi' first and then for 'rocm-smi'. This ordering may cause an issue if both are installed, but it solves a need today.
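For readers following along, here is a minimal sketch of that probe order. It is not the code from this PR: the `GpuTool` type, `detectGpuTool`, and the injected `isAvailable` callback are hypothetical names chosen for illustration.

```typescript
// Hypothetical names throughout; this is not the PR's actual code.
type GpuTool = "nvidia-smi" | "rocm-smi";

// Probe order described above: NVIDIA first, then AMD.
const PROBE_ORDER: GpuTool[] = ["nvidia-smi", "rocm-smi"];

// `isAvailable` is injected so the sketch does not depend on how a
// command is actually looked up on the host.
function detectGpuTool(isAvailable: (tool: GpuTool) => boolean): GpuTool | null {
  for (const tool of PROBE_ORDER) {
    if (isAvailable(tool)) return tool;
  }
  return null; // neither tool was found
}
```

With this ordering, a machine that has both tools installed always resolves to 'nvidia-smi', which is the caveat mentioned above.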

dkspwndj · 2025-12-07T11:17:40Z

Thank you for working on this change! It's not running properly yet, though. I look forward to seeing good results in the future!

Soddentrough · 2025-12-07T22:22:25Z

I think this might be a path issue (where is your rocm-smi installed?). I will update the logic to check for rocm-smi in this order:

1. Run which rocm-smi
2. Check for $ROCM_PATH/bin/rocm-smi
3. Check for /usr/bin/rocm-smi
4. Check for /opt/rocm/bin/rocm-smi
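
The lookup order above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `resolveRocmSmi`, the `Lookup` shape, and the injected `exists` callback are hypothetical, and the result of `which rocm-smi` is passed in rather than executed here.

```typescript
// Hypothetical helper; names and signature are illustrative only.
interface Lookup {
  whichResult: string | null;              // output of `which rocm-smi`, or null if not found
  env: Record<string, string | undefined>; // e.g. process.env in Node
  exists: (path: string) => boolean;       // e.g. fs.existsSync in Node
}

function resolveRocmSmi(lookup: Lookup): string | null {
  const { whichResult, env, exists } = lookup;

  // 1. Whatever `which rocm-smi` found on PATH.
  if (whichResult && exists(whichResult)) return whichResult;

  const candidates: string[] = [];
  // 2. $ROCM_PATH/bin/rocm-smi, only when ROCM_PATH is set.
  const rocmPath = env["ROCM_PATH"];
  if (rocmPath) candidates.push(`${rocmPath}/bin/rocm-smi`);
  // 3. and 4. The two fixed install locations.
  candidates.push("/usr/bin/rocm-smi", "/opt/rocm/bin/rocm-smi");

  for (const candidate of candidates) {
    if (exists(candidate)) return candidate;
  }
  return null; // rocm-smi was not found anywhere in the list
}
```

Injecting `exists` and `env` keeps the function easy to exercise without touching the real filesystem.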

Soddentrough · 2025-12-08T06:22:50Z

There was a parsing issue when handling the JSON output of rocm-smi that created a phantom device. The fix filters out the system object so that only real devices are listed.
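
A rough sketch of that kind of filter is shown below. It assumes the JSON has one top-level key per GPU (such as "card0") plus a "system" key that does not describe a device; `extractGpus` and `GpuEntry` are hypothetical names, not taken from the PR.

```typescript
// Assumed shape: one key per GPU ("card0", "card1", ...) plus a "system"
// key holding non-device information. Names here are hypothetical.
type RocmSmiJson = Record<string, Record<string, string>>;

interface GpuEntry {
  id: string;
  fields: Record<string, string>;
}

function extractGpus(raw: string): GpuEntry[] {
  const parsed = JSON.parse(raw) as RocmSmiJson;
  return Object.keys(parsed)
    .filter((key) => key !== "system") // drop the object that is not a GPU
    .map((id) => ({ id, fields: parsed[id] }));
}
```

Without the filter, the "system" entry would be counted as one more GPU, which matches the phantom device described above.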

dkspwndj · 2025-12-08T14:31:50Z

Forgive me. I had only set up ROCm to the extent needed for ComfyUI with ZLUDA, and only afterwards did I check whether ROCm was installed properly. Now I'll make a separate PyTorch 3.12 folder, install ROCm there, and try it out.

dkspwndj · 2025-12-08T14:51:41Z

I have now tested in a venv with ROCm 7.1.1 installed, but it still does not work well.

Soddentrough · 2025-12-08T20:32:24Z

I'm sorry, I didn't notice you're testing on a Windows system. rocm-smi is only available on Linux or WSL. It might be possible to use hipinfo.exe on Windows to enumerate the devices, but I don't think it provides dynamic performance statistics for power, utilization, or memory, so those stats would show "0".

I don't currently have a way of testing this, though. For Windows, using "Get-Counter" for dynamic performance counters could be the way to go.

Soddentrough · 2025-12-08T22:14:36Z

This now uses amd-smi by default, with a fallback to rocm-smi. Where amd-smi doesn't fully support a GPU (e.g. a Strix iGPU), we use the sysfs hwmon metrics. This also allows us to show the "VRAM" and "GTT" (shared memory) used by an APU.
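
To illustrate how separate VRAM and GTT figures could be read from sysfs, here is a small sketch. It assumes the amdgpu driver's `mem_info_vram_*` and `mem_info_gtt_*` files under a card's device directory, each holding a byte count. The comment above does not say which files the PR actually reads, so treat the paths and all names (`readApuMemory`, `ReadText`, and so on) as assumptions.

```typescript
// Assumption: files such as <cardDir>/mem_info_vram_used hold a byte count.
// The PR's real data source may differ; all names here are hypothetical.
interface MemoryUsage { usedMiB: number; totalMiB: number; }
interface ApuMemory { vram: MemoryUsage | null; gtt: MemoryUsage | null; }

// Returns the file's text, or null when it cannot be read
// (e.g. a wrapper around fs.readFileSync in Node).
type ReadText = (path: string) => string | null;

function readMiB(read: ReadText, path: string): number | null {
  const text = read(path);
  if (text === null || text.trim() === "") return null;
  const bytes = Number(text.trim());
  return isFinite(bytes) ? Math.round(bytes / (1024 * 1024)) : null;
}

function readPool(read: ReadText, cardDir: string, pool: "vram" | "gtt"): MemoryUsage | null {
  const usedMiB = readMiB(read, `${cardDir}/mem_info_${pool}_used`);
  const totalMiB = readMiB(read, `${cardDir}/mem_info_${pool}_total`);
  return usedMiB === null || totalMiB === null ? null : { usedMiB, totalMiB };
}

// Reports dedicated VRAM and shared GTT memory as two separate pools.
function readApuMemory(read: ReadText, cardDir: string): ApuMemory {
  return {
    vram: readPool(read, cardDir, "vram"),
    gtt: readPool(read, cardDir, "gtt"),
  };
}
```

A pool comes back as null when either of its files is missing or unparsable, so the caller can decide whether to hide that row or show a placeholder.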

mickabrig7 · 2025-12-19T00:50:16Z

Hello, and thanks for your PR!
Unfortunately, it doesn't seem to be working on my system running a 9070 XT (RDNA 4, gfx1201):

> ai-toolkit-ui@0.1.0 start
> concurrently --restart-tries -1 --restart-after 1000 -n WORKER,UI "node dist/cron/worker.js" "next start --port 8675"

[WORKER] TOOLKIT_ROOT: /home/mickabrig7/ai-toolkit
[WORKER] Cron worker started with interval: 1000 ms
[UI]    ▲ Next.js 15.1.7
[UI]    - Local:        http://localhost:8675
[UI]    - Network:      http://192.168.0.100:8675
[UI]
[UI]  ✓ Starting...
[UI]  ✓ Ready in 242ms
[UI] (node:8719) Warning: `--localstorage-file` was provided without a valid path
[UI] (Use `node --trace-warnings ...` to show where the warning was created)
[UI] state { loading: true, gpuData: null, error: null, lastUpdated: null }

I'm running CachyOS with the rocm-smi-lib and amdsmi packages. PyTorch is installed in my venv using the following command:

uv pip install --upgrade --index-url https://download.pytorch.org/whl/nightly/rocm7.1 --pre torch torchaudio torchvision

Here's some more information about my setup:

$ uname -srv
Linux 6.18.1-2-cachyos #1 SMP PREEMPT_DYNAMIC Sat, 13 Dec 2025 20:30:10 +0000

$ which amd-smi
/opt/rocm/bin/amd-smi

$ which rocm-smi
/opt/rocm/bin/rocm-smi

Soddentrough · 2025-12-19T06:26:49Z

The error you are getting should no longer exist, as I replaced it with a vendor-neutral "GPUs not detected" style message.

Can you run:

$ git rev-parse HEAD

mickabrig7 · 2025-12-19T06:37:25Z

Ugh, sorry. It's a bit shameful, but I naively cloned your fork without checking the branch, so I was just on the default main. Never mind 🫠

Soddentrough · 2025-12-19T06:48:03Z

As long as I'm not the only one making those sorts of mistakes! :D

rncz · 2025-12-27T16:48:47Z

Works for me now, thanks a lot!

I'm training Z Image Turbo right now at around 3.5 sec/it. My setup is an AMD RX 7700 XT (12 GB VRAM) on Ubuntu 24.04 LTS.

What I did:

- Installed ROCm following the AMD docs, so that I have a symlink at /opt/rocm.
- amd-smi and rocminfo worked fine on the command line, but only when run as root; my default user did not have access to amd-smi.
- Compiled the bitsandbytes package following the AMD docs.

fix(ui): enable rocm-smi support by correcting flags and parsing

29cc2fc

Soddentrough added 2 commits December 8, 2025 09:24

Fix rocm-smi detection and improve error messages

ab97aea

Fix ghost device in rocm-smi by filtering system object

2e3e3ef

Add prelim hipinfo.exe support for Windows AMD GPUs

ead7a66

feature: Improve AMD GPU monitoring with amd-smi, sysfs fallback, and GTT support

86823d7
