@Soddentrough

ai-toolkit runs on systems with AMD GPUs, but the dashboard displays an error about 'nvidia-smi' when it does.

This patch removes the hard-coded dependency on 'nvidia-smi', allowing ai-toolkit to operate with either 'nvidia-smi' or 'rocm-smi'. It checks for 'nvidia-smi' first and then for 'rocm-smi'. This order may cause an issue if both are installed, but it solves a need today.
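
The detection order described above can be sketched roughly as follows. This is a minimal Python illustration, not the code from the patch (the ai-toolkit UI itself is a Node/Next.js app). The names `SMI_TOOLS` and `pick_smi_tool` are hypothetical, and the `which` parameter exists only so the lookup can be exercised without either tool installed.

```python
import shutil
from typing import Callable, Optional

# Candidate tools in the order the description gives: NVIDIA first, then AMD.
SMI_TOOLS = ("nvidia-smi", "rocm-smi")


def pick_smi_tool(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first SMI tool found on PATH, or None."""
    for tool in SMI_TOOLS:
        path = which(tool)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(pick_smi_tool() or "no GPU management tool found")
```

Because the first match wins, a machine with both tools installed always reports the NVIDIA one, which is the caveat noted in the description.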

@dkspwndj

dkspwndj commented Dec 7, 2025

Thank you for proceeding with the modification! But it's not running properly. I look forward to seeing good results in the future!

@Soddentrough
Author


I think this might be a path issue (where is your rocm-smi installed?). I will update the logic to check for rocm-smi in this order:

  1. which rocm-smi
  2. Check for $ROCM_PATH/bin/rocm-smi
  3. Check for /usr/bin/rocm-smi
  4. Check for /opt/rocm/bin/rocm-smi
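
The lookup order listed above could look something like this in Python. It is a sketch under assumptions, not the PR's implementation: `find_rocm_smi` is a hypothetical name, and the `env`, `which`, and `is_executable` parameters are only there so the order can be checked without touching the real filesystem.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def _is_executable(path: str) -> bool:
    """True if path is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_rocm_smi(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
    is_executable: Callable[[str], bool] = _is_executable,
) -> Optional[str]:
    """Locate rocm-smi using the order given in the comment above."""
    # 1. Whatever is on PATH (the equivalent of `which rocm-smi`).
    found = which("rocm-smi")
    if found:
        return found

    candidates = []
    # 2. $ROCM_PATH/bin/rocm-smi, only if ROCM_PATH is set.
    rocm_path = env.get("ROCM_PATH")
    if rocm_path:
        candidates.append(os.path.join(rocm_path, "bin", "rocm-smi"))
    # 3. and 4. The two fixed locations.
    candidates += ["/usr/bin/rocm-smi", "/opt/rocm/bin/rocm-smi"]

    for candidate in candidates:
        if is_executable(candidate):
            return candidate
    return None
```

Calling `find_rocm_smi()` with no arguments uses the real environment and filesystem and returns `None` when rocm-smi is not found in any of the four places.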

@Soddentrough
Author


There was a parsing issue in the handling of rocm-smi's JSON output, which created a phantom device.
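
The comment does not say what caused the phantom device. One plausible cause, shown here purely as an assumption, is treating every top-level key of the JSON object as a GPU when some keys (for example a `system` entry) are not devices. The sketch below keeps only keys shaped like `card0`, `card1`, and so on. The function name `parse_devices`, the sample payload, and its field names are all made up for illustration.

```python
import json
import re
from typing import Dict

# Assumption: real devices appear under keys such as "card0", "card1", ...
CARD_KEY = re.compile(r"^card\d+$")


def parse_devices(raw: str) -> Dict[str, dict]:
    """Return only the per-card entries from an SMI-style JSON object."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return {}
    return {
        key: value
        for key, value in data.items()
        if CARD_KEY.match(key) and isinstance(value, dict)
    }


# Hypothetical payload: one real card plus a non-device entry.
sample = '{"card0": {"example_field": "12"}, "system": {"example_field": "x"}}'
devices = parse_devices(sample)
print(sorted(devices))  # only the "card0" entry survives
```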

@dkspwndj

dkspwndj commented Dec 8, 2025

Forgive me. I had only set up ROCm as far as I needed for ComfyUI with ZLUDA, and I'm now unsure whether ROCm was installed properly. I'll make a separate PyTorch 3.12 folder, install ROCm there, and try it out.

@dkspwndj

dkspwndj commented Dec 8, 2025

I have now tested in a venv with ROCm 7.1.1 installed, but it still does not work well.

@Soddentrough
Author

Soddentrough commented Dec 8, 2025

I'm sorry, I didn't notice you're testing on a Windows system. rocm-smi is only available on Linux or WSL. It might be possible to use hipinfo.exe on Windows to enumerate the devices, but I don't think it has dynamic performance statistics for power/utilization/mem, so stats would show "0".

I don't currently have a way of testing this, though. For Windows, using "Get-Counter" for dynamic performance counters could be the way to go.

@Soddentrough
Author


This now uses amd-smi by default, with a fallback to rocm-smi. Where amd-smi doesn't fully support a GPU (e.g. a Strix iGPU), we use the sysfs hwmon metrics instead. This also allows us to show the "VRAM" and "GTT" (shared memory) used by an APU.
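
To illustrate how VRAM and GTT usage can be read without any SMI tool: on Linux the amdgpu driver exposes memory counters, in bytes, as `mem_info_*` files in the device's sysfs directory (for example /sys/class/drm/card0/device). The Python sketch below reads them. Whether the PR reads these exact files is an assumption, and `read_amdgpu_memory` and `MEM_FILES` are hypothetical names.

```python
from pathlib import Path
from typing import Dict, Optional

# amdgpu sysfs counters, in bytes, mapped to the labels used in this sketch.
MEM_FILES = {
    "vram_used": "mem_info_vram_used",
    "vram_total": "mem_info_vram_total",
    "gtt_used": "mem_info_gtt_used",
    "gtt_total": "mem_info_gtt_total",
}


def read_amdgpu_memory(device_dir: Path) -> Dict[str, Optional[int]]:
    """Read VRAM/GTT counters from a device directory; None where unreadable."""
    stats: Dict[str, Optional[int]] = {}
    for label, filename in MEM_FILES.items():
        try:
            stats[label] = int((device_dir / filename).read_text().strip())
        except (OSError, ValueError):
            # Missing file, no permission, or unexpected content.
            stats[label] = None
    return stats


if __name__ == "__main__":
    # The card index is an assumption; it differs between systems.
    print(read_amdgpu_memory(Path("/sys/class/drm/card0/device")))
```

Returning `None` for a missing counter, rather than 0, lets a dashboard distinguish "not reported" from "nothing in use".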

@mickabrig7

Hello, and thanks for your PR!
Unfortunately, it doesn't seem to be working on my system running a 9070 XT (RDNA 4, gfx1201):

> ai-toolkit-ui@0.1.0 start
> concurrently --restart-tries -1 --restart-after 1000 -n WORKER,UI "node dist/cron/worker.js" "next start --port 8675"

[WORKER] TOOLKIT_ROOT: /home/mickabrig7/ai-toolkit
[WORKER] Cron worker started with interval: 1000 ms
[UI]    ▲ Next.js 15.1.7
[UI]    - Local:        http://localhost:8675
[UI]    - Network:      http://192.168.0.100:8675
[UI]
[UI]  ✓ Starting...
[UI]  ✓ Ready in 242ms
[UI] (node:8719) Warning: `--localstorage-file` was provided without a valid path
[UI] (Use `node --trace-warnings ...` to show where the warning was created)
[UI] state { loading: true, gpuData: null, error: null, lastUpdated: null }

I'm running CachyOS with the rocm-smi-lib and amdsmi packages. PyTorch is installed in my venv using the following command:

uv pip install --upgrade --index-url https://download.pytorch.org/whl/nightly/rocm7.1 --pre torch torchaudio torchvision

Here's some more information about my setup:

 uname -srv
Linux 6.18.1-2-cachyos #1 SMP PREEMPT_DYNAMIC Sat, 13 Dec 2025 20:30:10 +0000

 which amd-smi
/opt/rocm/bin/amd-smi

 which rocm-smi
/opt/rocm/bin/rocm-smi

@Soddentrough
Author

The error you are getting should no longer exist, as I replaced it with a vendor-neutral "GPUs not detected" style message.

Can you run:

$ git rev-parse HEAD

@mickabrig7

Ugh, sorry, a bit shameful, but I naively cloned your fork without checking the branch and was just on the default main. Never mind 🫠

@Soddentrough
Author

As long as I'm not the only one making those sorts of mistakes! :D

@rncz

rncz commented Dec 27, 2025

Works for me now, thanks a lot!

Training Z Image Turbo right now at around 3.5 s/it. My setup: AMD RX 7700 XT (12 GB VRAM) on Ubuntu 24.04 LTS.

What I did:

  • Installed ROCm through the AMD docs, so that I have a symlink at /opt/rocm
  • amd-smi and rocminfo worked fine on the command line, but only when run as root; my default user did not have access to amd-smi.
  • I compiled the bitsandbytes package from the AMD docs
