Open
Description
Hi, I'm experimenting with IRON on an AI Max+ 395 and noticed that the example gemm pytest with 8 columns reaches only ~700 GFLOP/s, as shown in the output below.
I am using the devel branch.
operators/gemm/test.py::test_gemm[iter0-gemm_2048x2048x2048_64x64x64_8_cols_0_bcolmaj_0_ccolmaj_0_0]
Latency (us): 24679.3
Effective Bandwidth: 1.019712e+00 GB/s
Throughput: 6.961237e+02 GFLOP/s
PASSED
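As a sanity check on the reported metrics, the sketch below recomputes them from the latency alone. It assumes the test counts 2*M*K*N floating-point operations for the GEMM and that the bandwidth figure covers one pass over A, B, and C at 2 bytes per element (BF16); both are assumptions inferred from the numbers, not confirmed from the test source. The function names are hypothetical.

```python
# Recompute the reported GEMM metrics from the measured latency.
# Assumptions (not taken from the IRON test source):
#   - FLOP count is 2 * M * K * N (one multiply and one add per MAC)
#   - bytes moved = A + B + C, each element 2 bytes (BF16)

def gemm_gflops(m: int, k: int, n: int, latency_us: float) -> float:
    """Throughput in GFLOP/s for an m x k by k x n matrix multiply."""
    flop = 2 * m * k * n
    return flop / (latency_us * 1e-6) / 1e9


def gemm_bandwidth_gbs(m: int, k: int, n: int, latency_us: float,
                       bytes_per_elem: int = 2) -> float:
    """Effective bandwidth in GB/s, counting A, B, and C once each."""
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return bytes_moved / (latency_us * 1e-6) / 1e9


if __name__ == "__main__":
    latency_us = 24679.3  # "Latency (us)" from the log above
    print(f"Throughput: {gemm_gflops(2048, 2048, 2048, latency_us):.4f} GFLOP/s")
    print(f"Bandwidth:  {gemm_bandwidth_gbs(2048, 2048, 2048, latency_us):.6f} GB/s")
```

Under these assumptions both values agree with the log to within the rounding of the printed latency, which suggests the ~700 GFLOP/s figure follows directly from the ~24.7 ms latency rather than from a reporting error.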
Here is the output of xrt-smi validate on my machine:
Validate Device : [0000:c6:00.1]
Platform : NPU Strix Halo
Power Mode : Default
-------------------------------------------------------------------------------
Test 1 [0000:c6:00.1] : gemm
Details : TOPS: 51.0
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:c6:00.1] : latency
Details : Average latency: 52.0 us
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:c6:00.1] : throughput
Details : Average throughput: 78771.0 op/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
According to the AIE2p core spec, BF16 throughput is half of INT8 throughput, so I expected a result on the order of 50/2 = 25 TOPS.
The measured ~0.7 TFLOP/s is far below that, so I wonder whether this is due to a configuration problem, a misunderstanding on my part, or something else?
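To put a number on the gap, here is a small sketch comparing the measured throughput with the expectation. It takes the 51.0 TOPS from the xrt-smi gemm test as the INT8 peak and halves it for BF16; both steps are the assumptions made in this post, not verified hardware figures.

```python
# Quantify the gap between measured and expected BF16 throughput.
# Assumptions (from this post, not verified):
#   - the xrt-smi "gemm" result (51.0 TOPS) is an INT8 peak
#   - BF16 peak is half of the INT8 peak

def expected_bf16_tops(int8_tops: float) -> float:
    """Expected BF16 peak, assuming it is half the INT8 peak."""
    return int8_tops / 2.0


def utilization(measured_tflops: float, expected_tops: float) -> float:
    """Fraction of the expected peak that was actually achieved."""
    return measured_tflops / expected_tops


if __name__ == "__main__":
    measured = 696.1237 / 1000.0          # GFLOP/s from the pytest log, as TFLOP/s
    expected = expected_bf16_tops(51.0)   # 25.5 under the assumptions above
    frac = utilization(measured, expected)
    print(f"Expected BF16 peak: {expected:.1f} TOPS")
    print(f"Achieved fraction:  {frac:.2%}")
    print(f"Shortfall factor:   {1.0 / frac:.1f}x")
```

Under these assumptions the example reaches under 3% of the expected peak, a shortfall of well over an order of magnitude.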
Thanks!