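Take FP64 as an example: on the A100, peak tensor core FLOPS are 2x the CUDA core FLOPS. However, AmgT uses only 1/8 of the tensor cores' capacity, so its effective peak is 2 x 1/8 = 1/4 of the CUDA core peak, which makes it slower than running on CUDA cores.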
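To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes the A100 datasheet peaks (9.7 TFLOPS FP64 on CUDA cores, 19.5 TFLOPS FP64 on tensor cores) and takes the 1/8 utilization figure from the claim above as given; the function and constant names are only illustrative. It compares peak rates only and ignores memory bandwidth and other overheads.

```python
# Peak FP64 throughput of the A100 in TFLOPS (NVIDIA datasheet figures).
CUDA_CORE_FP64_TFLOPS = 9.7
TENSOR_CORE_FP64_TFLOPS = 19.5

# Fraction of tensor core capacity doing useful work.
# Assumption: taken directly from the 1/8 figure claimed above, not measured.
TENSOR_CORE_UTILIZATION = 1 / 8


def effective_tflops(peak_tflops: float, utilization: float) -> float:
    """Scale a peak rate by the fraction of the unit that does useful work."""
    return peak_tflops * utilization


effective_tc = effective_tflops(TENSOR_CORE_FP64_TFLOPS, TENSOR_CORE_UTILIZATION)
ratio_to_cuda = effective_tc / CUDA_CORE_FP64_TFLOPS

print(f"Effective tensor core FP64: {effective_tc:.2f} TFLOPS")
print(f"Relative to CUDA core peak: {ratio_to_cuda:.2f}x")
```

Under these assumptions the tensor core path tops out at roughly 2.4 TFLOPS, about a quarter of the CUDA core peak, so the 2x hardware advantage is more than cancelled by the 1/8 utilization.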