FP4/MXFP4 inference acceleration on GPU: design questions for GSoC 2026 Project 14 #3972
Unanswered · CodersAcademy006 asked this question in Ideas
Replies: 1 comment
@Saad-Mallebhari, please share your opinions on this as well, and correct me if I am wrong. Thank you.
Hi NNCF team,
I'm working on a proposal for the Triton kernel acceleration project (GSoC 2026)
and want to validate some design assumptions before writing the full proposal.
What I've found in the current codebase:
The docs state that NF4, MXFP4, MXFP8_E4M3 are "experimental on GPU and NPU"
and models compressed to these formats "should not be faster than 8-bit
integer." This accurately describes the current state: there's no GPU-optimized
dequantization kernel for these formats. The dequant path falls back to
unoptimized PyTorch ops at inference time.
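For concreteness, the eager fallback amounts to a LUT gather plus a per-group scale multiply in plain PyTorch ops. A minimal sketch of what I mean (the E2M1 value table follows the OCP MX layout; `dequant_reference` and its signature are my own illustration, not an NNCF API):

```python
import torch

# E2M1 values for 4-bit codes 0..15 (bit 3 is the sign bit).
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_reference(codes: torch.Tensor, scales: torch.Tensor,
                      group_size: int) -> torch.Tensor:
    # codes: int tensor of 4-bit values in [0, 15]; scales: one per group.
    vals = E2M1_LUT[codes.long()]                          # LUT gather
    vals = vals.view(-1, group_size) * scales.view(-1, 1)  # per-group scale
    return vals.view(codes.shape)
```

Each of these ops launches its own kernel and materializes intermediates, which is why the fallback path is slow compared to a fused Triton kernel.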
Three design questions before I write the proposal:
Format priority: Should the Triton kernel target MXFP4 (E2M1 + E8M0
group scale, the OpenVINO IR native format) or FP4 (E2M1 + FP16 group
scale, the PyTorch-native variant) first? They have different dequant
arithmetic despite the same weight format.
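To make that arithmetic difference concrete, here is a pure-Python sketch of the two scale decodings, assuming the OCP MX bit layouts (the function names are hypothetical, not NNCF identifiers):

```python
# Magnitudes for E2M1 codes 0..7; sign is bit 3 (2 exponent bits, 1 mantissa bit).
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code to a float."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_LUT[code & 0b0111]

def dequant_mxfp4(code: int, scale_byte: int) -> float:
    """MXFP4: the group scale is E8M0, a pure power of two (no mantissa)."""
    return decode_e2m1(code) * 2.0 ** (scale_byte - 127)

def dequant_fp4(code: int, scale_fp16: float) -> float:
    """FP4 variant: the group scale is an ordinary FP16 value."""
    return decode_e2m1(code) * scale_fp16
```

The practical consequence inside a kernel: the E8M0 scale can be applied as an exponent-field add, while the FP16 scale requires a genuine multiply.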
torch.compile registration: For torch.compile compatibility, should
the Triton kernel be registered via torch.library.custom_op with a
fake/abstract implementation, or is there a preferred NNCF pattern
for registering custom ops that I should follow?
Benchmark target: For measuring speedup, is the goal to compare
against uncompressed FP16 inference, or against the current
CompressWeightsMode.INT4_SYM path on the same hardware?

I have a minimal FP4 dequant kernel prototype I can share as a starting
point once these design questions are resolved.
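Whichever baseline is chosen, I would measure roughly along these lines (a CPU-side sketch only; on GPU the real harness would need `torch.cuda.synchronize()` or CUDA events rather than wall-clock timing, and the function names below are placeholders):

```python
import time

def bench(fn, iters=100, warmup=10):
    """Median wall-clock seconds per call of fn (CPU-side placeholder)."""
    for _ in range(warmup):
        fn()  # warm caches / trigger any lazy compilation
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]

# speedup = bench(fp16_baseline_forward) / bench(mxfp4_kernel_forward)
```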