
Conversation

@danielvegamyhre (Contributor) commented Oct 17, 2025

Stacked PRs:


[mxfp8 moe training] bench and profile mxfp8 a2a fwd and bwd separately

A single-node benchmark on 4xB200 over NVLink shows mxfp8 a2a perf is flat or slightly slower than bf16 in this setting.

BF16 forward:
(screenshot)

MXFP8 forward:
(screenshot)

TL;DR: the fp8 data transfer itself is ~1.72x faster than the bf16 transfer, but the extra overhead of the e8m0 scale transfer, quantization, and dequantization adds up, leaving overall perf flat or slightly slower.

input_shape         num_splits    fwd_bf16_ms    fwd_mxfp8_ms    bwd_bf16_ms    bwd_mxfp8_ms
----------------  ------------  -------------  --------------  -------------  --------------
(1, 8192, 5120)              4       0.269697        0.479122       0.695028        0.993789
(2, 8192, 5120)              4       0.347715        0.468697       0.791324        0.872646
(4, 8192, 5120)              4       0.593996        0.585684       1.28176         1.24674
(8, 8192, 5120)              4       1.53808         1.03233        2.38809         2.3224
(16, 8192, 5120)             4       1.77031         1.8789         4.36899         4.46669
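
For context, below is a minimal sketch of the round trip these rows are timing; it is not the torchao kernel and makes several simplifying assumptions (equal splits across ranks, 32 elements per e8m0 scale, e4m3 payload sent as raw bytes, element count divisible by 32). It makes the scale-transfer and quant/dequant overhead from the TL;DR explicit.

```python
# Hedged sketch of an mxfp8 all-to-all forward; NOT the torchao implementation.
# Assumptions: equal splits, 32 elements per e8m0 scale, e4m3 payload.
import torch
import torch.distributed as dist

BLOCK = 32  # elements sharing one block scale (assumption for illustration)

def mxfp8_all_to_all_fwd(x_bf16: torch.Tensor, group=None) -> torch.Tensor:
    # 1) Quantize: one power-of-two (e8m0-style) scale per 32-element block.
    #    Real kernels pick the scale to target the e4m3 max; this is a crude version.
    blocks = x_bf16.reshape(-1, BLOCK).float()
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=2**-126)
    exp = torch.floor(torch.log2(amax))                     # power-of-two block scale
    scales_u8 = (exp + 127).clamp(0, 255).to(torch.uint8)   # e8m0 = biased exponent byte
    payload_fp8 = (blocks / torch.exp2(exp)).to(torch.float8_e4m3fn)

    # 2) Transfer: fp8 payload (viewed as raw bytes) plus a second, smaller
    #    collective for the scales -- the "e8m0 scale transfer" overhead above.
    out_payload = torch.empty_like(payload_fp8)
    dist.all_to_all_single(out_payload.view(torch.uint8),
                           payload_fp8.view(torch.uint8), group=group)
    out_scales = torch.empty_like(scales_u8)
    dist.all_to_all_single(out_scales, scales_u8, group=group)

    # 3) Dequantize back to bf16 for the downstream token shuffle / grouped gemm;
    #    avoiding this step is what the "Next steps" below aim for.
    out_exp = out_scales.float() - 127
    return (out_payload.float() * torch.exp2(out_exp)).to(torch.bfloat16).reshape(x_bf16.shape)
```

The fp8 payload is half the bytes of the bf16 tensor, but the extra scales collective plus the quantize/dequantize kernels on both sides are what keep the end-to-end numbers in the table flat.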

Next steps

  • If we avoid the dequant and stay in mxfp8 through the token shuffle and grouped gemm, we can probably get a net speedup. I discussed this with @tianyu-l at PTC and we have some early ideas on how this might work with torchtitan without being overly intrusive. In the meantime, we need (item 2 is sketched below):
    1. an mxfp8 token shuffle kernel
    2. an update to to_mxfp8_then_scaled_grouped_mm so it can optionally accept scales (pre-quantized tensors)
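
As a rough illustration of item 2, here is a hypothetical signature sketch; the real to_mxfp8_then_scaled_grouped_mm lives in torchao and may look different, and every other name below is invented. The idea is that callers already holding mxfp8 data plus e8m0 scales (e.g. straight out of the a2a / token shuffle) can skip the internal quantization.

```python
# Hypothetical sketch only; names other than to_mxfp8_then_scaled_grouped_mm are invented.
from typing import Optional
import torch

def to_mxfp8_then_scaled_grouped_mm(
    a: torch.Tensor,                          # bf16 tokens, or fp8 data when a_scales is given
    b: torch.Tensor,                          # bf16 expert weights, or fp8 data when b_scales is given
    offs: torch.Tensor,                       # per-expert group offsets for the grouped gemm
    a_scales: Optional[torch.Tensor] = None,  # e8m0 block scales for pre-quantized `a`
    b_scales: Optional[torch.Tensor] = None,  # e8m0 block scales for pre-quantized `b`
) -> torch.Tensor:
    # Only quantize inputs that arrive in high precision; pre-quantized tensors pass through.
    if a_scales is None:
        a, a_scales = _quantize_to_mxfp8(a)   # hypothetical helper: existing quant path
    if b_scales is None:
        b, b_scales = _quantize_to_mxfp8(b)
    return _mxfp8_scaled_grouped_mm(a, a_scales, b, b_scales, offs)  # hypothetical low-level gemm
```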

stack-info: PR: #3203, branch: danielvegamyhre/stack/81
@pytorch-bot bot commented Oct 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3203

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@danielvegamyhre merged commit dffb3a0 into main Oct 28, 2025
18 checks passed

Labels

CLA Signed, moe, mx, topic: not user facing
