Commit e9c7bea

[moe training] update readme with links, cleanup (#3239)
1 parent 53b5efd commit e9c7bea

File tree: 1 file changed (+40, -35 lines)


torchao/prototype/moe_training/README.md

Lines changed: 40 additions & 35 deletions
@@ -2,32 +2,38 @@
 
 This prototype provides:
 
-1. Quantized building block for low precision MoE training: `_scaled_grouped_mm`. It is a differentiable drop-in replacement for `torch._grouped_mm` that dynamically quantizes inputs using the given recipe, performs a scaled grouped GEMM, then returns the results in original precision. See runnable [example](#torchao_scaled_grouped_mm-example-forward--backward-pass) of a forward and backward pass below.
+1. Quantized building block for low precision MoE training: [_quantize_then_scaled_grouped_mm](https://github.com/pytorch/ao/blob/53b5efdac921a38fd15e8d3ac8191c3927140287/torchao/prototype/moe_training/scaled_grouped_mm.py#L42). It is a differentiable drop-in replacement for `torch._grouped_mm` that dynamically quantizes inputs using the given recipe, performs a scaled grouped GEMM, then returns the results in original precision. See runnable [example](#torchao_scaled_grouped_mm-example-forward--backward-pass) of a forward and backward pass below.
     - Using MXFP8 on a B200 GPU, this provides:
-        - ~1.4x - 1.8x speedups over bfloat16 `torch._grouped_mm` for Llama4 17b 16e shapes (depending on the `M` dimension, i.e. batch_size * seq_len)
-        - ~1.15 - 1.3x speedups over bfloat16 `torch._grouped_mm` for DeepSeekV3 671b shapes (depending on the `M` dimension, i.e. batch_size * seq_len)
+        - **~1.4x - 1.8x speedups** over bfloat16 `torch._grouped_mm` for Llama4 Scout shapes
+        - **~1.15 - 1.3x speedups** over bfloat16 `torch._grouped_mm` for DeepSeekV3 671b shapes
+    - We also provide the following convenience functions for specific recipes:
+        - [_to_mxfp8_then_scaled_grouped_mm](https://github.com/pytorch/ao/blob/53b5efdac921a38fd15e8d3ac8191c3927140287/torchao/prototype/moe_training/scaled_grouped_mm.py#L677)
+        - [_to_fp8_rowwise_then_scaled_grouped_mm](https://github.com/pytorch/ao/blob/53b5efdac921a38fd15e8d3ac8191c3927140287/torchao/prototype/moe_training/scaled_grouped_mm.py#L678)
 
 
-2. [TorchTitan](https://github.com/pytorch/torchtitan/tree/main) integration of torchao's dynamically quantized `_scaled_grouped_mm`: pretrain DeepSeekV3/Llama4 with MXFP8 grouped GEMMs by adding the flag to your training command: `--model.converters="quantize.grouped_mm.mx" [--quantize.grouped_mm.mx.fqns="experts"]`
 
-3. `quantize_(...)` API support for model conversion: this swaps all `torch._grouped_mm` ops in your model definition to use torchao `_scaled_grouped_mm` under the hood (see [example](#model-conversion-api-example-end-to-end-training) below).
+2. [TorchTitan](https://github.com/pytorch/torchtitan/tree/main) integration: pretrain DeepSeekV3/Llama4 with MXFP8 grouped GEMMs by adding the flag to your training command: `--model.converters="quantize.grouped_mm.mx" --quantize.grouped_mm.mx.fqns="experts"`
+
+3. Model conversion API to swap all `torch._grouped_mm` ops in your model definition to use torchao `_quantize_then_scaled_grouped_mm` under the hood (see [example](#model-conversion-api-example-end-to-end-training) below).
 
 
 ## Table of Contents
 
 - [Examples](#examples)
-- [Performance Benchmarks](#performance-benchmarks-mxfp8)
 - [System Requirements](#system-requirements)
+- [Microbenchmarks](#microbenchmarks)
+- [Single MoE layer benchmarks](#benchmark-single-moe-layer-forward--backward-pass)
+- [E2E training benchmarks](#end-to-end-training-benchmark-with-torchtitan-llama4-scout-vs-bfloat16-baseline)
 - [Implementation Details for Developers](#implementation-details-for-developers)
 - [Limitations](#limitations)
 
 ## Examples
-#### torchao_scaled_grouped_mm example: forward + backward pass
+#### _quantize_then_scaled_grouped_mm usage
 ```python
 import torch
 from torch.nn import functional as F
 from torchao.prototype.moe_training import (
-    _quantize_then_scaled_grouped_mm as torchao_scaled_grouped_mm
+    _quantize_then_scaled_grouped_mm
 )
 from torchao.prototype.moe_training.conversion_utils import MoEScalingType
 from torchao.prototype.moe_training.utils import generate_jagged_offs
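Note: the hunks above and below only show the changed fragments of this usage example. For reference, a stitched-together, self-contained sketch of the same forward + backward flow follows; the tensor shapes and the `scaling_type=MoEScalingType.MXFP8` keyword are assumptions for illustration, and the full example in the README itself is authoritative.

```python
# Hedged sketch of the forward + backward flow shown in fragments in this diff.
# Shapes and the `scaling_type` keyword name are assumptions; see the README example.
import torch
from torchao.prototype.moe_training import _quantize_then_scaled_grouped_mm
from torchao.prototype.moe_training.conversion_utils import MoEScalingType
from torchao.prototype.moe_training.utils import generate_jagged_offs

num_groups, total_M, N, K = 8, 16384, 8192, 5120  # assumed Llama4-like shapes

# A holds the routed tokens (2D), B holds per-expert weights (3D).
A = torch.randn(total_M, K, dtype=torch.bfloat16, device="cuda", requires_grad=True)
B = torch.randn(num_groups, N, K, dtype=torch.bfloat16, device="cuda", requires_grad=True)

# Jagged group boundaries along the M dimension, one offset per expert group.
offs = generate_jagged_offs(num_groups, total_M, device="cuda")

# Forward: inputs are dynamically quantized, a scaled grouped GEMM runs,
# and the result comes back in the original bfloat16 precision.
out = _quantize_then_scaled_grouped_mm(
    A,
    B.transpose(-2, -1),
    offs=offs,
    scaling_type=MoEScalingType.MXFP8,  # assumed keyword name for the recipe
)

# Backward: gradients flow to both A and B through the differentiable op.
out.sum().backward()
```

Because the op is differentiable, it can be dropped into an existing MoE forward pass without changing the surrounding autograd code.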
@@ -42,7 +48,7 @@ B = torch.randn(num_groups, N, K, dtype=torch.bfloat16, device="cuda", requires_
 offs = generate_jagged_offs(num_groups, total_M, device="cuda")
 
 # Forward and backward example
-out = torchao_scaled_grouped_mm(
+out = _quantize_then_scaled_grouped_mm(
     A,
     B.transpose(-2, -1),
     offs=offs,
@@ -127,39 +133,18 @@ for step in range(10):
     # backward pass
     out_loss.backward()
     optimizer.step()
-
+    optimizer.zero_grad()
 ```
 
 ## System requirements
 - torchao 0.14+
 - For MXFP8 MoE training, CUDA 12.8+ and SM100+ GPU arch are required.
 - For FP8 rowwise MoE training, CUDA 12.4+ and SM89+ GPU arch are required.
 
-## Performance benchmarks: MXFP8
-
-
-### Single MoE layer forward + backward pass vs bfloat16 baseline
-
-| Model | total_M | N | K | bf16 time (ms) | mxfp8 time (ms) | speedup |
-|--------------|---------|------|------|---------------|-----------------|---------|
-| Llama4 16e | 131072 | 8192 | 5120 | 275.270 | 192.420 | 1.431x |
-| DeepSeekV3 | 131072 | 2048 | 7168 | 92.032 | 80.182 | 1.148x |
-
-To reproduce these benchmarks, on a B200 GPU machine, run the following commands:
-
-Llama4 17b 16e shapes:
-```bash
-CUDA_VISIBLE_DEVICES=6 python benchmarks/prototype/moe_training/bench_moe_layer.py --recipe mxfp8 --local_batch_size=16 --dim=5120 --hidden_dim=8192 --local_num_experts=8
-```
-
-DeepSeekV3 671b shapes:
-```bash
-CUDA_VISIBLE_DEVICES=6 python benchmarks/prototype/moe_training/bench_moe_layer.py --recipe mxfp8 --local_batch_size=16 --dim=7168 --hidden_dim=2048 --local_num_experts=8
-```
 
-### Individual bfloat16 torch._grouped_mm op vs torchao_scaled_grouped_mm
+## Microbenchmarks
 
-**MXFP8 with Llama4 17b 16e shapes** (with G=1-8 to simulate different degrees of expert parallelism)
+**MXFP8 with Llama4 17b 16e shapes** (with G from 1 to 8 to simulate different degrees of expert parallelism)
 
 | M,N,K,G | bf16_fwd_bwd_us | scaled_fwd_bwd_us | scaled_fwd_bwd_speedup |
 | ----------------------- | --------------: | ----------------: | ---------------------: |
@@ -168,7 +153,7 @@ CUDA_VISIBLE_DEVICES=6 python benchmarks/prototype/moe_training/bench_moe_layer.
 | (128000, 8192, 5120, 4) | 39189.20 | 23945.50 | 1.637x |
 | (128000, 8192, 5120, 8) | 37700.70 | 22170.60 | 1.700x |
 
-**MXFP8 with DeepSeekV3** (with G=-8 to simulate different degrees of expert parallelism)
+**MXFP8 with DeepSeekV3** (with G from 1 to 8 to simulate different degrees of expert parallelism)
 
 | M,N,K,G | bf16_fwd_bwd_us | scaled_fwd_bwd_us | scaled_fwd_bwd_speedup |
 | ----------------------- | --------------: | ----------------: | ---------------------: |
@@ -183,8 +168,28 @@ To reproduce this benchmark, on a B200 GPU machine, run the following command:
 - torchao: `0.14.0+gitc7b8e13da`
 - torch: `2.10.0a0+gitf6de195`
 
+## Benchmark: single MoE layer forward + backward pass
+
+| Model | total_M | N | K | bf16 time (ms) | mxfp8 time (ms) | speedup |
+|--------------|---------|------|------|---------------|-----------------|---------|
+| Llama4 16e | 131072 | 8192 | 5120 | 275.270 | 192.420 | 1.431x |
+| DeepSeekV3 | 131072 | 2048 | 7168 | 92.032 | 80.182 | 1.148x |
+
+To reproduce these benchmarks, on a B200 GPU machine, run the following commands:
+
+Llama4 17b 16e shapes:
+```bash
+CUDA_VISIBLE_DEVICES=6 python benchmarks/prototype/moe_training/bench_moe_layer.py --recipe mxfp8 --local_batch_size=16 --dim=5120 --hidden_dim=8192 --local_num_experts=8
+```
+
+DeepSeekV3 671b shapes:
+```bash
+CUDA_VISIBLE_DEVICES=6 python benchmarks/prototype/moe_training/bench_moe_layer.py --recipe mxfp8 --local_batch_size=16 --dim=7168 --hidden_dim=2048 --local_num_experts=8
+```
+
+
 
-#### End-to-end training: Llama4 16e MoE layer vs bfloat16 baseline with TorchTitan
+## End-to-end training benchmark with TorchTitan: Llama4 Scout vs bfloat16 baseline
 - Single node benchmarks with 4xB200
 - Llama4 16e default configs; FSDP=4, EP=4; AC=none; compile=True; seq_len=8192; local_bs=8
 - Reduced num layers from 48 -> 2 to avoid OOM in single node setting
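
Note on item 3 above (the model conversion API): the README's end-to-end conversion example is referenced but not shown in this diff. Below is a rough, hypothetical sketch of how such a swap could be wired up with `quantize_`, assuming a `MoETrainingConfig` class (and its `scaling_type` argument) in `conversion_utils`; those names are not confirmed by this diff, and the README's own example is authoritative.

```python
# Hypothetical sketch only: `MoETrainingConfig` and its `scaling_type` argument
# are assumed names; see the README's model conversion example for the real API.
import torch
from torch import nn
from torchao.quantization import quantize_
from torchao.prototype.moe_training.conversion_utils import (
    MoEScalingType,
    MoETrainingConfig,  # assumed config class name
)


class TinyExperts(nn.Module):
    """Toy stand-in for an MoE experts module that calls torch._grouped_mm."""

    def __init__(self, num_experts: int, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Parameter(
            torch.randn(num_experts, hidden_dim, dim, dtype=torch.bfloat16)
        )

    def forward(self, x: torch.Tensor, offs: torch.Tensor) -> torch.Tensor:
        # (total_tokens, dim) tokens, grouped along the M dimension by offs.
        return torch._grouped_mm(x, self.w.transpose(-2, -1), offs=offs)


class TinyMoE(nn.Module):
    def __init__(self):
        super().__init__()
        self.experts = TinyExperts(num_experts=4, dim=512, hidden_dim=1024)

    def forward(self, x, offs):
        return self.experts(x, offs)


model = TinyMoE().cuda()

# Swap torch._grouped_mm in modules whose FQN contains "experts", mirroring the
# `--quantize.grouped_mm.mx.fqns="experts"` TorchTitan flag in item 2 above.
# MXFP8 requires CUDA 12.8+ and an SM100+ GPU per the system requirements.
quantize_(
    model,
    MoETrainingConfig(scaling_type=MoEScalingType.MXFP8),
    filter_fn=lambda mod, fqn: "experts" in fqn,
)
```

After conversion, the experts' grouped GEMMs run through torchao's `_quantize_then_scaled_grouped_mm` under the hood, as described in item 3.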
