@bremerm31 (Contributor)

Summary:
Adding a gbps metric and marking the kernels as memory bound for `mx4_to_fp32` and `fp32_to_mx4`. On H100, we see

```
<tritonbench run>  --op mx4_to_fp32 --device cuda --metrics gbps,hw_roofline,latency
```
prints
```
  (Size, Group Size, ebits, mbits)    hw_roofline    fbgemm_mx4_to_fp32-gbps    fbgemm_mx4_to_fp32-latency
----------------------------------  -------------  -------------------------  ----------------------------
                  (6392, 32, 2, 1)           2000                     8.6048             0.006336 (±7.07%)
                (278528, 32, 2, 1)           2000                   335.928              0.007072 (±8.14%)
              (17825792, 32, 2, 1)           2000                  1874.3                0.081120 (±1.70%)
              (17825809, 32, 2, 1)           2000                  1879.5                0.080896 (±1.90%)
```

```
<tritonbench run>  --op fp32_to_mx4 --device cuda --metrics gbps,hw_roofline,latency
```
prints
```
  (Size, Group Size, ebits, mbits, rounding_mode, stochastic_casting)    hw_roofline    fbgemm_fp32_to_mx4-gbps    fbgemm_fp32_to_mx4-latency
---------------------------------------------------------------------  -------------  -------------------------  ----------------------------
                     (24048, 32, 2, 1, <RoundingMode.even: 2>, False)           2000                    15.6924             0.006944 (±6.91%)
                   (1048576, 32, 2, 1, <RoundingMode.even: 2>, False)           2000                   412.444              0.011520 (±5.00%)
                  (67108864, 32, 2, 1, <RoundingMode.even: 2>, False)           2000                  1734.07               0.175360 (±0.53%)
                  (67108880, 32, 2, 1, <RoundingMode.even: 2>, False)           2000                  1733.76               0.175392 (±0.64%)
```
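
For reference, here is a minimal sketch of how the gbps column can be reproduced from the printed sizes and latencies. The helper names are hypothetical (not the tritonbench implementation); the assumptions are that each MX4 group of 32 packs 32 4-bit elements plus one shared 8-bit exponent into 17 bytes, that `Size` is the packed byte count for `mx4_to_fp32` and the fp32 element count for `fp32_to_mx4`, and that latency is reported in ms. All three assumptions are consistent with the tables above.

```
# Hypothetical helpers: gbps = (bytes read + bytes written) / latency.
GROUP_SIZE = 32
GROUP_BYTES = GROUP_SIZE // 2 + 1  # 32 x 4-bit elements + 1 shared exponent byte = 17

def mx4_to_fp32_gbps(packed_bytes: int, latency_ms: float) -> float:
    # Size for mx4_to_fp32 is assumed to be the packed mx4 tensor size in bytes.
    elements = packed_bytes / GROUP_BYTES * GROUP_SIZE
    fp32_bytes = elements * 4  # 4 bytes per fp32 output element
    return (packed_bytes + fp32_bytes) / (latency_ms * 1e-3) / 1e9

def fp32_to_mx4_gbps(num_elements: int, latency_ms: float) -> float:
    # Size for fp32_to_mx4 is assumed to be the number of fp32 input elements.
    fp32_bytes = num_elements * 4
    packed_bytes = num_elements / GROUP_SIZE * GROUP_BYTES
    return (fp32_bytes + packed_bytes) / (latency_ms * 1e-3) / 1e9

print(mx4_to_fp32_gbps(17825792, 0.081120))   # ~1874.3, matches the table
print(fp32_to_mx4_gbps(67108864, 0.175360))   # ~1734.1, matches the table
```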

Differential Revision: D85782147
meta-codesync bot commented Oct 29, 2025

@bremerm31 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85782147.
