@kmalik22 kmalik22 commented Sep 8, 2025

Summary

  • Adds tensor-parallel (megatron-style) sharding for the Attention and MLP layers (see the sketch after this list)
  • Adds a unit test for the above
  • Adds scripts to test, benchmark, and profile the forward pass of the MLP and Attention blocks across different shapes, with both default and megatron sharding
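
For context on the first bullet, here is a minimal, self-contained sketch of the megatron-style MLP sharding pattern (column-parallel first matmul, row-parallel second matmul) in plain JAX. This is not the code from this PR; the mesh axis name, shapes, and initialization are illustrative only.

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1-D mesh over all available devices; "model" is the tensor-parallel axis.
mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

d_model, hidden, batch, seq = 256, 1024, 8, 32   # illustrative shapes
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)

# Column-parallel first projection, row-parallel second projection.
w1 = jax.device_put(jax.random.normal(k1, (d_model, hidden)) * d_model**-0.5,
                    NamedSharding(mesh, P(None, "model")))
w2 = jax.device_put(jax.random.normal(k2, (hidden, d_model)) * hidden**-0.5,
                    NamedSharding(mesh, P("model", None)))
x = jax.device_put(jax.random.normal(k3, (batch, seq, d_model)),
                   NamedSharding(mesh, P()))     # activations replicated

@jax.jit
def mlp(x, w1, w2):
    h = jax.nn.gelu(x @ w1)   # sharded over the hidden dim, no communication
    y = h @ w2                # per-device partial sums over the hidden dim
    # Forcing a replicated output makes XLA insert a single all-reduce, so the
    # comm volume is O(batch * seq * d_model) rather than O(num_parameters).
    return jax.lax.with_sharding_constraint(y, NamedSharding(mesh, P()))

print(mlp(x, w1, w2).sharding)
```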

Test

uv run -m pytest src/openpi/models/megatron_sharding_test.py -v

Scripts

All scripts are in scripts/kmalik2; an example invocation follows the list.

  • test_feedforward_inference.py: Test, profile, and benchmark timing for the MLP block with default and megatron sharding
  • test_attention_inference.py: Test, profile, and benchmark timing for the Attention block with default and megatron sharding
  • profile_all.py: Wrapper script to generate profiles for a few different configurations
  • timing_sweep.py: Wrapper script to run feedforward inference for a few different default and megatron configs and dump timing information to a CSV
  • timing_sweep_attention.py: Wrapper script to run attention inference for a few different default and megatron configs and dump timing information to a CSV
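
For example, to run the MLP timing sweep (assuming the script runs with its built-in defaults; check the script itself for the available shape and sharding options):

uv run scripts/kmalik2/timing_sweep.py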

Summary of MLP benchmark results

For small values of BT (batch × sequence length), megatron sharding is faster; for larger values of BT, the default sharding wins.
The graph below shows forward latency for fixed values of model_dim, hidden_dim, num_shards, and batch_size; only the sequence length varies.

SCR-20250907-owkh (forward latency vs. sequence length)

The same data, shown as a speedup:
SCR-20250907-owmc
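
My reading of why the crossover appears (not derived from the scripts, just comm-volume counting with collective constant factors ignored): megatron tensor parallelism all-reduces the block's output activations, while the default/fsdp sharding all-gathers the block's weights each forward pass. A rough per-layer count with illustrative shapes:

```python
# Rough per-layer comm volume in elements (shapes are illustrative, not from the PR).
def megatron_mlp_comm(batch, seq, d_model):
    # all-reduce of the MLP block's output activations
    return batch * seq * d_model

def fsdp_mlp_comm(d_model, hidden):
    # all-gather of the two MLP weight matrices
    return d_model * hidden + hidden * d_model

for seq in (512, 4096, 32768):
    megatron_wins = megatron_mlp_comm(1, seq, 2048) < fsdp_mlp_comm(2048, 8192)
    print(seq, megatron_wins)   # True, True, False: crossover as seq grows
```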

Summary of Attention benchmark results

As with the MLP, megatron comms scale as O(num_activations) while fsdp comms scale as O(num_parameters).
If BT is smaller than roughly 4·d_model, megatron sharding helps; otherwise stick with fsdp.
SCR-20250908-jxuv

Speedup
SCR-20250908-jxwc
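
A quick sanity check of the BT < 4·d_model rule of thumb, counting comm elements the same way as above (shapes are illustrative, not taken from the benchmark):

```python
B, T, D = 1, 512, 2048        # illustrative shapes
megatron = B * T * D          # all-reduce of the attention block's output activations
fsdp = 4 * D * D              # all-gather of the Q, K, V, O projection weights
print(megatron < fsdp, B * T < 4 * D)   # (True, True): megatron comms are smaller here
```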
