
fix: handle zero-size tensors in MoE token dispatchers#3626

Open
callum-ward-inflection wants to merge 1 commit into NVIDIA:main from callum-ward-inflection:cw/fix-moe-zero-size-tensor

Conversation

callum-ward-inflection commented Feb 26, 2026

What does this PR do ?

Fixes a MoE token dispatcher crash when an Expert Parallelism (EP) rank receives zero tokens from the router.

When an EP rank receives zero tokens from the router, the fused permute/unpermute autograd pair breaks and .view() crashes on a zero-size tensor. The fix switches both directions to the unfused path, reconnects the gradient graph for backward collectives, and guards .view() in all three dispatcher classes.

Fixes: #1877

Changes

Three changes to megatron/core/transformer/moe/token_dispatcher.py:

  1. Symmetric unfused permute/unpermute for empty EP ranks — TE's fused_permute saves state that fused_unpermute reads during backward. With zero tokens this state is invalid. Both dispatch_postprocess and combine_preprocess detect zero-token ranks and fall back to the unfused PyTorch path together (they must match — mixing fused permute with unfused unpermute crashes due to incompatible index formats).

  2. Gradient connectivity for backward collectives — Unfused unpermute with zero tokens returns a tensor disconnected from the autograd graph. During backward, distributed collectives (AllGather/ReduceScatter) need every rank to participate. A detached tensor means one rank never triggers its collective and all the others hang (NCCL timeout). The fix reconnects the output via unpermuted_local_hidden + hidden_states.sum() * 0, a zero-valued term that keeps the tensor in the autograd graph.

  3. Safety guard on .view() for all three dispatcher classes — MoEAllGatherTokenDispatcher, MoEAlltoAllTokenDispatcher, and MoEFlexTokenDispatcher all call .view(self.hidden_shape), which crashes on zero-size input. Guarded by returning a zero tensor of the correct shape instead.
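To illustrate item 1, here is a minimal sketch of an unfused permute/unpermute pair in pure PyTorch (hypothetical helpers, not the actual Megatron-Core or TE functions): as long as both directions take the same path, the round trip also works when a rank holds zero tokens.

```python
import torch

def unfused_permute(tokens, indices):
    # Pure-PyTorch fallback permute: reorder rows into expert-sorted order.
    return tokens.index_select(0, indices)

def unfused_unpermute(permuted, indices, num_tokens, hidden_size):
    # Scatter rows back to their pre-permute positions.
    out = torch.zeros(num_tokens, hidden_size,
                      dtype=permuted.dtype, device=permuted.device)
    return out.index_copy(0, indices, permuted)

# A rank that received zero tokens: both directions must agree on the
# unfused path, since the fused kernel's saved state is invalid here.
tokens = torch.empty(0, 8)
indices = torch.empty(0, dtype=torch.long)
restored = unfused_unpermute(unfused_permute(tokens, indices), indices, 0, 8)
assert restored.shape == (0, 8)
```

Mixing paths (fused permute with unfused unpermute, or vice versa) fails because the two store routing indices in incompatible formats, which is why the PR makes the decision symmetric.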
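Items 2 and 3 can be sketched together. The helper below is hypothetical (not the actual dispatcher code); the .view(hidden_shape) call and the sum() * 0 reconnection trick are taken from the description above.

```python
import torch

def view_with_zero_token_guard(local_hidden, hidden_shape):
    """Hypothetical sketch: avoid .view() on a zero-size tensor and keep
    the result attached to the autograd graph so backward still reaches
    this rank's tensors (and hence its collectives)."""
    if local_hidden.numel() == 0:
        # .view(hidden_shape) raises on a zero-size tensor, so build a
        # correctly shaped zero tensor instead, and reconnect it to the
        # graph with a zero-valued term.
        zeros = torch.zeros(hidden_shape, dtype=local_hidden.dtype,
                            device=local_hidden.device)
        return zeros + local_hidden.sum() * 0
    return local_hidden.view(hidden_shape)

# Zero tokens on this rank: .view() alone would crash, and a plain zero
# tensor would be detached from the graph.
empty = torch.empty(0, 8, requires_grad=True)
out = view_with_zero_token_guard(empty, (2, 3, 8))
out.sum().backward()  # gradient (all zeros) still flows to this rank
assert empty.grad is not None and empty.grad.shape == (0, 8)
```

The zero-valued addition changes nothing numerically but makes the output a function of the input, so every rank participates in backward and NCCL collectives stay in lockstep.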

Reproduction

Reproducible with openai/gpt-oss-20b (32 experts, hidden_size=2880) SFT with TP=4, EP=2 on 16 GPUs. The pretrained router's weight distribution combined with reduced tokens-per-rank from TP=4 sequence parallelism causes certain EP ranks to consistently receive zero tokens.

Testing

| Config | Result |
| --- | --- |
| TP=4, EP=2 (previously crashing) | PASS (20 steps) |
| TP=2, EP=2 | PASS |
| TP=1, EP=2 | PASS |
| TP=4, EP=4 | PASS |

Contribution process

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR


copy-pr-bot bot commented Feb 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team February 26, 2026 16:33
@ericharper ericharper added the Expert Review Apply this label to indicate that your PR is ready for expert review. label Feb 26, 2026
Victarry (Contributor) commented

Hi @callum-ward-inflection, I think the ideal way to resolve this issue is to fix the fused permute kernel. Adding more conditions in the token dispatcher will make it more complex.

Could you post the detailed error you are facing with zero tokens? IIRC, the TE fused permute kernel should support the zero-token case. cc @hxbai

hxbai (Contributor) commented Feb 27, 2026

I agree with @Victarry , it is better to create a fix to TE's permute function rather than to MCore.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Feb 28, 2026

Labels

community-request, complexity: low, Expert Review, needs-follow-up


Development

Successfully merging this pull request may close these issues.

moe-token-dispatcher-type alltoall error

7 participants