
fix: handle zero-size tensors in MoE token dispatchers#3626

Open
callum-ward-inflection wants to merge 1 commit into NVIDIA:main from callum-ward-inflection:cw/fix-moe-zero-size-tensor

Conversation

callum-ward-inflection commented Feb 26, 2026

What does this PR do ?

Fixes a MoE token dispatcher crash when an Expert Parallelism (EP) rank receives zero tokens from the router.

When an EP rank receives zero tokens from the router, the fused permute/unpermute autograd pair breaks and .view() crashes on a zero-size tensor. The fix switches both directions to the unfused path, reconnects the gradient graph for backward collectives, and guards .view() in all three dispatcher classes.

Fixes: #1877

Changes

Three changes to megatron/core/transformer/moe/token_dispatcher.py:

  1. Symmetric unfused permute/unpermute for empty EP ranks — TE's fused_permute saves state that fused_unpermute reads during backward. With zero tokens this state is invalid. Both dispatch_postprocess and combine_preprocess detect zero-token ranks and fall back to the unfused PyTorch path together (they must match — mixing fused permute with unfused unpermute crashes due to incompatible index formats).

  2. Gradient connectivity for backward collectives — Unfused unpermute with zero tokens returns a tensor disconnected from the autograd graph. During backward, distributed collectives (AllGather/ReduceScatter) need every rank to participate. A detached tensor means one rank never triggers its collective and all the others hang (NCCL timeout). The fix reconnects the output via unpermuted_local_hidden + hidden_states.sum() * 0, a zero-valued term that keeps the tensor in the autograd graph.

  3. Safety guard on .view() for all three dispatcher classes — MoEAllGatherTokenDispatcher, MoEAlltoAllTokenDispatcher, and MoEFlexTokenDispatcher all call .view(self.hidden_shape), which crashes on zero-size input. Guarded by returning a zero tensor of the correct shape instead.
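To illustrate item 1, here is a minimal sketch of an unfused permute/unpermute pair in pure PyTorch (hypothetical helpers, not the actual Megatron-Core or TE functions): as long as both directions take the same path, the round trip also works when a rank holds zero tokens.

```python
import torch

def unfused_permute(tokens, indices):
    # Pure-PyTorch fallback permute: reorder rows into expert-sorted order.
    return tokens.index_select(0, indices)

def unfused_unpermute(permuted, indices, num_tokens, hidden_size):
    # Scatter rows back to their pre-permute positions.
    out = torch.zeros(num_tokens, hidden_size,
                      dtype=permuted.dtype, device=permuted.device)
    return out.index_copy(0, indices, permuted)

# A rank that received zero tokens: both directions must agree on the
# unfused path, since the fused kernel's saved state is invalid here.
tokens = torch.empty(0, 8)
indices = torch.empty(0, dtype=torch.long)
restored = unfused_unpermute(unfused_permute(tokens, indices), indices, 0, 8)
assert restored.shape == (0, 8)
```

Mixing paths (fused permute with unfused unpermute, or vice versa) fails because the two store routing indices in incompatible formats, which is why the PR makes the decision symmetric.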
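Items 2 and 3 can be sketched together. The helper below is hypothetical (not the actual dispatcher code); the .view(hidden_shape) call and the sum() * 0 reconnection trick are taken from the description above.

```python
import torch

def view_with_zero_token_guard(local_hidden, hidden_shape):
    """Hypothetical sketch: avoid .view() on a zero-size tensor and keep
    the result attached to the autograd graph so backward still reaches
    this rank's tensors (and hence its collectives)."""
    if local_hidden.numel() == 0:
        # .view(hidden_shape) raises on a zero-size tensor, so build a
        # correctly shaped zero tensor instead, and reconnect it to the
        # graph with a zero-valued term.
        zeros = torch.zeros(hidden_shape, dtype=local_hidden.dtype,
                            device=local_hidden.device)
        return zeros + local_hidden.sum() * 0
    return local_hidden.view(hidden_shape)

# Zero tokens on this rank: .view() alone would crash, and a plain zero
# tensor would be detached from the graph.
empty = torch.empty(0, 8, requires_grad=True)
out = view_with_zero_token_guard(empty, (2, 3, 8))
out.sum().backward()  # gradient (all zeros) still flows to this rank
assert empty.grad is not None and empty.grad.shape == (0, 8)
```

The zero-valued addition changes nothing numerically but makes the output a function of the input, so every rank participates in backward and NCCL collectives stay in lockstep.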

Reproduction

Reproducible with openai/gpt-oss-20b (32 experts, hidden_size=2880) SFT with TP=4, EP=2 on 16 GPUs. The pretrained router's weight distribution combined with reduced tokens-per-rank from TP=4 sequence parallelism causes certain EP ranks to consistently receive zero tokens.

Testing

| Config | Result |
| --- | --- |
| TP=4, EP=2 (previously crashing) | PASS (20 steps) |
| TP=2, EP=2 | PASS |
| TP=1, EP=2 | PASS |
| TP=4, EP=4 | PASS |

Contribution process

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR


copy-pr-bot bot commented Feb 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team February 26, 2026 16:33
@ericharper ericharper added the Expert Review Apply this label to indicate that your PR is ready for expert review. label Feb 26, 2026
Victarry (Contributor) commented

Hi @callum-ward-inflection, I think the ideal way to resolve this issue is to fix the fused permute kernel. Adding more conditions in the token dispatcher will make it more complex.

Could you post the detailed error you are facing with zero tokens? IIRC, the TE fused permute kernel should support the zero-token case. cc @hxbai

hxbai (Contributor) commented Feb 27, 2026

I agree with @Victarry , it is better to create a fix to TE's permute function rather than to MCore.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Feb 28, 2026

Labels

community-request, complexity: low, Expert Review, needs-follow-up


Development

Successfully merging this pull request may close these issues.

moe-token-dispatcher-type alltoall error

7 participants