
[Feature] Expert parallelism support for MoE models#96

Open
NikitosKh wants to merge 3 commits into sgl-project:main from NikitosKh:feature/expert-parallelism

Conversation

@NikitosKh (Contributor)

This adds expert parallelism (EP) for MoE models. The approach follows SGLang's EP design — dispatch tokens to the rank that owns the target expert via all-to-all, compute locally using the existing fused MoE kernels from #59, then combine results back.

Depends on #93 for the streaming weight loader (which also handles EP expert partitioning).

How it works

Instead of TP-sharding every expert's intermediate dimension, EP gives each rank num_experts / ep_size complete experts. The forward pass for each MoE layer then does:

  1. Run the replicated gate to get top-k expert IDs (same as before)
  2. Sort token-expert pairs by destination rank, exchange via all_to_all_single
  3. Run fused_experts_impl locally on received tokens
  4. Send results back via another all-to-all
  5. Un-permute and apply routing weights

No new kernels — steps 1 and 3 reuse fused_topk and fused_experts_impl directly.
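The permutation bookkeeping in steps 2 and 5 can be sketched without any actual communication. The helper below simulates, on a single process, how flat token-expert pairs are ordered into contiguous per-rank send buffers (the `send_counts` would become the split sizes for `all_to_all_single`) and later un-permuted. It is an illustrative stand-in, not the PR's code:

```python
def ep_dispatch_order(topk_expert_ids, num_experts, ep_size):
    """Return (perm, send_counts, inv_perm) for a flat list of
    token-expert assignments, sorted by the rank that owns each expert."""
    experts_per_rank = num_experts // ep_size
    dest = [e // experts_per_rank for e in topk_expert_ids]
    # Stable sort by destination rank -> contiguous send buffer per rank.
    perm = sorted(range(len(dest)), key=lambda i: dest[i])
    send_counts = [0] * ep_size
    for r in dest:
        send_counts[r] += 1
    # Inverse permutation: undoes the sort after the return all-to-all.
    inv_perm = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv_perm[old_pos] = new_pos
    return perm, send_counts, inv_perm

# Example: 6 assignments, 8 experts over ep_size=4 (2 experts per rank).
ids = [5, 0, 7, 2, 1, 6]
perm, counts, inv = ep_dispatch_order(ids, num_experts=8, ep_size=4)
sorted_ids = [ids[i] for i in perm]          # order sent over the wire
restored = [sorted_ids[inv[i]] for i in range(len(ids))]
assert restored == ids                       # step 5 recovers the original order
assert counts == [2, 1, 1, 2]                # tokens bound for each rank
```

The real kernel path works on tensors and exchanges `send_counts` between ranks first (each rank needs to know how much it will receive), but the sort/unsort structure is the same.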

A few things worth calling out:

  • When EP is active, NCCL is used as the main process group backend (not gloo). I tried creating a separate NCCL subgroup on top of gloo, but that causes a ~72 GiB memory imbalance across ranks from NCCL's internal buffer allocation. Using NCCL as the world group and creating a gloo subgroup for CPU coordination avoids this entirely.
  • PyNCCL still gets layered on top for TP all-reduce, so attention layers use the same fast path as before.
  • CUDA graphs are auto-disabled since the all-to-all token counts vary per batch.
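The inverted-backend arrangement from the first bullet can be shown in miniature. NCCL requires GPUs, so this single-process sketch uses gloo for both the world group and the subgroup; the pattern (one world group for collectives, a separate subgroup for CPU-side coordination) is what matters, and the port is illustrative:

```python
import torch
import torch.distributed as dist

# Single-process stand-in: gloo plays both roles here, whereas the PR uses
# NCCL as the world backend and creates a gloo subgroup on top of it.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29517",  # illustrative port
    rank=0,
    world_size=1,
)
cpu_group = dist.new_group(ranks=[0], backend="gloo")

t = torch.ones(1)
dist.all_reduce(t, group=cpu_group)  # trivial reduction with one rank
assert t.item() == 1.0

dist.destroy_process_group()
```

Doing it this way round means NCCL's internal buffers are allocated once for the world group, rather than once more for an extra subgroup on every rank, which is the imbalance described above.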

What changed

1 new file, 9 modified, 1 doc update:

| File | What |
| --- | --- |
| `moe/ep.py` | New EP backend |
| `distributed/info.py` | EP rank/size (same pattern as TP) |
| `distributed/__init__.py` | Exports |
| `distributed/impl.py` | EP NCCL group + all-to-all wrapper |
| `engine/config.py` | `ep_size` field |
| `engine/engine.py` | NCCL init, backend selection, CUDA graph disable, MoE dummy weight fix |
| `layers/moe.py` | EP weight shapes (fewer full-width experts) |
| `models/weight.py` | EP expert partitioning (on top of #93) |
| `moe/__init__.py` | Register EP backend |
| `server/args.py` | `--ep-size` flag |
| `docs/features.md` | EP section |
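To make the `layers/moe.py` row concrete: TP keeps every expert but shards its intermediate dimension, while EP keeps fewer experts at full width. A rough shape sketch, with illustrative dimensions and the common fused gate/up ("w13") naming, neither taken from the PR:

```python
# Illustrative dims only; real values come from the model config.
num_experts, hidden, inter = 128, 2048, 768
tp_size = ep_size = 4

# TP: all experts present, intermediate dim sharded across ranks.
tp_w13_shape = (num_experts, 2 * inter // tp_size, hidden)

# EP: num_experts / ep_size experts per rank, each at full width.
ep_w13_shape = (num_experts // ep_size, 2 * inter, hidden)

assert tp_w13_shape == (128, 384, 2048)
assert ep_w13_shape == (32, 1536, 2048)
# Per-rank parameter count is the same either way.
assert 128 * 384 == 32 * 1536
```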

Constraints

  • ep_size must equal tp_size or 1 (they share the same NCCL world group)
  • num_experts % ep_size == 0
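These two constraints are cheap to check up front. A hypothetical validation helper (the PR enforces them in its own config path):

```python
def check_ep_config(num_experts: int, tp_size: int, ep_size: int) -> int:
    """Validate the EP constraints above; returns experts owned per rank."""
    if ep_size not in (1, tp_size):
        raise ValueError("ep_size must equal tp_size or 1")
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    return num_experts // ep_size

assert check_ep_config(128, 4, 4) == 32   # EP active: 32 experts per rank
assert check_ep_config(128, 4, 1) == 128  # EP off: every rank has all experts
```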

Testing

Ran on Qwen3-30B-A3B with --tp 4 --ep-size 4 on 4×H200:

  • Compared greedy outputs (temperature=0, top_k=1) against TP-only on 5 prompts at 32 tokens each — all identical.
  • Verified dummy weight mode (server + API), shell mode (multi-turn with real weights).

Usage

```
python -m minisgl --model "Qwen/Qwen3-30B-A3B" --tp 4 --ep-size 4
```

@NikitosKh force-pushed the feature/expert-parallelism branch 2 times, most recently from cb7ca2f to 6a5329a on March 6, 2026 at 10:43
@NikitosKh force-pushed the feature/expert-parallelism branch from 6a5329a to 001b824 on March 6, 2026 at 10:44
@DarkSharpness (Collaborator)

Could you help resolve the conflict? Thanks for the PR!

