
[Feature] Expert parallelism support for MoE models#96

Open
NikitosKh wants to merge 3 commits into sgl-project:main from NikitosKh:feature/expert-parallelism

Conversation

@NikitosKh (Contributor)

This adds expert parallelism (EP) for MoE models. The approach follows SGLang's EP design — dispatch tokens to the rank that owns the target expert via all-to-all, compute locally using the existing fused MoE kernels from #59, then combine results back.

Depends on #93 for the streaming weight loader (which also handles EP expert partitioning).

How it works

Instead of TP-sharding every expert's intermediate dimension, EP gives each rank num_experts / ep_size complete experts. The forward pass for each MoE layer then does:

  1. Run the replicated gate to get top-k expert IDs (same as before)
  2. Sort token-expert pairs by destination rank, exchange via all_to_all_single
  3. Run fused_experts_impl locally on received tokens
  4. Send results back via another all-to-all
  5. Un-permute and apply routing weights

No new kernels — steps 1 and 3 reuse fused_topk and fused_experts_impl directly.
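The permutation bookkeeping in steps 2 and 5 can be sketched without any actual communication. The helper below simulates, on a single process, how flat token-expert pairs are ordered into contiguous per-rank send buffers (the `send_counts` would become the split sizes for `all_to_all_single`) and later un-permuted. It is an illustrative stand-in, not the PR's code:

```python
def ep_dispatch_order(topk_expert_ids, num_experts, ep_size):
    """Return (perm, send_counts, inv_perm) for a flat list of
    token-expert assignments, sorted by the rank that owns each expert."""
    experts_per_rank = num_experts // ep_size
    dest = [e // experts_per_rank for e in topk_expert_ids]
    # Stable sort by destination rank -> contiguous send buffer per rank.
    perm = sorted(range(len(dest)), key=lambda i: dest[i])
    send_counts = [0] * ep_size
    for r in dest:
        send_counts[r] += 1
    # Inverse permutation: undoes the sort after the return all-to-all.
    inv_perm = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv_perm[old_pos] = new_pos
    return perm, send_counts, inv_perm

# Example: 6 assignments, 8 experts over ep_size=4 (2 experts per rank).
ids = [5, 0, 7, 2, 1, 6]
perm, counts, inv = ep_dispatch_order(ids, num_experts=8, ep_size=4)
sorted_ids = [ids[i] for i in perm]          # order sent over the wire
restored = [sorted_ids[inv[i]] for i in range(len(ids))]
assert restored == ids                       # step 5 recovers the original order
assert counts == [2, 1, 1, 2]                # tokens bound for each rank
```

The real kernel path works on tensors and exchanges `send_counts` between ranks first (each rank needs to know how much it will receive), but the sort/unsort structure is the same.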

A few things worth calling out:

  • When EP is active, NCCL is used as the main process group backend (not gloo). I tried creating a separate NCCL subgroup on top of gloo, but that causes a ~72 GiB memory imbalance across ranks from NCCL's internal buffer allocation. Using NCCL as the world group and creating a gloo subgroup for CPU coordination avoids this entirely.
  • PyNCCL still gets layered on top for TP all-reduce, so attention layers use the same fast path as before.
  • CUDA graphs are auto-disabled since the all-to-all token counts vary per batch.
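The inverted-backend arrangement from the first bullet can be shown in miniature. NCCL requires GPUs, so this single-process sketch uses gloo for both the world group and the subgroup; the pattern (one world group for collectives, a separate subgroup for CPU-side coordination) is what matters, and the port is illustrative:

```python
import torch
import torch.distributed as dist

# Single-process stand-in: gloo plays both roles here, whereas the PR uses
# NCCL as the world backend and creates a gloo subgroup on top of it.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29517",  # illustrative port
    rank=0,
    world_size=1,
)
cpu_group = dist.new_group(ranks=[0], backend="gloo")

t = torch.ones(1)
dist.all_reduce(t, group=cpu_group)  # trivial reduction with one rank
assert t.item() == 1.0

dist.destroy_process_group()
```

Doing it this way round means NCCL's internal buffers are allocated once for the world group, rather than once more for an extra subgroup on every rank, which is the imbalance described above.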

What changed

1 new file, 9 modified, 1 doc update:

| File | What |
| --- | --- |
| `moe/ep.py` | New EP backend |
| `distributed/info.py` | EP rank/size (same pattern as TP) |
| `distributed/__init__.py` | Exports |
| `distributed/impl.py` | EP NCCL group + all-to-all wrapper |
| `engine/config.py` | `ep_size` field |
| `engine/engine.py` | NCCL init, backend selection, CUDA graph disable, MoE dummy weight fix |
| `layers/moe.py` | EP weight shapes (fewer full-width experts) |
| `models/weight.py` | EP expert partitioning (on top of #93) |
| `moe/__init__.py` | Register EP backend |
| `server/args.py` | `--ep-size` flag |
| `docs/features.md` | EP section |
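To make the `layers/moe.py` row concrete: TP keeps every expert but shards its intermediate dimension, while EP keeps fewer experts at full width. A rough shape sketch, with illustrative dimensions and the common fused gate/up ("w13") naming, neither taken from the PR:

```python
# Illustrative dims only; real values come from the model config.
num_experts, hidden, inter = 128, 2048, 768
tp_size = ep_size = 4

# TP: all experts present, intermediate dim sharded across ranks.
tp_w13_shape = (num_experts, 2 * inter // tp_size, hidden)

# EP: num_experts / ep_size experts per rank, each at full width.
ep_w13_shape = (num_experts // ep_size, 2 * inter, hidden)

assert tp_w13_shape == (128, 384, 2048)
assert ep_w13_shape == (32, 1536, 2048)
# Per-rank parameter count is the same either way.
assert 128 * 384 == 32 * 1536
```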

Constraints

  • ep_size must equal tp_size or 1 (they share the same NCCL world group)
  • num_experts % ep_size == 0
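These two constraints are cheap to check up front. A hypothetical validation helper (the PR enforces them in its own config path):

```python
def check_ep_config(num_experts: int, tp_size: int, ep_size: int) -> int:
    """Validate the EP constraints above; returns experts owned per rank."""
    if ep_size not in (1, tp_size):
        raise ValueError("ep_size must equal tp_size or 1")
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    return num_experts // ep_size

assert check_ep_config(128, 4, 4) == 32   # EP active: 32 experts per rank
assert check_ep_config(128, 4, 1) == 128  # EP off: every rank has all experts
```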

Testing

Ran on Qwen3-30B-A3B with --tp 4 --ep-size 4 on 4×H200:

  • Compared greedy outputs (temperature=0, top_k=1) against TP-only on 5 prompts at 32 tokens each — all identical.
  • Verified dummy weight mode (server + API), shell mode (multi-turn with real weights).

Usage

```
python -m minisgl --model "Qwen/Qwen3-30B-A3B" --tp 4 --ep-size 4
```

@NikitosKh force-pushed the feature/expert-parallelism branch 2 times, most recently from cb7ca2f to 6a5329a on March 6, 2026 at 10:43
@NikitosKh force-pushed the feature/expert-parallelism branch from 6a5329a to 001b824 on March 6, 2026 at 10:44
@DarkSharpness (Collaborator)

Could you help resolve the conflict? Thanks for the PR!

