# ExpertRT

ExpertRT is a focused C++/CUDA runtime for Mixture-of-Experts (MoE) routing primitives: token-to-expert assignment, dispatch packing, and output combine.
The repository is designed as a systems component that can plug into PyTorch training and inference pipelines while remaining modular for future multi-GPU and distributed routing work.
## Motivation

Sparse MoE layers reduce per-token compute by activating only a subset of experts. The routing path becomes a critical system bottleneck:
- token -> expert assignment (top-k)
- metadata materialization
- token packing into expert-major buffers
- load accounting and imbalance visibility
- weighted combine back to token order
ExpertRT targets this path with explicit memory layout contracts and benchmarkable primitives.
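The first step of this path, top-k token-to-expert assignment, can be sketched framework-agnostically. The function below is illustrative only (the name `topk_route` and its return shape are hypothetical, not ExpertRT's API); real kernels do this on the GPU over batched tensors:

```python
import math

def topk_route(logits, top_k):
    """For each token's expert logits, pick the top_k experts and
    normalize their scores into routing weights via softmax."""
    assignments = []
    for row in logits:
        # Indices of the top_k largest logits for this token.
        experts = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:top_k]
        # Softmax over the selected logits only (shifted for stability).
        hi = max(row[e] for e in experts)
        exps = [math.exp(row[e] - hi) for e in experts]
        total = sum(exps)
        assignments.append([(e, x / total) for e, x in zip(experts, exps)])
    return assignments
```

Each token ends up with `top_k` (expert, weight) pairs whose weights sum to 1, which is the metadata the rest of the path consumes.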
## Features

- Route primitive (`route`) that returns top-k assignments and compact routing metadata.
- Dispatch primitive (`dispatch`) that packs token rows into expert-major contiguous buffers.
- Combine primitive (`combine`) that applies routing weights and reduces back to token-major output.
- Optional PyTorch C++/CUDA extension with a Python fallback reference path.
- Benchmark scripts for dispatch, combine, and end-to-end routing path.
- Test suite validating correctness and common edge cases.
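The combine contract in particular is easy to state in plain Python: take expert-major output rows, scale each by its routing weight, and accumulate into the owning token's row. A minimal sketch (argument names are hypothetical, not ExpertRT's signature):

```python
def combine(expert_outputs, routed_token_ids, routing_weights, num_tokens, hidden):
    """Weighted reduction of expert-major rows back to token-major order.

    expert_outputs[i] is the expert result for packed slot i, which came
    from token routed_token_ids[i] with routing weight routing_weights[i].
    """
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for row, tok, w in zip(expert_outputs, routed_token_ids, routing_weights):
        for j, v in enumerate(row):
            out[tok][j] += w * v
    return out
```

Because each token appears in up to `top_k` slots, the reduction is a scatter-add; the runtime version performs it with vectorized tensor ops rather than this per-element loop.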
## Components

- C++/CUDA core: API contracts and extension entry points in `include/`, `src/`, and `cuda/`.
- Python package: stable runtime API in `python/expertrt/ops.py`, with a readable reference implementation in `python/expertrt/reference.py`.
- Benchmarks: runtime-vs-reference comparisons in `benchmarks/`.
- Docs: architecture, memory layout, benchmark methodology, and roadmap in `docs/`.
## Installation

```bash
pip install -e .
```

- If a CUDA toolchain is available, setup builds the extension with CUDA sources.
- Otherwise it builds a CPU-only extension that keeps the same Python API.
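The CUDA-or-CPU decision at build time typically reduces to probing for a toolchain. A minimal sketch of such a probe, assuming nothing about the actual `setup.py` logic (the function name `has_cuda_toolchain` is hypothetical):

```python
import os
import shutil

def has_cuda_toolchain():
    """Heuristic probe: a CUDA build is plausible if nvcc is on PATH, or
    CUDA_HOME/CUDA_PATH points at an installed toolkit."""
    if shutil.which("nvcc") is not None:
        return True
    for var in ("CUDA_HOME", "CUDA_PATH"):
        home = os.environ.get(var)
        if home and os.path.exists(os.path.join(home, "bin", "nvcc")):
            return True
    return False
```

A build script would then select CUDA sources or a CPU-only source list based on the probe, keeping the Python-facing API identical in both cases.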
## Building with CMake

```bash
cmake -S . -B build
cmake --build build -j
```

## Quickstart

```python
import torch
import expertrt

tokens = torch.randn(4096, 1024, device="cuda")
logits = torch.randn(4096, 16, device="cuda")

route_out = expertrt.route(logits, top_k=2)
dispatch_out = expertrt.dispatch(tokens, route_out)

# Expert model compute would consume dispatch_out.packed_tokens.
expert_outputs = dispatch_out.packed_tokens

combined = expertrt.combine(expert_outputs, dispatch_out, route_out, num_tokens=tokens.shape[0])
```

## Benchmarks

```bash
python benchmarks/bench_dispatch.py --tokens 4096 --hidden 4096 --experts 16 --top-k 2
python benchmarks/bench_combine.py --tokens 4096 --hidden 4096 --experts 16 --top-k 2
python benchmarks/bench_end_to_end.py --tokens 8192 --hidden 4096 --experts 32 --top-k 2
```

See docs/benchmarks.md for methodology and interpretation.
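Methodology matters for primitives this short: without warmup and a robust statistic, JIT/caching effects and scheduler noise dominate. A generic timing harness sketch, not the actual benchmark scripts (the `bench` helper here is an assumption):

```python
import statistics
import time

def bench(fn, warmup=10, iters=100):
    """Time fn() with warmup; return the median latency in milliseconds.

    For CUDA work, fn should synchronize internally (e.g. call
    torch.cuda.synchronize()) so the timer measures real kernel time
    rather than just async launch overhead."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)
```

The median is deliberately used instead of the mean, since latency distributions for small kernels are heavy-tailed.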
## Tests

```bash
pytest tests
```

## Project structure

```
include/expertrt/   # public runtime headers
src/                # C++ routing/dispatch/combine + bindings
cuda/               # CUDA kernel files and utility headers
python/expertrt/    # Python API, reference implementation, benchmark helpers
benchmarks/         # runnable benchmark scripts
tests/              # correctness and smoke tests
docs/               # architecture, memory layout, benchmarks, roadmap
```
## Limitations (v1)

- CUDA-specific custom kernels are scaffolded; v1 execution relies on ATen primitives (`topk`, `sort`, `bincount`, gather/scatter).
- Single-device routing semantics only; no cross-device token exchange yet.
- No fused `route` + `dispatch` kernel in v1.
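The ATen-composed dispatch path can be mirrored in plain Python to see what those primitives are doing: count slots per expert (`bincount`), exclusive-prefix-sum the counts into expert offsets, then stably scatter each routed slot into its expert's contiguous region (equivalent to a stable sort by expert id). An illustrative sketch, not the runtime's code:

```python
def pack_order(expert_ids_per_slot, num_experts):
    """Compute the expert-major packing permutation.

    Returns (order, offsets): order[i] is the original slot index placed
    at packed position i; offsets[e] is where expert e's rows begin."""
    # bincount: how many routed slots each expert receives
    counts = [0] * num_experts
    for e in expert_ids_per_slot:
        counts[e] += 1
    # exclusive prefix sum -> start offset of each expert's region
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    # stable scatter: preserves token order within each expert
    cursor = list(offsets)
    order = [0] * len(expert_ids_per_slot)
    for slot, e in enumerate(expert_ids_per_slot):
        order[cursor[e]] = slot
        cursor[e] += 1
    return order, offsets
```

Gathering token rows with `order` yields the expert-major packed buffer; `offsets` plus `counts` give each expert its contiguous row range.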
## Roadmap

- Fused route+dispatch kernels and vectorized combine.
- Better load-balancing diagnostics and optional token-cap policies.
- Distributed and multi-GPU expert routing plans.
- Integration hooks for full MoE training stacks.
See docs/roadmap.md for details.
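One of the planned load-balancing policies, token capping, is simple to state: each expert accepts at most `capacity` routed slots, and overflow slots are dropped (or, in other variants, re-routed). A sketch of the drop variant under a first-come priority, purely illustrative (the name `apply_token_cap` and its semantics are assumptions, not the planned design):

```python
def apply_token_cap(expert_ids_per_slot, num_experts, capacity):
    """Return a keep-mask: slot i survives only if its expert has
    accepted fewer than `capacity` slots so far (first-come priority)."""
    taken = [0] * num_experts
    keep = []
    for e in expert_ids_per_slot:
        ok = taken[e] < capacity
        keep.append(ok)
        if ok:
            taken[e] += 1
    return keep
```

Capping bounds per-expert buffer sizes (useful for static allocation) at the cost of dropped tokens under imbalance, which is why it is paired with better imbalance diagnostics on the roadmap.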