ExpertRT

ExpertRT is a focused C++/CUDA runtime for Mixture-of-Experts (MoE) routing primitives: token-to-expert assignment, dispatch packing, and output combine.

ExpertRT is designed as a systems component that plugs into PyTorch training and inference pipelines while remaining modular for future multi-GPU and distributed routing work.

Why expert routing matters

Sparse MoE layers reduce per-token compute by activating only a subset of experts, which makes the routing path a critical system bottleneck:

  1. token -> expert assignment (top-k)
  2. metadata materialization
  3. token packing into expert-major buffers
  4. load accounting and imbalance visibility
  5. weighted combine back to token order

ExpertRT targets this path with explicit memory layout contracts and benchmarkable primitives.
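The five steps above can be sketched in plain Python. This is a hedged reference illustration of the routing-path semantics, not the ExpertRT implementation (the real primitives operate on contiguous GPU tensors); all function and variable names here are illustrative.

```python
import math
from collections import defaultdict

def route(logits, top_k):
    """Steps 1-2: per-token top-k expert ids with softmax-normalized weights."""
    assignments = []
    for row in logits:
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:top_k]
        exps = [math.exp(row[e]) for e in ranked]
        total = sum(exps)
        assignments.append([(e, x / total) for e, x in zip(ranked, exps)])
    return assignments

def dispatch(tokens, assignments):
    """Steps 3-4: pack token rows expert-major and account per-expert load."""
    buckets = defaultdict(list)  # expert id -> list of (token index, weight)
    for t, pairs in enumerate(assignments):
        for e, w in pairs:
            buckets[e].append((t, w))
    packed, origin = [], []
    for e in sorted(buckets):  # expert-major order
        for t, w in buckets[e]:
            packed.append(tokens[t])
            origin.append((t, w))
    load = {e: len(buckets[e]) for e in buckets}
    return packed, origin, load

def combine(expert_outputs, origin, num_tokens, hidden):
    """Step 5: weighted reduce of expert outputs back to token order."""
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for row, (t, w) in zip(expert_outputs, origin):
        for j in range(hidden):
            out[t][j] += w * row[j]
    return out
```

With identity expert compute (expert outputs equal the packed inputs), combine reconstructs the original tokens exactly, because each token's top-k weights sum to one; this is a useful sanity check for any dispatch/combine pair.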

V1 features

  • Route primitive (route) that returns top-k assignments and compact routing metadata.
  • Dispatch primitive (dispatch) that packs token rows into expert-major contiguous buffers.
  • Combine primitive (combine) that applies routing weights and reduces back to token-major output.
  • Optional PyTorch C++/CUDA extension with Python fallback reference path.
  • Benchmark scripts for dispatch, combine, and end-to-end routing path.
  • Test suite validating correctness and common edge cases.

Architecture overview

  • C++/CUDA core: API contracts and extension entry points in include/, src/, and cuda/.
  • Python package: stable runtime API in python/expertrt/ops.py with readable reference implementation in python/expertrt/reference.py.
  • Benchmarks: runtime-vs-reference comparisons in benchmarks/.
  • Docs: architecture, memory layout, benchmark methodology, and roadmap in docs/.

Build and install

Python package (recommended)

pip install -e .
  • If a CUDA toolchain is available, setup builds the extension with CUDA sources.
  • Otherwise it builds a CPU extension and keeps the same Python API.
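One common pattern for this toolchain switch is to probe for nvcc at build time. The sketch below is an assumption about how such detection typically works in a setup.py (CUDA_HOME/PATH probing is a convention, not necessarily what ExpertRT's setup script does):

```python
import os
import shutil

def cuda_toolchain_available() -> bool:
    """Heuristic: build CUDA sources only when nvcc is reachable.

    Checks CUDA_HOME / CUDA_PATH first, then falls back to the PATH.
    This mirrors a common setup.py pattern; ExpertRT's actual
    detection logic may differ.
    """
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if cuda_home and os.path.exists(os.path.join(cuda_home, "bin", "nvcc")):
        return True
    return shutil.which("nvcc") is not None

# A setup.py would then select the extension type accordingly, e.g.:
#   ext = CUDAExtension(...) if cuda_toolchain_available() else CppExtension(...)
```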

CMake (library-oriented)

cmake -S . -B build
cmake --build build -j

Quick usage

import torch
import expertrt

tokens = torch.randn(4096, 1024, device="cuda")
logits = torch.randn(4096, 16, device="cuda")

route_out = expertrt.route(logits, top_k=2)
dispatch_out = expertrt.dispatch(tokens, route_out)

# expert model compute would consume dispatch_out.packed_tokens
expert_outputs = dispatch_out.packed_tokens
combined = expertrt.combine(expert_outputs, dispatch_out, route_out, num_tokens=tokens.shape[0])

Benchmarks

python benchmarks/bench_dispatch.py --tokens 4096 --hidden 4096 --experts 16 --top-k 2
python benchmarks/bench_combine.py --tokens 4096 --hidden 4096 --experts 16 --top-k 2
python benchmarks/bench_end_to_end.py --tokens 8192 --hidden 4096 --experts 32 --top-k 2

See docs/benchmarks.md for methodology and interpretation.

Tests

pytest tests

Repository layout

include/expertrt/    # public runtime headers
src/                 # C++ routing/dispatch/combine + bindings
cuda/                # CUDA kernel files and utility headers
python/expertrt/     # Python API, reference implementation, benchmark helpers
benchmarks/          # runnable benchmark scripts
tests/               # correctness and smoke tests
docs/                # architecture, memory layout, benchmarks, roadmap

Current limitations

  • CUDA-specific custom kernels are scaffolded; v1 execution relies on ATen primitives (topk, sort, bincount, gather/scatter).
  • Single-device routing semantics only; no cross-device token exchange yet.
  • No fused route+dispatch kernel in v1.
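The ATen-based v1 dispatch path can be understood as a sort-and-count composition. A minimal pure-Python analogue (illustrative only; the real path uses torch.topk, argsort, bincount, and gather/scatter on GPU tensors, and the function name here is hypothetical):

```python
def dispatch_by_sort(expert_ids, num_experts):
    """Pack routed copies expert-major via a stable sort on expert id.

    expert_ids: flat list of expert assignments, one per routed token copy.
    Returns the gather order (analogue of argsort), per-expert counts
    (analogue of bincount), and exclusive-prefix-sum offsets, which
    together define the expert-major packed layout.
    """
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    # Exclusive prefix sum gives each expert's offset into the packed buffer.
    offsets, acc = [], 0
    for c in counts:
        offsets.append(acc)
        acc += c
    return order, counts, offsets
```

Because Python's sort is stable, copies of the same expert keep their original token order, matching the semantics of a stable argsort-based dispatch.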

Roadmap

  • Fused route+dispatch kernels and vectorized combine.
  • Better load-balancing diagnostics and optional token-cap policies.
  • Distributed and multi-GPU expert routing plans.
  • Integration hooks for full MoE training stacks.

See docs/roadmap.md for details.
