# ExpertRT

ExpertRT is a focused C++/CUDA runtime for Mixture-of-Experts (MoE) routing primitives: token-to-expert assignment, dispatch packing, and output combine.
The repository is designed as a systems component that can plug into PyTorch training and inference pipelines while remaining modular for future multi-GPU and distributed routing work.
## Motivation

Sparse MoE layers reduce per-token compute by activating only a subset of experts. The routing path becomes a critical system bottleneck:
- token -> expert assignment (top-k)
- metadata materialization
- token packing into expert-major buffers
- load accounting and imbalance visibility
- weighted combine back to token order
ExpertRT targets this path with explicit memory layout contracts and benchmarkable primitives.
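The first step of this path, top-k token-to-expert assignment, can be sketched framework-agnostically. The function below is illustrative only (the name `topk_route` and its return shape are hypothetical, not ExpertRT's API); real kernels do this on the GPU over batched tensors:

```python
import math

def topk_route(logits, top_k):
    """For each token's expert logits, pick the top_k experts and
    normalize their scores into routing weights via softmax."""
    assignments = []
    for row in logits:
        # Indices of the top_k largest logits for this token.
        experts = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:top_k]
        # Softmax over the selected logits only (shifted for stability).
        hi = max(row[e] for e in experts)
        exps = [math.exp(row[e] - hi) for e in experts]
        total = sum(exps)
        assignments.append([(e, x / total) for e, x in zip(experts, exps)])
    return assignments
```

Each token ends up with `top_k` (expert, weight) pairs whose weights sum to 1, which is the metadata the rest of the path consumes.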
## Features

- Route primitive (`route`) that returns top-k assignments and compact routing metadata.
- Dispatch primitive (`dispatch`) that packs token rows into expert-major contiguous buffers.
- Combine primitive (`combine`) that applies routing weights and reduces back to token-major output.
- Optional PyTorch C++/CUDA extension with a Python fallback reference path.
- Benchmark scripts for dispatch, combine, and end-to-end routing path.
- Test suite validating correctness and common edge cases.
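The combine contract in particular is easy to state in plain Python: take expert-major output rows, scale each by its routing weight, and accumulate into the owning token's row. A minimal sketch (argument names are hypothetical, not ExpertRT's signature):

```python
def combine(expert_outputs, routed_token_ids, routing_weights, num_tokens, hidden):
    """Weighted reduction of expert-major rows back to token-major order.

    expert_outputs[i] is the expert result for packed slot i, which came
    from token routed_token_ids[i] with routing weight routing_weights[i].
    """
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for row, tok, w in zip(expert_outputs, routed_token_ids, routing_weights):
        for j, v in enumerate(row):
            out[tok][j] += w * v
    return out
```

Because each token appears in up to `top_k` slots, the reduction is a scatter-add; the runtime version performs it with vectorized tensor ops rather than this per-element loop.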
## Components

- C++/CUDA core: API contracts and extension entry points in `include/`, `src/`, and `cuda/`.
- Python package: stable runtime API in `python/expertrt/ops.py`, with a readable reference implementation in `python/expertrt/reference.py`.
- Benchmarks: runtime-vs-reference comparisons in `benchmarks/`.
- Docs: architecture, memory layout, benchmark methodology, and roadmap in `docs/`.
## Installation

```bash
pip install -e .
```

- If a CUDA toolchain is available, setup builds the extension with CUDA sources.
- Otherwise it builds a CPU-only extension that keeps the same Python API.
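The CUDA-or-CPU decision at build time typically reduces to probing for a toolchain. A minimal sketch of such a probe, assuming nothing about the actual `setup.py` logic (the function name `has_cuda_toolchain` is hypothetical):

```python
import os
import shutil

def has_cuda_toolchain():
    """Heuristic probe: a CUDA build is plausible if nvcc is on PATH, or
    CUDA_HOME/CUDA_PATH points at an installed toolkit."""
    if shutil.which("nvcc") is not None:
        return True
    for var in ("CUDA_HOME", "CUDA_PATH"):
        home = os.environ.get(var)
        if home and os.path.exists(os.path.join(home, "bin", "nvcc")):
            return True
    return False
```

A build script would then select CUDA sources or a CPU-only source list based on the probe, keeping the Python-facing API identical in both cases.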
## Building with CMake

```bash
cmake -S . -B build
cmake --build build -j
```

## Quickstart

```python
import torch
import expertrt

tokens = torch.randn(4096, 1024, device="cuda")
logits = torch.randn(4096, 16, device="cuda")

route_out = expertrt.route(logits, top_k=2)
dispatch_out = expertrt.dispatch(tokens, route_out)

# Expert model compute would consume dispatch_out.packed_tokens.
expert_outputs = dispatch_out.packed_tokens

combined = expertrt.combine(expert_outputs, dispatch_out, route_out, num_tokens=tokens.shape[0])
```

## Benchmarks

```bash
python benchmarks/bench_dispatch.py --tokens 4096 --hidden 4096 --experts 16 --top-k 2
python benchmarks/bench_combine.py --tokens 4096 --hidden 4096 --experts 16 --top-k 2
python benchmarks/bench_end_to_end.py --tokens 8192 --hidden 4096 --experts 32 --top-k 2
```

See docs/benchmarks.md for methodology and interpretation.
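Methodology matters for primitives this short: without warmup and a robust statistic, JIT/caching effects and scheduler noise dominate. A generic timing harness sketch, not the actual benchmark scripts (the `bench` helper here is an assumption):

```python
import statistics
import time

def bench(fn, warmup=10, iters=100):
    """Time fn() with warmup; return the median latency in milliseconds.

    For CUDA work, fn should synchronize internally (e.g. call
    torch.cuda.synchronize()) so the timer measures real kernel time
    rather than just async launch overhead."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)
```

The median is deliberately used instead of the mean, since latency distributions for small kernels are heavy-tailed.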
## Tests

```bash
pytest tests
```

## Project structure

```
include/expertrt/   # public runtime headers
src/                # C++ routing/dispatch/combine + bindings
cuda/               # CUDA kernel files and utility headers
python/expertrt/    # Python API, reference implementation, benchmark helpers
benchmarks/         # runnable benchmark scripts
tests/              # correctness and smoke tests
docs/               # architecture, memory layout, benchmarks, roadmap
```
## Limitations (v1)

- CUDA-specific custom kernels are scaffolded; v1 execution relies on ATen primitives (`topk`, `sort`, `bincount`, gather/scatter).
- Single-device routing semantics only; no cross-device token exchange yet.
- No fused `route` + `dispatch` kernel in v1.
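The ATen-composed dispatch path can be mirrored in plain Python to see what those primitives are doing: count slots per expert (`bincount`), exclusive-prefix-sum the counts into expert offsets, then stably scatter each routed slot into its expert's contiguous region (equivalent to a stable sort by expert id). An illustrative sketch, not the runtime's code:

```python
def pack_order(expert_ids_per_slot, num_experts):
    """Compute the expert-major packing permutation.

    Returns (order, offsets): order[i] is the original slot index placed
    at packed position i; offsets[e] is where expert e's rows begin."""
    # bincount: how many routed slots each expert receives
    counts = [0] * num_experts
    for e in expert_ids_per_slot:
        counts[e] += 1
    # exclusive prefix sum -> start offset of each expert's region
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    # stable scatter: preserves token order within each expert
    cursor = list(offsets)
    order = [0] * len(expert_ids_per_slot)
    for slot, e in enumerate(expert_ids_per_slot):
        order[cursor[e]] = slot
        cursor[e] += 1
    return order, offsets
```

Gathering token rows with `order` yields the expert-major packed buffer; `offsets` plus `counts` give each expert its contiguous row range.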
## Roadmap

- Fused route+dispatch kernels and vectorized combine.
- Better load-balancing diagnostics and optional token-cap policies.
- Distributed and multi-GPU expert routing plans.
- Integration hooks for full MoE training stacks.
See docs/roadmap.md for details.
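One of the planned load-balancing policies, token capping, is simple to state: each expert accepts at most `capacity` routed slots, and overflow slots are dropped (or, in other variants, re-routed). A sketch of the drop variant under a first-come priority, purely illustrative (the name `apply_token_cap` and its semantics are assumptions, not the planned design):

```python
def apply_token_cap(expert_ids_per_slot, num_experts, capacity):
    """Return a keep-mask: slot i survives only if its expert has
    accepted fewer than `capacity` slots so far (first-come priority)."""
    taken = [0] * num_experts
    keep = []
    for e in expert_ids_per_slot:
        ok = taken[e] < capacity
        keep.append(ok)
        if ok:
            taken[e] += 1
    return keep
```

Capping bounds per-expert buffer sizes (useful for static allocation) at the cost of dropped tokens under imbalance, which is why it is paired with better imbalance diagnostics on the roadmap.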