A hyper-performant SwiGLU MoE implementation, optimized for inference as well as for memory efficiency and customizability during training and finetuning.
To read more about MoMoE, check out our blog post.
# Using pip install (uv, a recommended alternative to pip/conda, also works)
pip install git+https://github.com/tilde-research/MoMoE-impl.git
# OR
# Using git clone and uv sync
git clone https://github.com/tilde-research/MoMoE-impl.git
cd MoMoE-impl
uv sync
source .venv/bin/activate

import torch
from MoMoE import MoMoE
momoe = MoMoE(
    embedding_dim=2048,        # model width D
    intermediate_dim=1024,     # expert hidden width H
    num_experts=128,           # N experts
    num_chosen_experts=8,      # K experts chosen per token
    save_percent=0,            # percent of activations saved for the backward pass, in [0, 100]
    Wl1_ND2H=None,             # optional preallocated first-layer weights, N x D x 2H
    Wl2_NHD=None,              # optional preallocated second-layer weights, N x H x D
)
B, S, D = 4, 1024, 2048  # batch size, sequence length, model width (D must match embedding_dim)
x_BSD = torch.randn((B, S, D), device="cuda", dtype=torch.bfloat16, requires_grad=True)
# the two functions below are not real; they stand in for a router
mask_NM, s_NM = get_expert_token_mask(), get_expert_token_weights()
y_BSD, tokens_per_expert_N = momoe(x_BSD, mask_NM, s_NM)

- The amount saved in the forward pass (for the backward pass) is customizable via `save_percent`, which should be in the range [0, 100]. If it is set to 0, only the minimum is saved for the backward pass and the rest is recomputed, which allows for high scalability. Setting it to 100 saves everything needed for the backward pass (still less than alternative implementations).
- It is also possible to use MoMoE with preallocated weights (see the first sketch after this list). Since it is a SwiGLU MoE, the first linear layer weights are an `N x D x 2H` tensor and the second linear layer weights an `N x H x D` tensor. Note that these weights must be contiguous in memory for the Triton kernels to work. Expert parallelism (i.e. placing different experts on different GPUs) is currently not supported.
- This repository also comes equipped with a `TopKRouter` class, which is wrapped together with `MoMoE` into the `MoE` class; this is what we use in our test file `MoMoE/test.py`.
- The router is a classic `topk + softmax` router, equipped with auxiliary-loss-free load balancing, as per this DeepSeek paper. Note that you must use a router that assigns exactly K experts per token, not a variable number of experts per token; otherwise, MoMoE will fail. A minimal routing sketch is given after this list.
- Shared experts can be enabled via `shared_experts` in our router. These are experts assigned to all tokens, and they must also be reflected in `mask_NM` and `s_NM` as full rows of values for the last `shared_experts` experts (it will not work as expected unless the shared experts are the last ones).
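As a rough illustration of the preallocated-weights option, here is a minimal sketch that builds contiguous `N x D x 2H` and `N x H x D` expert weight tensors and passes them to the constructor via `Wl1_ND2H` and `Wl2_NHD`. The shapes and parameter names follow the constructor above; the initialization scale is only an illustrative assumption.

```python
import torch
from MoMoE import MoMoE

N, D, H = 128, 2048, 1024  # num experts, model width, expert hidden width

# SwiGLU expert weights; .contiguous() guarantees the memory layout the Triton
# kernels require. The 0.02 init scale is an arbitrary choice for this sketch.
Wl1_ND2H = (0.02 * torch.randn(N, D, 2 * H, device="cuda", dtype=torch.bfloat16)).contiguous()
Wl2_NHD = (0.02 * torch.randn(N, H, D, device="cuda", dtype=torch.bfloat16)).contiguous()

momoe = MoMoE(
    embedding_dim=D,
    intermediate_dim=H,
    num_experts=N,
    num_chosen_experts=8,
    save_percent=0,
    Wl1_ND2H=Wl1_ND2H,
    Wl2_NHD=Wl2_NHD,
)
```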
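For intuition about the routing inputs, the sketch below shows a bare-bones `topk + softmax` routing function that produces a boolean expert-token mask and the matching softmax weights, assuming the `_NM` suffix means a `[num_experts, num_tokens]` layout. This is only an assumption for illustration; the bundled `TopKRouter` (see `MoMoE/test.py`) is the reference implementation, and the `router_weight_DN` projection here is hypothetical.

```python
import torch

def topk_softmax_routing(x_BSD: torch.Tensor, router_weight_DN: torch.Tensor, k: int = 8):
    """Toy router: pick exactly k experts per token and softmax their scores."""
    B, S, D = x_BSD.shape
    x_MD = x_BSD.reshape(B * S, D)                        # flatten tokens: M = B * S
    N = router_weight_DN.shape[1]

    logits_MN = x_MD.float() @ router_weight_DN.float()   # [M, N] router scores
    topk_vals, topk_idx = logits_MN.topk(k, dim=-1)       # exactly k experts per token
    weights_Mk = torch.softmax(topk_vals, dim=-1)          # softmax over the chosen k

    mask_MN = torch.zeros(B * S, N, dtype=torch.bool, device=x_BSD.device)
    s_MN = torch.zeros(B * S, N, dtype=weights_Mk.dtype, device=x_BSD.device)
    mask_MN.scatter_(1, topk_idx, True)
    s_MN.scatter_(1, topk_idx, weights_Mk)

    # Transpose to the expert-major layout implied by the _NM suffix.
    return mask_MN.T.contiguous(), s_MN.T.contiguous()
```

Shared experts, as described above, would correspond to additionally setting the last `shared_experts` rows of `mask_NM` (and the matching rows of `s_NM`) for every token.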
This implementation would not be possible without the wonders of torch and triton. We thank the PyTorch and Triton teams for everything they have done to help the AI community.
Our implementation is written to fit inside the Flash‑Linear‑Attention project, and we thank the contributors of FLA for their work. We also extend our gratitude to the developers of TorchTitan for providing a platform for LLM pre‑training.
For testing, we thank the Qwen team for their open-sourced MoE models. We also wish to thank the Megatron LM, ScatterMoE, and MegaBlocks authors for pioneering the landscape of open-source MoE kernels.
We hope you enjoy our hyper-performant MoE!
@article{costin2025momoe,
title={MoMoE: Memory optimized Mixture of Experts},
author={Costin, Bobby and Averbuch, Timor and Pai, Dhruv and Chen, Nathan and Keigwin, Ben},
journal={Tilde Research Blog},
year={2025},
month={7},
url={https://www.tilderesearch.com/blog/momoe},
note={Blog post}
}

