Run large Mixture-of-Experts models on memory-constrained Macs by loading only router-selected experts on demand from SSD. A 46 GB Qwen3-Coder-Next model runs on a 32 GB Mac at 6-23 tok/s using 19 GB, with coherent output through 1000+ tokens.
```bash
git clone https://github.com/mu-hashmi/mlx-moe.git
cd mlx-moe
pip install .
```

For development:

```bash
uv sync
```

Start an OpenAI- and Anthropic-compatible API server:

```bash
mlx-moe serve mlx-community/Qwen3-Coder-Next-4bit
```

Point any compatible client at http://127.0.0.1:8080:
```bash
# OpenAI-compatible clients
OPENAI_BASE_URL=http://localhost:8080/v1 OPENAI_API_KEY=mlx-moe <your-client>

# Anthropic-compatible clients
ANTHROPIC_BASE_URL=http://localhost:8080 ANTHROPIC_API_KEY=mlx-moe <your-client>
```

Endpoints:

- `POST /v1/chat/completions` — OpenAI chat completions (streaming and non-streaming)
- `POST /v1/messages` — Anthropic Messages API (SSE streaming)
- `GET /v1/models` — model discovery
Options: `--port`, `--host`, `--capacity`, `--profile`, `--kv-bits`.
Sampling parameters (temperature, top_p, top_k) are passed through from the request body. When not specified, the server applies model-recommended defaults (e.g. temp=1.0, top_p=0.95, top_k=40 for Qwen3-Coder). Client values always override.
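For example, starting the server with an 8-bit KV cache and sending a request that overrides the sampling defaults. This is a sketch: the request body follows the standard OpenAI chat-completions format, and whether the server validates the `model` field against the served model is an assumption here:

```bash
mlx-moe serve mlx-community/Qwen3-Coder-Next-4bit --port 8080 --kv-bits 8

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mlx-moe" \
  -d '{
        "model": "mlx-community/Qwen3-Coder-Next-4bit",
        "messages": [{"role": "user", "content": "Write a Python hello world"}],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 200
      }'
```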
```python
from mlx_moe import generate

print(generate("mlx-community/Qwen3-Coder-Next-4bit",
               "Write a Python hello world program",
               max_tokens=200))
```

Streaming:
```python
from mlx_moe import stream_generate

for response in stream_generate("mlx-community/Qwen3-Coder-Next-4bit",
                                "Write a Flask server", max_tokens=200):
    print(response.text, end="", flush=True)
```

Multi-turn sessions:
```python
from mlx_moe import Session

session = Session("mlx-community/Qwen3-Coder-Next-4bit",
                  cache_dir="~/.cache/mlx-moe")
for response in session.stream("Write a linked list in Python"):
    print(response.text, end="", flush=True)
print()
print(session.generate("Now add type hints"))
```

First launch downloads the model (~24 GB) and warms up experts (~13s). Subsequent launches: ~6s to first token.
| Model | Experts | Top-K | MoE Layers | Full Size | mlx-moe Memory | tok/s |
|---|---|---|---|---|---|---|
| Qwen3-Coder-Next-4bit | 512 | 10 | 48 | 46 GB (OOM) | 19.1 GB | 23 |
Any MLX model using SwitchGLU is supported — covers Qwen, Mixtral, GLM, DeepSeek, Hunyuan, PhiMoE, Jamba, OLMoE, MiniMax, and GraniteMoE families. Both stacked and per-expert safetensors formats are handled automatically. Models with complex gates (e.g. GLM's MoEGate) are also supported.
mlx-moe is valuable when a model exceeds your Mac's RAM and has a favorable MoE architecture. Three factors predict whether output quality holds up at reduced expert coverage:
- High expert ratio (expert_count / top_k ≥ 10) — more experts means more can stay on SSD
- Shared expert — provides a quality floor when routed experts miss
- Concentrated routing — a small set of "universal" experts handles most tokens; the long tail stays on SSD
| Model | Ratio | Shared Expert | Routing | Verdict |
|---|---|---|---|---|
| Qwen3-Coder-Next (512 exp) | 51x | Yes | 15% universal | Works at 40% coverage, 0% fallback |
| Qwen3-Coder-480B (512 exp) | 51x | Yes | Same arch | Primary target — same arch, ~200 GB |
| Qwen3-235B (128 exp) | 16x | Yes | Same arch | Good target — ~130 GB |
| GLM-4.7-Flash (64 exp) | 16x | Yes | 44% universal | Works, but fits on 32 GB natively |
| Qwen3-30B-A3B (128 exp) | 16x | No | 14% universal | Needs ~75% coverage — fits natively |
| Qwen2-MoE-57B (64 exp) | 8x | Yes | 95% universal | All experts needed — quality degrades |
| Mixtral-8x7B (8 exp) | 4x | No | 100% universal | Ratio too low |
Models without a shared expert collapse to garbage below ~75% expert coverage. Models with uniform routing can't benefit from selective caching. The sweet spot is high ratio + shared expert + concentrated routing.
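To check the first factor for a candidate model, you can read the expert counts from its Hugging Face config.json. A minimal sketch; the key names (`num_experts`, `num_local_experts`, `num_experts_per_tok`) vary by model family, so treat them as assumptions:

```python
import json
from pathlib import Path

def expert_ratio(model_dir):
    # Read the MoE config; key names differ across families (assumption)
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    n_experts = cfg.get("num_experts") or cfg.get("num_local_experts")
    top_k = cfg.get("num_experts_per_tok")
    return n_experts / top_k

# e.g. 512 experts with top-10 routing -> 51.2, comfortably above the >= 10 threshold
```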
| System RAM | Capacity | Memory | tok/s | Notes |
|---|---|---|---|---|
| 32 GB | 208 | 19 GB | 8-23 | Recommended |
| 48 GB | 320 | ~28 GB | 15-25 | Estimated |
| 64 GB | 432 | ~37 GB | 20-30 | Estimated |
| 128 GB | 512 | ~44 GB | 30-40 | Full expert coverage |
"Capacity" = experts cached per MoE layer. Auto-selected based on available RAM.
Each MoE layer has hundreds of experts but only routes to a few per token (e.g. 10 out of 512 for Qwen). mlx-moe replaces the expert weight modules with lazy-loading versions that:
- Load the model without expert weights (~1.4 GB instead of 46 GB)
- Discover which experts matter via router-only forward pass (~1s)
- Load only those experts into GPU-resident stacked tensors (~15s)
- Generate with zero-eval dispatch — no `mx.eval()` in the forward pass, preserving MLX's async pipeline
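The caching idea, sketched in miniature. This is illustrative, not the actual mlx-moe internals; `load_fn` is a hypothetical callback that reads one expert's weights from SSD:

```python
from collections import OrderedDict

class LRUExpertCache:
    """Per-layer LRU cache of expert weights with pinned 'universal' experts."""

    def __init__(self, capacity, load_fn, pinned=()):
        self.capacity = capacity   # experts kept resident for this layer
        self.load_fn = load_fn     # hypothetical: expert_id -> weights read from SSD
        self.pinned = set(pinned)  # universal experts, never evicted
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)      # SSD read on cache miss
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            for eid in self.cache:             # evict the oldest non-pinned entry
                if eid not in self.pinned:
                    del self.cache[eid]
                    break
        return weights
```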
Pre-computed profiles in `profiles/` identify universally activated experts per model. These are determined by the model's router weights (the same for everyone) and serve two purposes:
- Pinning — universal experts stay in cache permanently, preventing quality degradation at 300+ tokens
- Fast cold start — skip the discovery step entirely (13s instead of 17s)
A profile is shipped for Qwen3-Coder-Next. Generate profiles for other models with:
```bash
uv run python benchmarks/profile_experts.py --model mlx-community/Some-MoE-Model-4bit
```

Impact of each optimization:

| Config | tok/s | Repetition @ 1000 tokens |
|---|---|---|
| Baseline | 8.8 | 0.20 (degrades at ~250) |
| + Pinning | 8.7 | 0.03 |
| + Wired limit | 10.6 | 0.03 |
| + All optimizations (burst) | ~23 | 0.03 |
| Scenario | Time |
|---|---|
| Warm start (prepacked weights exist) | ~6s |
| Cold start (with profile) | ~13s |
| Cold start (no profile, router discovery) | ~17s |
| Domain switch (delta warmup) | ~2-3s |
Quantize the KV cache to 8-bit to reduce memory for longer contexts. Saves ~45% KV memory with no quality degradation.
```python
from mlx_moe import generate

text = generate("mlx-community/Qwen3-Coder-Next-4bit",
                "Your prompt", kv_bits=8)

# Or via the server:
# mlx-moe serve mlx-community/Qwen3-Coder-Next-4bit --kv-bits 8
```

| Context Length | fp16 KV | 8-bit KV |
|---|---|---|
| 4K tokens | 384 MB | ~210 MB |
| 8K tokens | 768 MB | ~420 MB |
| 16K tokens | 1.5 GB | ~820 MB |
| 32K tokens | 3.0 GB | ~1.6 GB |
Uses mlx-lm's `QuantizedKVCache` with `quantized_kv_start=0` (quantize from the first token). Available on all APIs: `generate`, `stream_generate`, `Session`, and `mlx-moe serve`.
Tested over 20 turns with varied prompts:
- Memory growth: +0.00 GB (19.04 GB flat)
- Speed: 5-20 tok/s (improves as caches warm)
- No KV cache leak, no degradation wall
`generate()` — one-call generation. Auto-selects capacity, loads cached state, applies pinning.
```python
text = generate(
    "mlx-community/Qwen3-Coder-Next-4bit",
    "Your prompt",
    max_tokens=200,
    cache_dir="~/.cache/mlx-moe",                        # persist state between runs
    profile_path="profiles/qwen3-coder-next-4bit.json",  # enable pinning
    kv_bits=8,                                           # quantize KV cache (optional)
)
```

`stream_generate()` — same as `generate` but yields `GenerationResponse` objects token-by-token.
`Session` — reusable session for multi-turn generation. Loads the model once, runs delta warmup automatically when prompts change domain.
```python
session = Session("mlx-community/Qwen3-Coder-Next-4bit",
                  cache_dir="~/.cache/mlx-moe")

session.stream(prompt, max_tokens=200)    # yields GenerationResponse
session.generate(prompt, max_tokens=200)  # returns str
session.memory_gb                         # current GPU memory
session.close()                           # release model
```

For full control over the loading pipeline:
```python
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import hf_repo_to_path

from mlx_moe.lazy_experts import (
    enable_lazy_experts, upgrade_to_predictive, get_fallback_stats
)

# Load the model lazily (no expert weights), then swap in lazy expert modules
model, tokenizer = mlx_lm.load("mlx-community/Qwen3-Coder-Next-4bit", lazy=True)
model_path = hf_repo_to_path("mlx-community/Qwen3-Coder-Next-4bit")
enable_lazy_experts(model, model_path, cache_capacity_per_layer=208, predictive=True)
mx.eval(model.parameters())

# Short warmup generation so the router reveals which experts matter,
# then upgrade to predictive mode at the same capacity
mlx_lm.generate(model, tokenizer, prompt="Hello", max_tokens=10, verbose=False)
upgrade_to_predictive(model, model_path, 208)

output = mlx_lm.generate(model, tokenizer, prompt="Write a Flask server",
                         max_tokens=200, verbose=False)
```

Run the test suite:

```bash
uv run pytest  # 100 tests, ~0.5s (no model download needed)
```

The test suite covers unit tests (ExpertCache, select_capacity, module detection, persistence roundtrips) and integration tests (synthetic 8-expert model through the full lazy → predictive → generate pipeline, skip-fallback with cache misses, golden reference comparison, server endpoints).
Smoke tests against real models are in `benchmarks/` and run manually:

```bash
uv run python benchmarks/test_model.py mlx-community/Qwen3-Coder-Next-4bit
uv run python benchmarks/test_model.py mlx-community/Qwen3-Coder-Next-4bit --capacity 208 --tokens 50
```

Quality and memory validation across diverse prompts:
```bash
uv run python benchmarks/validate_quality.py quality  # diverse prompt quality test
uv run python benchmarks/validate_quality.py memory   # memory growth over 500 tokens
```

These take minutes (model download + cold start + generation) and require a Mac with enough RAM for the model.
- Quality cliff below capacity 192 — garbled output regardless of warmup (Qwen)
- Metal pressure cliff above capacity 208 on 32 GB — 3x eval degradation
- Mild repetition at 1000+ tokens — model limitation at 40% expert coverage
- Chat template required for some models (GLM-4.7-Flash)
- Not thread-safe — MLX GPU eval is single-threaded; the server serializes requests
- Semaphore leak warning on Ctrl+C — cosmetic; the `os._exit(0)` handler bypasses Python's cleanup because MLX Metal ops block the GIL
MIT