🧠 MELLM — Lightweight Modular AI Routing Engine

Run multiple small, domain-specialized LLMs instead of one massive general model.
Better answers. Less VRAM. Faster responses.

⚡ TL;DR

MELLM is a lightweight AI router that:

Uses a small model to classify queries with high precision.
Routes them to domain-specific experts (code, math, medical, legal).
Conserves VRAM by keeping only one expert model active at a time.

👉 Better accuracy than small general models
👉 Runs on 6GB GPUs (RTX 3050, GTX 1060)
👉 No more waiting for massive 70B models to layer-stream

🔍 What is MELLM?

Large general-purpose models are trained to know everything — which means they're not optimised for anything in particular. A 70B general model answering a medical question is overkill on resources and often outperformed by a 7B model that was fine-tuned specifically on medical literature.

Meanwhile, running models larger than 14B is completely out of reach on consumer hardware like a 6GB GPU. You're stuck choosing between a small general model that gives mediocre answers, or a large specialist model you can't run.

MELLM solves this differently.

Instead of running one large model, MELLM runs a tiny router model that reads your query, identifies the domain, and loads the right specialist model — a small model fine-tuned specifically for that task. When you ask a medical question, you get a medical model. When you ask for code, you get a code model. When your query spans multiple domains, MELLM decomposes it and runs each part through the appropriate specialist simultaneously.

The result: expert-level answers from models that fit on your GPU, at a fraction of the compute cost of running a monolithic large model.

User query → Router (1.5B) → identifies domain → loads specialist (1.5–7B)
                                                 → generates response
                                                 → unloads, ready for next query

This architecture isn't just a consumer hardware workaround — it's a fundamentally more efficient approach. The same principle applies at production scale: routing to purpose-built specialists is cheaper and more accurate than throwing a 70B general model at every query.

The numbers

Approach	Model needed	VRAM	Query time	Domain accuracy
General monolithic	14B–70B	16–80 GB	minutes	moderate
MELLM specialist routing	1.5B–7B	6 GB	8–20s	high

A domain-specific 7B model can often outperform a larger general model on tasks within its specialty — while using significantly less compute.

🧠 Traditional vs MELLM

Feature	Traditional LLM	MELLM
Models used	1 large monolithic model	Multiple small experts
VRAM usage	Always high (OOM risk)	Minimal (fit on consumer GPUs)
Accuracy	Generalist / Average	Domain-optimized / Expert
Efficiency	Low (brute-force scaling)	High (intelligent routing)

💡 Why this matters

Running large LLMs locally is expensive and often impractical. MELLM shows that:

You don’t need a single massive model to get expert answers.
You can orchestrate experts to achieve better results with less hardware.
Efficient routing beats brute-force scaling every time.

This opens the door to local AI systems on consumer hardware, modular architectures that grow with your needs, and lower-cost deployment at scale.

🎯 MELLM vs Mixture-of-Experts — What's Different?

You may have heard of Mixture-of-Experts (MoE) — the architecture behind models like Mixtral and GPT-4. The concept is similar: route inputs to specialized sub-networks. But production MoE has critical limitations that MELLM directly addresses.

	Production MoE (Mixtral, GPT-4)	MELLM
Access	Closed, trained by big labs	Fully open — swap any GGUF model
Hardware	Requires multi-GPU clusters	Runs on an RTX 3050 or even CPU
Experts	Fixed at training time	Plug-and-play — add new domains anytime
Routing	Statistical (trained gating network)	LLM-based — actually understands your query
Focus	Research / production API	Developer tool / local framework

🧩 Think: "App Store for LLM Experts"

MELLM's open architecture lets anyone contribute a new specialist:

Add a finance model for stock analysis
Add a science model for research queries
Add a language model for translation tasks

No retraining, no new architecture — just register it in GGUF_REGISTRY and it's immediately available as a domain expert.

🔧 LLM-as-Router — A Cutting-Edge Twist

Most open-source routers use keyword matching or fine-tuned classifiers. MELLM uses an LLM in JSON-output mode to both classify the domain and rewrite the prompt for optimal specialist performance. This "LLM as router" approach is an active research direction — MELLM brings it to a practical, runnable implementation.

Honest positioning: MELLM is not a new concept. But open, modular, low-hardware MoE routing is genuinely underserved in the open-source ecosystem. That's the gap this fills.

Key Features

🔀 Intelligent Routing — A 1.5B router classifies queries into 5 domains with >90% accuracy and rewrites prompts for optimal specialist output
🧩 Multi-Agent Composition — Cross-domain queries are automatically decomposed, each part routed to the right specialist, and merged into one response
🧬 Domain Specialists — Fine-tuned models for medical, legal, math, code, and general knowledge — each optimised for its task
⚡ Persistent Router — Router stays resident in VRAM; zero routing overhead after startup
🔥 Hot Specialist Cache — Active specialist stays loaded between same-domain queries; only swapped on domain switch
🧠 Conversation Context — 3-turn history window so follow-up queries like "Now in Python?" work correctly
🎯 Domain Continuity — Short follow-ups inherit the current domain automatically
🖥️ Interactive Setup Wizard — Hardware-aware onboarding detects your GPU and recommends appropriate model sizes
🌐 REST API — FastAPI endpoint so any app can use MELLM as a backend
⬇️ Auto-Download — Models download from Hugging Face on first use, cached locally

🏗️ Architecture

                      ┌──────────────────────────────────────┐
                      │   STARTUP — once per session         │
                      │   Router (1.5B Qwen) loads into VRAM │
                      └──────────────────┬───────────────────┘
                                         │ stays resident ↓
┌──────────────┐    ┌─────────────┐    ┌─────────────────────┐
│  User Query  │───▶│  Add context│───▶│  Router: Classify   │
│              │    │  (last 3    │    │  domain + rewrite   │
│              │    │   turns)    │    │  prompt (JSON mode) │
└──────────────┘    └─────────────┘    └──────────┬──────────┘
                                                   │
                               domain continuity   │
                               bias applied here   │
                                                   ▼
              ┌────────────────────────────────────────────────┐
              │  Hot Cache Check                               │
              │  Same domain? → reuse loaded specialist (0s)   │
              │  New domain?  → unload old, load new (1-6s)    │
              └───────────────────┬────────────────────────────┘
                                   │
          ┌────────────────────────▼──────────────────────────┐
          │  Specialist generates response (stays in VRAM)    │
          │  History appended → available for next query      │
          └───────────────────────────────────────────────────┘

VRAM Strategy

MELLM uses a two-tier residency model to maximize responsiveness within 6GB VRAM:

Model	Residency	Why
Router (1.5B)	Always in VRAM	~1 GB, permanent resident; no per-query overhead
Active Specialist	In VRAM until domain changes	Only one specialist at a time; swapped on domain switch
Other Specialists	On disk (GGUF cache)	Loaded on-demand in 1-6s

📋 Model Registry

All models use the GGUF quantized format for efficient inference via llama-cpp-python.

Role	Domain	Model	GGUF File	Size	Context
Router	All	Qwen2.5-1.5B-Instruct	`qwen2.5-1.5b-instruct-q4_k_m.gguf`	~1 GB	4096
Specialist	Code	Qwen2.5-Coder-1.5B-Instruct	`qwen2.5-coder-1.5b-instruct-q4_k_m.gguf`	~1.1 GB	4096
Specialist	Math	Qwen2.5-Math-1.5B-Instruct	`Qwen2.5-Math-1.5B-Instruct-Q4_K_M.gguf`	~986 MB	4096
Specialist	Medical	BioMistral-7B-DARE	`ggml-model-Q2_K.gguf`	~2.3 GB	1024
Specialist	Legal	Magistrate-3.2-3B-IT	`magistrate-3.2-3b-it.Q4_K_M.gguf`	~1.8 GB	1024
Specialist	General	Qwen2.5-1.5B-Instruct	`qwen2.5-1.5b-instruct-q4_k_m.gguf`	~1.1 GB	4096

Note: Larger models (7B+) automatically use a reduced context window (1024 tokens) to stay within 6GB VRAM limits. Smaller models (≤1.5B) use the full 4096 context.

📁 Project Structure

MELLM/
├── cli.py                    # Rich terminal interface with session efficiency UI
├── api.py                    # FastAPI REST server
├── orchestrator.py           # Core pipeline: persistent router, hot cache, conversation history
├── config.yaml               # Model IDs, token limits, and specialist configuration
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (HF_TOKEN)
│
├── loader/
│   ├── __init__.py
│   └── airllm_loader.py      # GGUF model loader with auto-download and VRAM management
│
├── router/
│   ├── __init__.py
│   ├── classifier.py         # LLM-based query classifier (JSON output mode)
│   └── prompt_optimizer.py   # Rule-based fallback prompt templates
│
├── webui/
│   └── index.html            # Single-file web UI (served by FastAPI at /)
│
└── specialists/
    ├── __init__.py
    ├── base_specialist.py    # Abstract base class for all specialists
    ├── code.py               # Code generation specialist (temp=0.1)
    ├── math_specialist.py    # Math problem solver (temp=0.1)
    ├── medical.py            # Medical Q&A with system prompt (temp=0.7)
    ├── legal.py              # Legal information specialist
    └── general.py            # General knowledge fallback

⚙️ Hardware Requirements

Component	Minimum	Recommended
GPU	NVIDIA GTX 1060 (6GB)	NVIDIA RTX 3050+ (6GB+)
VRAM	6 GB	8 GB+
RAM	8 GB	16 GB+
Storage	10 GB free	20 GB+ free
CUDA	11.7+	12.1+
Python	3.10+	3.11+
OS	Linux (Ubuntu 22.04+)	Linux with NVIDIA drivers

Hardware Scalability

MELLM works with any GGUF model on HuggingFace. The setup wizard automatically recommends model sizes based on your GPU. Users with larger GPUs can run 13B, 34B, or 70B specialists by simply updating GGUF_REGISTRY and config.yaml — no code changes needed.

🚀 Getting Started

Linux

1. Clone the Repository

git clone https://github.com/Rahul-14507/MELLM
cd MELLM

2. Create a Virtual Environment

python -m venv .venv
source .venv/bin/activate

3. Install llama-cpp-python (with CUDA)

⚠️ Important: llama-cpp-python must be installed separately first with CUDA wheels. Installing it via pip install -r requirements.txt alone will NOT enable GPU acceleration.

# For CUDA 12.1+ (most modern NVIDIA GPUs)
pip install llama-cpp-python \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

# For CUDA 11.8 (older GPUs)
pip install llama-cpp-python \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu118

Verify GPU support:

python -c "from llama_cpp import Llama; print('llama-cpp-python installed successfully')"

4. Install Remaining Dependencies

pip install -r requirements.txt

5. Set Up Environment Variables

cp .env.example .env
# Or create .env manually:
echo "HF_TOKEN=your_huggingface_token_here" > .env

You need a Hugging Face token to download models. Some models may require accepting their license on the Hugging Face model page first.

6. Run MELLM

# Interactive CLI mode
python cli.py

# Pre-download all models at once
python cli.py --preload all

# Pre-download a specific domain
python cli.py --preload medical

# Web UI Mode
python cli.py --web
# Opens http://localhost:8000 in your browser automatically

Interactive Web Features:

Live Streaming: Tokens appear instantly with polished animations.
Expert Panels: Multi-agent results stream into dedicated parallel panels.
Domain Indicators: Visual badges identify which specialist is responding.
Stop Generation: Interrupt long responses instantly with the "Stop" button.
One-Click Regenerate: Easily retry any query to get a fresh response.
Robust Code Rendering: Proper syntax highlighting with headers and one-click "Copy" buttons.
Session Dashboard: Live sidebar showing VRAM status, cache hits, and time saved.

Windows (via WSL2)

MELLM runs natively on Linux. Windows users can run it with full GPU acceleration through WSL2 (Windows Subsystem for Linux 2), which gives you a real Ubuntu environment with direct access to your NVIDIA GPU.

Prerequisites:

Windows 10 (Build 19041+) or Windows 11
NVIDIA GPU with 6GB+ VRAM
NVIDIA driver 470.76+ installed on Windows (not inside WSL — the Windows driver handles the bridge)

Step 1 — Enable WSL2

Open PowerShell as Administrator and run:

wsl --install

This installs WSL2 with Ubuntu 22.04 by default. Restart when prompted.

If you already have WSL1, upgrade to WSL2:

wsl --set-default-version 2
wsl --install -d Ubuntu-22.04

Verify WSL2 is active:

wsl --list --verbose
# Should show VERSION 2

Step 2 — Install NVIDIA CUDA inside WSL2

Important: Do NOT install the NVIDIA driver inside WSL2. The Windows driver already handles GPU access. You only need the CUDA toolkit inside WSL2.

Open your Ubuntu WSL2 terminal and run:

# Remove any existing CUDA GPG keys to avoid conflicts
sudo apt-key del 7fa2af80

# Add NVIDIA's WSL2-specific CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.0-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.0-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-1

Verify CUDA is working:

nvidia-smi
# Should show your GPU name and VRAM

Step 3 — Install Python 3.11

sudo apt update
sudo apt install software-properties-common -y
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update
sudo apt install python3.11 python3.11-venv python3.11-dev -y

# Verify
python3.11 --version

Step 4 — Clone and set up MELLM

# Navigate to a good location (WSL2 filesystem, not /mnt/c — see note below)
cd ~
git clone https://github.com/Rahul-14507/MELLM
cd MELLM

# Create venv
python3.11 -m venv .venv
source .venv/bin/activate

# Install llama-cpp-python with CUDA support
pip install llama-cpp-python \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

# Verify GPU is detected
python -c "from llama_cpp import Llama; import ctypes; print('OK')"

# Install remaining dependencies
pip install -r requirements.txt

Step 5 — Run MELLM

source .venv/bin/activate
python cli.py

The setup wizard will detect your GPU and recommend appropriate model sizes.

⚠️ Important Notes for Windows Users

Store your project on the WSL2 filesystem, not Windows drives.

# Good — fast WSL2 filesystem
~/MELLM/

# Avoid — slow cross-filesystem access, causes I/O bottlenecks
/mnt/c/Users/YourName/MELLM/
/mnt/d/Projects/MELLM/

Model downloads go to ~/.cache/huggingface/ inside WSL2 by default. This is fine. If you want them on a Windows drive with more space, symlink it:

mkdir -p /mnt/d/mellm-models
ln -s /mnt/d/mellm-models ~/.cache/mellm_gguf

VRAM headroom on Windows is ~500MB–1GB less than native Linux because Windows desktop processes and WSL2 overhead occupy some GPU memory. On a 6GB card:

Setup	Effective VRAM for MELLM
Native Linux	~5.5 GB
WSL2 on Windows	~4.5–5.0 GB

If you hit Failed to create llama_context errors, switch to smaller/more quantized models in the setup wizard (choose Q2_K variants for 7B models, or stick to 1.5B–3B specialists).

Accessing MELLM's API from Windows browsers/apps:

The FastAPI server inside WSL2 is accessible from Windows at http://localhost:8000 by default — WSL2 automatically bridges the network. No extra configuration needed.

# Inside WSL2
python api.py

# From Windows browser or app
curl http://localhost:8000/health

WSL2 Troubleshooting

nvidia-smi not found inside WSL2: Update your Windows NVIDIA driver to 470.76+. The WSL2 GPU bridge requires a recent driver on the Windows side.

CUDA error: no kernel image is available during inference: Your CUDA toolkit version doesn't match your driver. Re-run Step 2 with the CUDA version that matches your driver (check nvidia-smi on Windows for the CUDA version).

Slow model downloads: WSL2 DNS can be slow. Add Google's DNS to /etc/resolv.conf:

echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

Out of memory after loading medical model: Windows is using more VRAM than expected. Close GPU-heavy Windows apps (Chrome with hardware acceleration, games, etc.) before running MELLM, or switch to the Q2_K medical model in user_config.yaml.

💻 Usage

⚡ Quick Demo

python cli.py

Try these queries to see the router in action:

"Explain binary search in Java"
"Now in Python"
"What is the time complexity?"

CLI Mode

python cli.py

On startup, MELLM loads the router model and displays a Model Availability Dashboard showing which specialists are cached. The router shows Loaded (persistent) — it's already in VRAM before your first query.

Query: Binary Search in Java

Domain: CODE
╭─── Response (Specialist: code) ───╮
│ Here is a Java implementation...  │
╰───────────────────────────────────╯
Metrics: Router: resident (0s) | Specialist Load: 1.63s | Inference: 8.42s | Context: 1 turns
╭──────────────── ⚡ Efficiency ────────────────╮
│ Queries this session : 1                       │
│ Specialist cache hits: 0/1 (0%)               │
│ Router loads saved   : 0 (~0.0s saved)        │
│ Active specialist    : CODE (freshly loaded)  │
│ Context turns active : 1/3                    │
╰───────────────────────────────────────────────╯

Query: Now in Python?
  → Domain continuity: short follow-up, keeping 'code'
  → Cache hit — reusing code specialist (0s)
╭─── Response (Specialist: code) ───╮
│ Here is the Python equivalent...  │
╰───────────────────────────────────╯
Metrics: Router: resident (0s) | Specialist Load: 0s | Inference: 7.1s | Context: 2 turns

CLI Commands:

Command	Effect
(any query)	Routes through MoE pipeline
`clear`	Wipes conversation history, starts fresh
`exit` / `quit`	Cleanly unloads all models from VRAM and exits

Multi-Agent Queries

MELLM automatically detects when a query spans multiple domains and routes sub-tasks to the appropriate specialists, then merges the results into a single coherent response:

Query: "Explain binary search AND give Java code AND analyse its complexity"

→ GENERAL specialist: conceptual explanation
→ CODE specialist:    Java implementation
→ MATH specialist:    O(log n) complexity analysis
→ Merged into a single response

This happens transparently — just ask your cross-domain question and MELLM handles the rest.

API Mode

python api.py
# Server starts at http://0.0.0.0:8000

Endpoints:

Method	Endpoint	Description
`GET`	`/health`	Health check and router status
`POST`	`/query`	Process a query through the router pipeline

Example Request:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What are the symptoms of appendicitis?"}'

Example Response:

{
  "original_prompt": "What are the symptoms of appendicitis?",
  "domain": "medical",
  "confidence": 0.95,
  "rewritten_prompt": "...",
  "response": "The common symptoms of appendicitis include...",
  "router_load_time": 0.0,
  "specialist_load_time": 10.81,
  "inference_time_seconds": 5.23,
  "cache_hit": false,
  "context_turns": 1
}

Preloading Models

Pre-download models so they're ready instantly:

# Download ALL specialist models
python cli.py --preload all

# Download only the medical specialist
python cli.py --preload medical

⚡ Performance & Benchmarks

Benchmarked on NVIDIA GeForce RTX 3050 6GB Laptop GPU · CUDA 12.1 · Python 3.11

Model Load & Inference

Stage	1.5B Models	3B Models	7B Models (Q2_K)
Model Load (first time)	~1-2s	~3-4s	~5-6s
Model Load (cache hit)	0s	0s	0s
Inference Speed	15-25 tok/s	10-15 tok/s	8-12 tok/s
VRAM Usage	~1.5 GB	~2.5 GB	~3-4 GB
Context Window	4096 tokens	1024 tokens	1024 tokens

End-to-End	1st Query	2nd Query (Same Domain)	2nd Query (Domain Switch)
Router overhead	0s (persistent)	0s (persistent)	0s (persistent)
Specialist load	1-6s	0s (hot cache)	1-6s (new domain)
Inference (1.5B)	~5-15s	~5-15s	~5-15s
Total typical query	~6-20s	~5-15s	~6-20s

Routing Accuracy

Tested across 25 queries (5 per domain):

Domain	Correct	Accuracy
Code	4/5	80%
Math	5/5	100%
Medical	5/5	100%
Legal	5/5	100%
General	4/5	80%
Overall	22/25	88%

Misclassified queries were genuinely ambiguous (e.g. "Debug this segmentation fault in C++" → classified as general; "Explain Occam's Razor" → classified as math). Core domain classification is 100% accurate for unambiguous queries.

End-to-End Latency (per domain)

Domain	Model	Cold Load	Hot Cache	Inference	Total (cold)
Code	Qwen2.5-Coder-1.5B	~3.4s	0.0s	~7.2s	~17s
Math	Qwen2.5-Math-1.5B	~3.8s	0.0s	~9.5s	~20s
Medical	BioMistral-7B Q2	~6.3s	0.0s	~18.6s	~32s
Legal	Magistrate-3B	~5.8s	0.0s	~18.5s	~31s
General	Qwen2.5-1.5B	~0.0s*	0.0s	~2.3s	~10s

* General specialist was already cached from router (same base model)

Router overhead: 0s — persistent in VRAM, never reloaded between queries.

Hot cache = 0s load — same-domain follow-up queries skip loading entirely. Consecutive queries in a coding session: first query ~17s, all subsequent ~12s.

🔧 Configuration

All model and generation settings are in config.yaml:

router:
  model_id: "Qwen/Qwen2.5-1.5B-Instruct"
  max_new_tokens: 256

specialists:
  code:
    model_id: "Qwen/Qwen2.5-Coder-1.5B-Instruct"
    max_new_tokens: 2048
  math:
    model_id: "Qwen/Qwen2.5-Math-1.5B-Instruct"
    max_new_tokens: 2048
  medical:
    model_id: "BioMistral/BioMistral-7B-DARE-GGUF"
    max_new_tokens: 2048
  legal:
    model_id: "AdaptLLM/law-LLM"
    max_new_tokens: 2048
  general:
    model_id: "Qwen/Qwen2.5-1.5B-Instruct"
    max_new_tokens: 2048

Adding a New Specialist

Add the model to GGUF_REGISTRY in loader/airllm_loader.py:

GGUF_REGISTRY = {
    # ... existing models ...
    "your-org/your-model": (
        "gguf-repo-id/your-model-GGUF",
        "your-model.Q4_K_M.gguf"
    ),
}

Create a specialist class in specialists/your_domain.py:

from .base_specialist import BaseSpecialist

class YourSpecialist(BaseSpecialist):
    def generate(self, prompt: str) -> str:
        response = self.model.create_chat_completion(
            messages=[{"role": "user", "content": prompt}],
            max_tokens=self.max_new_tokens,
            temperature=0.7,
        )
        return self._postprocess(
            response["choices"][0]["message"]["content"]
        )

Register it in orchestrator.py:

from specialists.your_domain import YourSpecialist

SPECIALIST_MAP = {
    # ... existing specialists ...
    "your_domain": YourSpecialist,
}

Add it to config.yaml:

specialists:
  your_domain:
    model_id: "your-org/your-model"
    max_new_tokens: 2048

Update the router's domain list in router/classifier.py to include your new domain.

🗺️ Roadmap

🐛 Known Limitations

First-query latency: The first query for a new domain takes 1-6s to load the specialist; subsequent same-domain queries are instant (hot cache)
Context window: 7B models are limited to 1024 tokens of context to fit in VRAM
Single query at a time: The system processes one query before accepting the next
Context continuity for the API: The conversation history is session-bound to the LLMRouter instance; the REST API resets on each server restart
Domain continuity limitations: Very ambiguous short queries (e.g., "More?") are biased toward the previous domain, which may not always be correct

🛠️ Tech Stack

Component	Technology	Purpose
Inference Engine	llama-cpp-python	GPU-accelerated GGUF model inference
Model Source	Hugging Face Hub	Auto-download quantized GGUF models
CLI Interface	Rich	Beautiful terminal dashboards and panels
REST API	FastAPI + Uvicorn	Async HTTP server
Configuration	PyYAML	YAML-based model and parameter configuration
Environment	python-dotenv	Secure token management via `.env`
Model Format	GGUF	Quantized model format (Q2_K, Q4_K_M)

🤝 Contributing

Contributions are welcome! See CONTRIBUTING.md for the full guide covering development setup, how to add a new specialist, code style, and the PR process.

Quick ideas to get started:

🌍 New domain specialists — Add support for science, finance, history, etc.
🧪 Evaluation benchmarks — Build test suites to measure routing accuracy and specialist quality
🖥️ Web UI — Build a Gradio or Streamlit frontend
📊 Metrics dashboard — Track routing accuracy, latency, and VRAM usage over time
🏎️ Performance optimization — Explore batch inference, speculative decoding, or model caching strategies
📝 Better prompts — Improve specialist system prompts for higher-quality responses
🐛 Bug fixes — Check the Issues tab for known bugs

📄 License

This project is licensed under the MIT License — see the LICENSE file for details.

Note: The individual models used by MELLM have their own licenses. Please check each model's Hugging Face page for their specific terms before redistribution.

🙏 Acknowledgements

Qwen Team — For the excellent Qwen2.5 series of small instruction-tuned models
BioMistral — For the medical domain-adapted BioMistral-7B-DARE
mradermacher — For the Magistrate legal model GGUF quantization
bartowski — For high-quality GGUF quantizations of math models
Georgi Gerganov — For the ggml library and GGUF format
Andrei Betlen — For llama-cpp-python, the Python bindings for llama.cpp

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
agents		agents
benchmarks		benchmarks
loader		loader
router		router
setup		setup
specialists		specialists
webui		webui
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
api.py		api.py
cli.py		cli.py
config.yaml		config.yaml
mellm-demo.gif		mellm-demo.gif
orchestrator.py		orchestrator.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation