DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally — no PHI leaves your machine.
flowchart LR
A["PDF / Image / Text"] --> B["Dual-Engine OCR"]
B --> C["DSPy Pipeline"]
C --> D["Validated JSON"]
style A fill:#B5A89A,stroke:#8a7e72,color:#fff
style B fill:#E87461,stroke:#c25a49,color:#fff
style C fill:#E87461,stroke:#c25a49,color:#fff
style D fill:#B5A89A,stroke:#8a7e72,color:#fff
MOSAICX ships with specialized pipelines for radiology and pathology reports, a generic extraction mode that adapts to any document, plus de-identification and patient timeline summarization. Every pipeline is a DSPy module -- meaning it can be optimized with labeled data for your specific use case.
Why MOSAICX? -- Fully local (no PHI leaves your machine), schema-driven (define exactly what to extract), dual-engine OCR (handles scans and handwriting), and DSPy-optimizable (improve accuracy with your own labeled data). One CLI for radiology, pathology, de-identification, and summarization.
One-line install (Mac or Linux):
curl -fsSL https://raw.githubusercontent.com/DIGIT-X-Lab/MOSAICX/master/scripts/setup.sh | bash

Or install manually and let the setup wizard configure everything:
pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git
mosaicx setup

With uv (faster):
uv pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git
mosaicx setup

Then extract structured data from a report:
mosaicx extract --document report.pdf --mode radiology

Check health anytime with mosaicx doctor. See the full Quickstart guide for details.
pip install 'mosaicx[mcp] @ git+https://github.com/DIGIT-X-Lab/MOSAICX.git' # + MCP server
pip install 'mosaicx[query] @ git+https://github.com/DIGIT-X-Lab/MOSAICX.git' # + query stack
pip install 'mosaicx[all] @ git+https://github.com/DIGIT-X-Lab/MOSAICX.git' # everything

# 1) serve your local model (already downloaded)
vllm-mlx serve mlx-community/gpt-oss-120b-4bit --port 8000
# 2) point MOSAICX to that server
export MOSAICX_LM=openai/mlx-community/gpt-oss-120b-4bit
export MOSAICX_API_BASE=http://127.0.0.1:8000/v1
export MOSAICX_API_KEY=dummy
# 3) verify the endpoint
curl -sS --max-time 5 http://127.0.0.1:8000/v1/models
# 4) run extraction + claim verify + query
mosaicx extract --document report.pdf --mode radiology -o output.json
mosaicx verify --document report.pdf --claim "patient BP is 128/82" --level thorough
mosaicx query --document report.pdf --chat --trace

Tip
Not on Apple Silicon? Use Ollama, vLLM, or any OpenAI-compatible server. See the Getting Started guide for all backend options.
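
Step 3 above lists the models the server exposes. The same check can be scripted in Python; the following is a minimal sketch, assuming the server returns the usual OpenAI-style listing (`{"data": [{"id": ...}]}`). The helper names `fetch_models` and `model_is_served` are illustrative and not part of MOSAICX.

```python
import json
import urllib.request


def model_is_served(payload: dict, model_id: str) -> bool:
    """Return True if model_id appears in an OpenAI-style /v1/models listing."""
    return any(entry.get("id") == model_id for entry in payload.get("data", []))


def fetch_models(api_base: str, timeout: float = 5.0) -> dict:
    """GET <api_base>/models and decode the JSON body (needs a running server)."""
    url = f"{api_base.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `model_is_served(fetch_models("http://127.0.0.1:8000/v1"), "mlx-community/gpt-oss-120b-4bit")` should return `True`. The id is written here without the `openai/` prefix used in MOSAICX_LM, on the assumption that the prefix only selects the client-side provider.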
Tip
Want the fastest first success? Follow docs/quickstart.md to run extract, verify, and query end-to-end in ~10 minutes.
| Capability | Commands | Guide |
|---|---|---|
| Extract structured data from clinical documents | mosaicx extract | Pipelines |
| Create and manage templates for custom extraction targets | mosaicx template create / list / refine | Schemas & Templates |
| Verify claims and outputs against source evidence | mosaicx verify | CLI Reference |
| Query sources conversationally with citations | mosaicx query | CLI Reference |
| De-identify reports (LLM + regex belt-and-suspenders) | mosaicx deidentify | CLI Reference |
| Summarize patient timelines across multiple reports | mosaicx summarize | CLI Reference |
| Optimize pipelines with labeled data (DSPy) | mosaicx optimize, mosaicx eval | Optimization |
| Extend with custom pipelines, MCP server, Python SDK | mosaicx pipeline new, mosaicx mcp serve | Developer Guide |
Run any command with --help for full options. Complete reference: docs/cli-reference.md
# Radiology report -> structured JSON
mosaicx extract --document ct_chest.pdf --mode radiology
# Template-driven extraction (define your own fields)
mosaicx template create --describe "echo report with LVEF, valve grades, impression"
mosaicx extract --document echo.pdf --template EchoReport
# Batch-process a folder of reports
mosaicx extract --dir ./reports --output-dir ./structured --mode radiology --format jsonl
# De-identify a clinical note
mosaicx deidentify --document note.txt
# Patient timeline from multiple reports
mosaicx summarize --dir ./patient_001/ --patient P001

See the full CLI Reference for every flag and option.
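
The batch example above writes JSONL, which is convenient to post-process. Below is a minimal Python sketch for loading such output, assuming only that each non-empty line is one JSON record. The README does not specify the file names inside the output directory or the fields of each record, so the helpers (`iter_jsonl` and `load_dir`, both hypothetical) assume neither.

```python
import json
from pathlib import Path
from typing import Iterator, List


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a bad record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc


def load_dir(output_dir: Path) -> List[dict]:
    """Collect records from every *.jsonl file in a directory, in file-name order."""
    records: List[dict] = []
    for jsonl_file in sorted(output_dir.glob("*.jsonl")):
        records.extend(iter_jsonl(jsonl_file))
    return records
```

For the command above, `load_dir(Path("./structured"))` would return the records as plain dictionaries, ready for pandas or further validation.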
Important
Data stays on your machine. MOSAICX runs against a local inference server by default -- no external API calls, no cloud uploads. For HIPAA/GDPR compliance guidance and cloud backend caveats, see Configuration.
MOSAICX talks to any OpenAI-compatible endpoint via DSPy + litellm. Pick the backend that fits your hardware -- override with env vars.
| Backend | Port | Example |
|---|---|---|
| Ollama | 11434 | Works out-of-the-box, no config needed |
| llama.cpp | 8080 | llama-server -m model.gguf --port 8080 |
| vLLM | 8000 | vllm serve gpt-oss:120b |
| SGLang | 30000 | python -m sglang.launch_server --model-path gpt-oss:120b |
| vLLM-MLX | 8000 | vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8 (Apple Silicon) |
export MOSAICX_LM=openai/gpt-oss:120b
export MOSAICX_API_BASE=http://localhost:8000/v1 # point at your server
export MOSAICX_API_KEY=dummy # or your real key for cloud APIs

SSH tunneling, vLLM-MLX setup, batch tuning, and benchmarking: docs/configuration.md
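
When the three variables above are set by hand, a small preflight check can catch typos before a long batch run. This is a sketch, not MOSAICX's own config loader: `BackendConfig` and `read_backend_config` are hypothetical names, and the check assumes MOSAICX_LM follows the `<provider>/<model>` shape shown in the examples. MOSAICX itself has defaults (the table notes that Ollama needs no config), so the check only makes sense for setups that override them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BackendConfig:
    provider: str  # prefix before the first "/", e.g. "openai"
    model: str     # remainder of MOSAICX_LM, e.g. "gpt-oss:120b"
    api_base: str  # server URL without a trailing slash
    api_key: str


def read_backend_config(env: Optional[Mapping[str, str]] = None) -> BackendConfig:
    """Read the MOSAICX_* override variables and fail early if they look wrong."""
    env = os.environ if env is None else env
    missing = [name for name in ("MOSAICX_LM", "MOSAICX_API_BASE") if not env.get(name)]
    if missing:
        raise KeyError(f"unset variables: {', '.join(missing)}")
    # Split on the first "/" only, so model ids that contain "/" stay intact.
    provider, _, model = env["MOSAICX_LM"].partition("/")
    if not model:
        raise ValueError("MOSAICX_LM should look like '<provider>/<model>'")
    return BackendConfig(
        provider=provider,
        model=model,
        api_base=env["MOSAICX_API_BASE"].rstrip("/"),
        api_key=env.get("MOSAICX_API_KEY", "dummy"),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to test.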
| Engine | Approach | Best for |
|---|---|---|
| Surya | Layout detection + recognition | Clean printed text, fast |
| Chandra | Vision-Language Model (Qwen3-VL 9B) | Handwriting, complex layouts, tables |
By default both engines run in parallel, score each page, and pick the best result. Override with MOSAICX_OCR_ENGINE=surya or chandra.
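
The run-both, score, pick-best idea can be illustrated with a short Python sketch. It is not MOSAICX's implementation: the README does not describe how pages are scored, so `score_text` is a deliberately naive stand-in, and the engines are passed in as plain callables rather than real Surya or Chandra bindings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical engine interface: raw page bytes in, recognized text out.
OcrEngine = Callable[[bytes], str]


def score_text(text: str) -> float:
    """Toy quality score: share of characters that are alphanumeric or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def best_per_page(pages: List[bytes], engines: Dict[str, OcrEngine]) -> List[Dict[str, str]]:
    """For each page, run all engines concurrently and keep the top-scoring text."""
    results: List[Dict[str, str]] = []
    with ThreadPoolExecutor() as pool:
        for page in pages:
            # Submit every engine for this page, then wait for all of them.
            futures = {name: pool.submit(engine, page) for name, engine in engines.items()}
            candidates = {name: future.result() for name, future in futures.items()}
            # On a tie, max() keeps the engine listed first in `engines`.
            winner = max(candidates, key=lambda name: score_text(candidates[name]))
            results.append({"engine": winner, "text": candidates[winner]})
    return results
```

Selecting per page rather than per document means a scan that mixes printed and handwritten pages can take each page from whichever engine handled it better.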
# Essential vars -- point at your local server
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8 # model name
export MOSAICX_API_BASE=http://localhost:8000/v1 # server URL
export MOSAICX_API_KEY=dummy # or real key for cloud
# View active config
mosaicx config show

Full variable reference, .env file setup, and backend scenarios: docs/configuration.md
| Guide | Description |
|---|---|
| Quickstart | Fast setup and first successful run in ~10 minutes |
| Getting Started | Install, first extraction, basics |
| Verify Guide | Truth/adjudication workflows for claims and extraction output |
| Query Guide | Grounded multi-turn querying with evidence and confidence |
| Troubleshooting | Debug slow query, wrong stats, fallback, and runtime issues |
| Production Checklist | Deploy with reproducibility, gating, and auditability controls |
| CLI Reference | Every command, every flag, examples |
| Pipelines | Pipeline inputs/outputs, JSONL formats |
| Schemas & Templates | Create and manage extraction schemas |
| Optimization | Improve accuracy with DSPy optimizers |
| Configuration | Env vars, backends, OCR, export formats |
| MCP Server | AI agent integration via MCP |
| Developer Guide | Custom pipelines, Python SDK |
| Architecture | System design, key decisions |
git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]" # or: uv sync --group dev
pytest tests/ -q

See Developer Guide for custom pipelines and the Python SDK.
@software{mosaicx2025,
title = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
author = {Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},
year = {2025},
url = {https://github.com/DIGIT-X-Lab/MOSAICX},
doi = {10.5281/zenodo.17601890}
}

Apache 2.0 -- see LICENSE.
Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues
