GitHub - DIGIT-X-Lab/MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction of Healthcare data using local LLMs

DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally — no PHI leaves your machine.

How It Works

flowchart LR
    A["PDF / Image / Text"] --> B["Dual-Engine OCR"]
    B --> C["DSPy Pipeline"]
    C --> D["Validated JSON"]

    style A fill:#B5A89A,stroke:#8a7e72,color:#fff
    style B fill:#E87461,stroke:#c25a49,color:#fff
    style C fill:#E87461,stroke:#c25a49,color:#fff
    style D fill:#B5A89A,stroke:#8a7e72,color:#fff

MOSAICX ships with specialized pipelines for radiology and pathology reports, a generic extraction mode that adapts to any document, plus de-identification and patient timeline summarization. Every pipeline is a DSPy module, so it can be optimized with labeled data for your specific use case.

Why MOSAICX? -- Fully local (no PHI leaves your machine), schema-driven (define exactly what to extract), dual-engine OCR (handles scans and handwriting), and DSPy-optimizable (improve accuracy with your own labeled data). One CLI for radiology, pathology, de-identification, and summarization.

Quick Start

One-line install (Mac or Linux):

curl -fsSL https://raw.githubusercontent.com/DIGIT-X-Lab/MOSAICX/master/scripts/setup.sh | bash

Or install manually and let the setup wizard configure everything:

pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git
mosaicx setup

With uv (faster):

uv pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git
mosaicx setup

Then extract structured data from a report:

mosaicx extract --document report.pdf --mode radiology

Check health anytime with mosaicx doctor. See the full Quickstart guide for details.

Install Extras

pip install 'mosaicx[mcp] @ git+https://github.com/DIGIT-X-Lab/MOSAICX.git'       # + MCP server
pip install 'mosaicx[query] @ git+https://github.com/DIGIT-X-Lab/MOSAICX.git'     # + query stack
pip install 'mosaicx[all] @ git+https://github.com/DIGIT-X-Lab/MOSAICX.git'       # everything

Developer Fast Loop (Mac + vLLM-MLX, 120B)

# 1) serve your local model (already downloaded)
vllm-mlx serve mlx-community/gpt-oss-120b-4bit --port 8000

# 2) point MOSAICX to that server
export MOSAICX_LM=openai/mlx-community/gpt-oss-120b-4bit
export MOSAICX_API_BASE=http://127.0.0.1:8000/v1
export MOSAICX_API_KEY=dummy

# 3) verify the endpoint
curl -sS --max-time 5 http://127.0.0.1:8000/v1/models

# 4) run extraction + claim verify + query
mosaicx extract --document report.pdf --mode radiology -o output.json
mosaicx verify --document report.pdf --claim "patient BP is 128/82" --level thorough
mosaicx query --document report.pdf --chat --trace
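
Step 3 above can also be scripted. The following is a minimal Python sketch, assuming the server answers with the usual OpenAI-style /v1/models body (a "data" list whose entries carry an "id") and that the leading openai/ in MOSAICX_LM is the litellm provider prefix rather than part of the model ID. The function names are illustrative and are not part of MOSAICX.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Return the model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(api_base: str, timeout: float = 5.0) -> list[str]:
    """GET {api_base}/models and return the IDs the server reports."""
    url = api_base.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_model_ids(json.load(response))


def model_is_served(lm: str, served: list[str]) -> bool:
    """Check a MOSAICX_LM value such as 'openai/<model>' against served IDs."""
    # Assumption: 'openai/' is only the provider prefix; the remainder is the
    # model ID as the server knows it.
    return lm.removeprefix("openai/") in served


# Example (needs a running server, so it is left commented out):
# served = list_served_models("http://127.0.0.1:8000/v1")
# print(model_is_served("openai/mlx-community/gpt-oss-120b-4bit", served))
```

If the check returns False, the model name in MOSAICX_LM and the name the server was started with probably differ.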

Tip: Not on Apple Silicon? Use Ollama, vLLM, or any OpenAI-compatible server. See the Getting Started guide for all backend options.

Tip: Want the fastest first success? Follow docs/quickstart.md to run extract, verify, and query end-to-end in ~10 minutes.

What You Can Do

Capability	Commands	Guide
Extract structured data from clinical documents	`mosaicx extract`	Pipelines
Create and manage templates for custom extraction targets	`mosaicx template create / list / refine`	Schemas & Templates
Verify claims and outputs against source evidence	`mosaicx verify`	CLI Reference
Query sources conversationally with citations	`mosaicx query`	CLI Reference
De-identify reports (LLM + regex belt-and-suspenders)	`mosaicx deidentify`	CLI Reference
Summarize patient timelines across multiple reports	`mosaicx summarize`	CLI Reference
Optimize pipelines with labeled data (DSPy)	`mosaicx optimize`, `mosaicx eval`	Optimization
Extend with custom pipelines, MCP server, Python SDK	`mosaicx pipeline new`, `mosaicx mcp serve`	Developer Guide

Run any command with --help for full options. Complete reference: docs/cli-reference.md
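
The "LLM + regex belt-and-suspenders" idea in the de-identification row can be pictured as two passes: a model-based pass, then a deterministic regex pass that catches anything still matching a known pattern. The sketch below is only an illustration of that layering. The patterns, placeholder tags, and function names are hypothetical, are not MOSAICX's actual rules, and cover only a small part of what counts as PHI.

```python
import re

# Hypothetical patterns for illustration; not MOSAICX's rule set.
PATTERNS = [
    ("[DATE]", re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]


def regex_scrub(text: str) -> str:
    """Replace every pattern match with its placeholder tag."""
    for tag, pattern in PATTERNS:
        text = pattern.sub(tag, text)
    return text


def deidentify(text: str, llm_pass=lambda t: t) -> str:
    """Run the model-based pass first, then the regex pass as a safety net."""
    # llm_pass is a stand-in callable; by default it returns the text as-is.
    return regex_scrub(llm_pass(text))
```

Because the regex pass runs last, a date or phone number the model pass misses is still masked, while clinical values such as "BP 128/82" are left alone by these particular patterns.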

Recipes

# Radiology report -> structured JSON
mosaicx extract --document ct_chest.pdf --mode radiology

# Template-driven extraction (define your own fields)
mosaicx template create --describe "echo report with LVEF, valve grades, impression"
mosaicx extract --document echo.pdf --template EchoReport

# Batch-process a folder of reports
mosaicx extract --dir ./reports --output-dir ./structured --mode radiology --format jsonl

# De-identify a clinical note
mosaicx deidentify --document note.txt

# Patient timeline from multiple reports
mosaicx summarize --dir ./patient_001/ --patient P001

See the full CLI Reference for every flag and option.
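
After a batch run with --format jsonl, the results can be loaded back for analysis. This is a minimal sketch that relies only on the JSONL convention (one JSON object per line). It assumes the output files end in .jsonl, and it makes no assumption about which fields each record contains. The helper name is illustrative.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_records(output_dir: str) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of each *.jsonl file."""
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                try:
                    yield json.loads(line)
                except json.JSONDecodeError as exc:
                    # Report which file and line failed to parse.
                    raise ValueError(f"{path.name}:{line_number}: {exc}") from exc


# Example: records = list(iter_jsonl_records("./structured"))
```

Reading lazily keeps memory use flat for large folders; wrap the call in list(...) only when the whole batch fits in memory.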

Privacy

Important: Data stays on your machine. MOSAICX runs against a local inference server by default, with no external API calls and no cloud uploads. For HIPAA/GDPR compliance guidance and cloud backend caveats, see Configuration.

LLM Backends

MOSAICX talks to any OpenAI-compatible endpoint via DSPy + litellm. Pick the backend that fits your hardware -- override with env vars.

Backend	Port	Example
Ollama	11434	Works out-of-the-box, no config needed
llama.cpp	8080	`llama-server -m model.gguf --port 8080`
vLLM	8000	`vllm serve gpt-oss:120b`
SGLang	30000	`python -m sglang.launch_server --model-path gpt-oss:120b`
vLLM-MLX	8000	`vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8` (Apple Silicon)

export MOSAICX_LM=openai/gpt-oss:120b
export MOSAICX_API_BASE=http://localhost:8000/v1   # point at your server
export MOSAICX_API_KEY=dummy                       # or your real key for cloud APIs

SSH tunneling, vLLM-MLX setup, batch tuning, and benchmarking: docs/configuration.md
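
The three variables follow the same pattern for every backend in the table, so they can be generated rather than typed. The sketch below is a convenience helper, not part of MOSAICX: the ports come from the table above, and the openai/ prefix and /v1 path simply mirror the export lines in this README. The table says Ollama needs no configuration, so its entry is included only for completeness.

```python
import shlex

# Default ports copied from the backend table above.
DEFAULT_PORTS = {
    "ollama": 11434,
    "llama.cpp": 8080,
    "vllm": 8000,
    "sglang": 30000,
    "vllm-mlx": 8000,
}


def backend_env(backend: str, model: str, host: str = "localhost",
                api_key: str = "dummy") -> dict[str, str]:
    """Build the three MOSAICX_* variables for a local backend."""
    try:
        port = DEFAULT_PORTS[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return {
        "MOSAICX_LM": f"openai/{model}",
        "MOSAICX_API_BASE": f"http://{host}:{port}/v1",
        "MOSAICX_API_KEY": api_key,
    }


def as_exports(env: dict[str, str]) -> str:
    """Render the variables as shell export lines."""
    return "\n".join(
        f"export {key}={shlex.quote(value)}" for key, value in env.items()
    )


# Example: print(as_exports(backend_env("sglang", "gpt-oss:120b")))
```

If a server runs on a non-default port, edit the generated MOSAICX_API_BASE line before using it.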

OCR Engines

Engine	Approach	Best for
Surya	Layout detection + recognition	Clean printed text, fast
Chandra	Vision-Language Model (Qwen3-VL 9B)	Handwriting, complex layouts, tables

By default, both engines run in parallel, each page is scored, and the better result is kept. To force a single engine, set MOSAICX_OCR_ENGINE=surya or MOSAICX_OCR_ENGINE=chandra.
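
The run-in-parallel, score, and pick behaviour can be sketched as follows. This is not MOSAICX's implementation: the engines are stand-in callables, and score_text is a toy heuristic invented for the example (the README does not say how pages are actually scored).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

OcrEngine = Callable[[bytes], str]  # page image bytes -> recognized text


def score_text(text: str) -> float:
    """Toy quality score: share of alphanumeric or whitespace characters."""
    if not text:
        return 0.0
    clean = sum(ch.isalnum() or ch.isspace() for ch in text)
    return clean / len(text)


def best_ocr_per_page(pages: list[bytes],
                      engines: dict[str, OcrEngine]) -> list[tuple[str, str]]:
    """Run every engine on every page in parallel; keep the top-scoring text."""
    with ThreadPoolExecutor() as pool:
        # One task per (page, engine) pair so the engines overlap in time.
        futures = [
            {name: pool.submit(engine, page) for name, engine in engines.items()}
            for page in pages
        ]
        results = []
        for per_page in futures:
            texts = {name: future.result() for name, future in per_page.items()}
            winner = max(texts, key=lambda name: score_text(texts[name]))
            results.append((winner, texts[winner]))
    return results
```

The function returns one (engine name, text) pair per page, so a single document can mix results from both engines.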

Configuration

# Essential vars -- point at your local server
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8   # model name
export MOSAICX_API_BASE=http://localhost:8000/v1                # server URL
export MOSAICX_API_KEY=dummy                                    # or real key for cloud

# View active config
mosaicx config show

Full variable reference, .env file setup, and backend scenarios: docs/configuration.md

Documentation

Guide	Description
Quickstart	Fast setup and first successful run in ~10 minutes
Getting Started	Install, first extraction, basics
Verify Guide	Truth/adjudication workflows for claims and extraction output
Query Guide	Grounded multi-turn querying with evidence and confidence
Troubleshooting	Debug slow query, wrong stats, fallback, and runtime issues
Production Checklist	Deploy with reproducibility, gating, and auditability controls
CLI Reference	Every command, every flag, examples
Pipelines	Pipeline inputs/outputs, JSONL formats
Schemas & Templates	Create and manage extraction schemas
Optimization	Improve accuracy with DSPy optimizers
Configuration	Env vars, backends, OCR, export formats
MCP Server	AI agent integration via MCP
Developer Guide	Custom pipelines, Python SDK
Architecture	System design, key decisions

Development

git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]"          # or: uv sync --group dev
pytest tests/ -q

See Developer Guide for custom pipelines and the Python SDK.

Citation

@software{mosaicx2025,
  title   = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author  = {Sundar, Lalith Kumar Shiyam and {DIGIT-X Lab}},
  year    = {2025},
  url     = {https://github.com/DIGIT-X-Lab/MOSAICX},
  doi     = {10.5281/zenodo.17601890}
}

License

Apache 2.0 -- see LICENSE.

Contact

Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues

