High-fidelity multi-engine OCR pipeline that converts challenging scanned documents into LLM-ready Markdown, deployed on Modal serverless GPU infrastructure.
Supported formats: PDF, PNG, JPG, JPEG, TIFF
Runs a five-engine ensemble — pdfplumber, PaddleOCR, Docling (always), TrOCR, Dots.ocr (on-demand) — with spatial alignment, majority voting, and constrained LLM reconciliation.
- Python 3.11+
- Modal account (free Starter plan: $30/month credits)
- Modal CLI:
pip install modal
git clone https://github.com/urnlahzer/omniparse.git
cd omniparse
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# Run tests (no GPU needed — all GPU code is mocked)
python -m pytest omniparse/tests/ -q

# Authenticate with Modal (opens browser)
modal token new
# Deploy the app — builds 7 container images
# First deploy takes 10-15 minutes (downloads ~5 GB of dependencies)
# Subsequent deploys take <5 seconds (images are cached)
modal deploy omniparse/deploy.py
# Download model weights into Modal Volume (~19 GB, runs on Modal servers)
# This takes 5-10 minutes on first run
modal run omniparse/setup_volume.py

The first modal deploy builds container images for all engines. This involves downloading large packages:
| Image | Key packages | Size |
|---|---|---|
| cpu | pdfplumber, FastAPI, Pillow | ~200 MB |
| paddleocr | PaddlePaddle-GPU 3.3.0, PaddleOCR 3.4+ | ~2 GB |
| docling | Docling, PyTorch, onnxruntime-gpu | ~3 GB |
| trocr | Transformers, PyTorch, PaddlePaddle | ~2.5 GB |
| dots | vLLM, PyTorch | ~2 GB |
| llm_arbiter | vLLM 0.17+, Qwen-VL-utils | ~2 GB |
PaddlePaddle-GPU downloads from paddlepaddle.org.cn (Baidu's official package index — the only distribution source for the GPU build). This is normal and expected.
After first build, all images are cached on Modal. Redeploys only upload changed source code (~seconds).
Document → Preprocess → ┌─ pdfplumber (CPU) ─┐
├─ PaddleOCR (A10G) ─┤
└─ Docling (L4) ─┘
│
Noise Filter ── removes line/page numbers, footer fragments
│
Docling Pre-Merge ── joins word-level fragments into lines
│
Alignment → Consensus → Markdown
↑ ↑
│ │
┌─ TrOCR (L4) ─┐ │ LLM Arbiter
└─ Dots.ocr (A10G) ─┘ │ (Qwen3-VL-8B)
(on-demand) │
│
Smart Dispatch ─────────┘
(PaddleOCR labels → specialist engines)
- pdfplumber (CPU): Embedded text extraction with character-level bounding boxes
- PaddleOCR PP-StructureV3 (A10G GPU): Layout classification + OCR text extraction
- Docling (L4 GPU): Hierarchical structure, table recovery, reading order
- TrOCR (L4 GPU): Handwriting recognition with DBNet line segmentation
- Dots.ocr (A10G GPU): Formula-to-LaTeX, chart-to-SVG
- Noise Filter: Removes formatting artifacts before cross-engine matching — court line numbers in the left margin, page numbers at page bottom, and footer fragments. Prevents layout noise from inflating single-engine counts.
- Docling Pre-Merge: Joins word-level Docling fragments that sit on the same text line into single regions. Docling produces word-level bounding boxes where PaddleOCR detects full lines; without merging, the granularity mismatch causes orphans during IoU alignment. Uses union-find for transitive merging with conservative thresholds (vertical overlap + horizontal gap < 5% of page width).
- IoU Alignment: Match bounding boxes across engines (six-stage pipeline, see Design Notes)
- Text Alignment: Needleman-Wunsch sequence alignment via sequence-align
- Consensus Entropy: Route low-CE regions to majority voting, high-CE regions to LLM arbitration
- LLM Arbitration: Qwen3-VL-8B-Instruct FP8 with 3-layer hallucination defense
- Markdown Compilation: GFM output with headers, tables, LaTeX, HITL flags
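The consensus-entropy routing above can be illustrated with a toy sketch. The formula (mean per-position Shannon entropy over aligned candidates) and the 0.4 cutoff are illustrative; the real computation lives in `consensus.py` and may differ:

```python
import math
from collections import Counter

def consensus_entropy(candidates):
    """Mean per-position Shannon entropy across aligned candidate strings.

    candidates: equal-length strings (post-alignment). 0.0 means all
    engines agree at every position; higher values mean more disagreement.
    """
    if not candidates:
        return 0.0
    n = len(candidates[0])
    total = 0.0
    for pos in range(n):
        counts = Counter(c[pos] for c in candidates)
        for k in counts.values():
            p = k / len(candidates)
            total -= p * math.log2(p)
    return total / n if n else 0.0

# Route: low CE -> majority vote, high CE -> LLM arbitration
aligned = ["INVOICE", "INV0ICE", "INVOICE"]  # engines disagree on O vs 0
ce = consensus_entropy(aligned)
route = "llm_arbiter" if ce > 0.4 else "majority_vote"
```

Here only one of seven positions is contested, so the mean entropy stays low and the region resolves by majority vote without an LLM call.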
import modal
Pipeline = modal.Cls.from_name("omniparse", "Pipeline")
pipeline = Pipeline()
with open("document.pdf", "rb") as f:
result = pipeline.process.remote(f.read(), "document.pdf")
print(result["markdown"]) # GFM Markdown output
print(result["processing_log"]) # Full audit trail
print(result["metadata"]) # Page count, duration, cost estimate

Pass budget_usd to cap per-job spending. Processing stops after the page that exceeds the budget and returns partial results:
result = pipeline.process.remote(file_bytes, "large.pdf", budget_usd=0.50)
print(result["metadata"]["budget_exceeded"]) # True if capped early

For fire-and-forget processing, spawn the call and collect the result later:

call = pipeline.process.spawn(file_bytes, "document.pdf")
# ... do other work ...
result = call.get(timeout=120)

# Synchronous — process and return result
curl -X POST https://<your-namespace>--omniparse-parse-document.modal.run \
-F "file=@document.pdf"

For long-running documents, use the async job endpoints. All endpoints require an X-API-Key header.
# Submit a document (returns job_id immediately)
curl -X POST https://<your-namespace>--omniparse-api.modal.run/submit \
-H "X-API-Key: <your-key>" \
-F "file=@document.pdf"
# Optional form fields:
# callback_url — webhook URL for completion notification
# ce_threshold — consensus entropy cutoff (default 0.4)
# confidence_floor — minimum confidence to accept (default 0.0)
# Poll job status
curl https://<your-namespace>--omniparse-api.modal.run/status/{job_id} \
-H "X-API-Key: <your-key>"
# Retrieve result when complete
curl https://<your-namespace>--omniparse-api.modal.run/result/{job_id} \
-H "X-API-Key: <your-key>"

When callback_url is provided, a webhook is fired on job completion with an HMAC-SHA256 signature (signed with OMNIPARSE_WEBHOOK_SECRET) in the X-Webhook-Signature header.
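A receiver can verify the webhook along these lines, assuming the header carries a hex HMAC-SHA256 digest of the raw request body (check `webhooks.py` for the exact signing scheme):

```python
import hashlib
import hmac

def verify_webhook(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

# Example: sign and verify a payload with a shared secret
body = b'{"job_id": "abc123", "status": "complete"}'
sig = hmac.new(b"default-secret", body, hashlib.sha256).hexdigest()
ok = verify_webhook(body, sig, "default-secret")       # True
bad = verify_webhook(body, sig, "wrong-secret")        # False
```

Always verify against the raw request bytes, not a re-serialized JSON body, since serialization differences change the digest.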
After deployment, a web UI for reviewing flagged OCR regions is available at:
https://<your-namespace>--omniparse-hitl-web-app.modal.run
GFM Markdown with YAML frontmatter:
---
title: document.pdf
pages: 10
processed: "2026-03-20T15:30:00Z"
hitl_flags: 2
---
# Section Header
Regular paragraph text extracted from the document.
| Column 1 | Column 2 |
|----------|----------|
| Data | Values |
$$E = mc^2$$
*handwritten annotation* <!-- handwritten -->
<!-- REVIEW NEEDED: [100,200,500,300] page=3 confidence=0.35 -->

omniparse/
├── app.py # Modal App, container images, HITL/cron
├── deploy.py # Deploy entrypoint (imports all modules)
├── setup_volume.py # One-time model weight download
├── pipeline.py # End-to-end orchestrator
├── preprocess.py # DPI normalization, deskew, PDF chunking
├── alignment.py # IoU matching, Needleman-Wunsch alignment
├── consensus.py # CE computation, voting, arbitration wiring
├── llm_arbiter.py # Qwen3-VL-8B with hallucination safeguards
├── markdown_compiler.py # GFM Markdown output
├── dispatch.py # Smart routing to specialist engines
├── normalization.py # Bounding box coordinate conversion
├── quality_check.py # pdfplumber ground truth detection
├── noise_filter.py # Pre-alignment removal of line/page numbers, footers
├── docling_premerge.py # Join word-level Docling fragments into lines
├── wbf.py # Weighted Boxes Fusion (vendored, MIT)
├── type_compatibility.py # Type compatibility utilities
├── observability.py # Processing log builder, cost estimation
├── engines/
│ ├── pdfplumber_engine.py
│ ├── paddleocr_engine.py
│ ├── docling_engine.py
│ ├── trocr_engine.py
│ └── dots_engine.py
├── models/
│ ├── region.py # Region, EngineOutput schemas
│ ├── consensus.py # AlignedRegion, ConsensusResult
│ ├── pipeline.py # PipelineResult, ProcessingLog
│ └── page.py # PagePayload schema
├── api/
│ ├── router.py # Async job API (/submit, /status, /result)
│ ├── auth.py # API key validation
│ ├── webhooks.py # HMAC-SHA256 signed webhook delivery
│ └── cost_guard.py # Per-job budget enforcement
├── hitl/
│ ├── router.py # Review UI endpoints (FastAPI)
│ ├── models.py # HITL data schemas
│ └── templates/ # Jinja2 templates (list, detail views)
├── regression/ # Benchmark runner + baselines
└── tests/ # 498 tests across 31 files, all mocked (no GPU needed)
All tests run locally without GPU or Modal credentials:
# Quick run
python -m pytest omniparse/tests/ -q
# Verbose with coverage
python -m pytest omniparse/tests/ -v --tb=long

GPU engines are tested via mock objects. Modal configuration is verified via source inspection.
Copy example.env to .env and configure:
| Variable | Required | Description |
|---|---|---|
| OMNIPARSE_SAMPLES_DIR | No | Path to sample PDFs for integration tests |
| OMNIPARSE_WEBHOOK_SECRET | Yes (prod) | Shared secret for HMAC-SHA256 webhook signing (defaults to "default-secret" — set a real value in production) |
The async API authenticates requests via X-API-Key header. Keys are stored in a Modal Dict (omniparse-api-keys). Each key tracks cumulative spend against an optional budget_usd cap — requests that would exceed the budget are rejected with HTTP 402.
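The budget check behaves roughly like this sketch (names are hypothetical; the real enforcement lives in `cost_guard.py`, where the exception maps to HTTP 402 at the API layer):

```python
class BudgetExceeded(Exception):
    """Raised when a job would push an API key past its cap (HTTP 402)."""

def check_budget(key_record: dict, estimated_cost: float) -> None:
    """Reject a job if it would exceed the key's optional budget cap.

    key_record mirrors what a Modal Dict entry might hold, e.g.
    {"spend_usd": 1.50, "budget_usd": 2.00}; budget_usd may be None
    for unlimited keys.
    """
    budget = key_record.get("budget_usd")
    spent = key_record.get("spend_usd", 0.0)
    if budget is not None and spent + estimated_cost > budget:
        raise BudgetExceeded(
            f"spend {spent:.2f} + job {estimated_cost:.2f} exceeds cap {budget:.2f}"
        )

rec = {"spend_usd": 1.50, "budget_usd": 2.00}
check_budget(rec, 0.25)   # within budget, no error
check_budget({"spend_usd": 99.0, "budget_usd": None}, 1.0)  # unlimited key
```

Checking the estimate before dispatch (rather than after) is what lets the API reject a job up front instead of billing a partial run.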
On Modal Starter plan ($30/month free credits):
| Resource | Rate | Typical usage |
|---|---|---|
| A10G GPU | ~$1.10/hr | PaddleOCR, Dots.ocr, LLM arbiter |
| L4 GPU | ~$0.59/hr | Docling, TrOCR |
| CPU | ~$0.16/hr | pdfplumber, preprocessing, compilation |
Estimated amortized cost: <$0.008 per page at scale.
All engines scale to zero when idle (min_containers=0). Use the warm_engines() function before batch processing to pre-warm containers.
This section documents hard-won lessons from building the multi-engine ensemble. These are things that aren't obvious from the code alone and would otherwise be lost.
The single biggest obstacle to cross-engine alignment was bounding box coordinate systems. Each engine outputs coordinates differently:
- pdfplumber: PDF points (72 DPI), top-left origin
- PaddleOCR: Pixel coordinates (300 DPI), top-left origin
- Docling: PDF points (72 DPI), bottom-left origin (Y-axis flipped)
Without normalization, IoU between engines was essentially zero — pdfplumber boxes were ~4x smaller than PaddleOCR boxes in the same space. Multi-engine voting was 0% for the first deployed version.
Solution: Normalize all coordinates to [0,1] page-fraction space before any spatial matching. This makes alignment resolution-independent and engine-agnostic. The normalization module (normalization.py) converts pixel coordinates to unit space; alignment reads bounding_box_norm when available and falls back to raw bounding_box.
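A minimal sketch of the idea (the function name and origin handling here are illustrative, not the actual `normalization.py` API):

```python
def normalize_bbox(bbox, page_w, page_h, origin="top-left"):
    """Convert an engine bbox (x0, y0, x1, y1) to [0, 1] page fractions.

    origin="bottom-left" handles a flipped Y axis (as in Docling).
    pdfplumber points (72 DPI) and PaddleOCR pixels (300 DPI) both
    normalize out, as long as page_w/page_h are in the same units
    as the bbox itself.
    """
    x0, y0, x1, y1 = bbox
    if origin == "bottom-left":
        y0, y1 = page_h - y1, page_h - y0  # flip to top-left convention
    return (x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h)

# The same region seen by pdfplumber (points, 612x792 page) and by
# PaddleOCR (pixels at 300 DPI, 2550x3300) lands in identical unit space:
a = normalize_bbox((61.2, 79.2, 306.0, 158.4), 612, 792)
b = normalize_bbox((255, 330, 1275, 660), 2550, 3300)
```

After this step, IoU between engines is meaningful regardless of each engine's native resolution.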
Lesson: If you're combining spatial data from multiple sources, normalize to a unit coordinate space immediately. Don't try to convert between source-specific systems.
The alignment pipeline went through several iterations. A single IoU threshold doesn't work because engines disagree on bounding box boundaries in different ways. The final six-stage pipeline exists because each stage catches a specific failure mode the previous stages miss:
1. IoU matching (threshold >0.5): Catches well-aligned boxes. The threshold started at 0.85 (from object detection literature) but real cross-engine variance is much higher than same-model variance — 0.5 was the practical sweet spot. Lowering to 0.3 caused false matches.
2. Center-distance rescue (center distance ≤0.05, IoU >0.05): PaddleOCR and Docling often detect the same region with a 40-100px Y-axis offset on scanned documents. For small regions (~60px tall), IoU drops below 0.5 despite obvious visual correspondence. Center-distance catches these — but proximity alone caused false matches on stacked text lines (this was reverted once before adding the IoU floor requirement).
3. Containment ratio (threshold ≥0.6): Handles cross-granularity differences where one engine detects a paragraph as one box and another splits it into lines. The threshold started at 0.7 but a real document pair had CR=0.6999, so 0.6 catches these near-misses. Many-to-one merging concatenates contained regions in reading order.
4. Weighted Boxes Fusion: Groups partially overlapping boxes using weighted averaging of coordinates. Vendored from ensemble-boxes with added provenance tracking so vote counts map back to source engines.
5. Agglomerative clustering (scipy complete-linkage, distance threshold 0.92): Last-resort grouping for orphans with marginal overlap (IoU ~0.08-0.5). Uses 1 - IoU as the distance metric.
6. Orphan fallback: Anything still unmatched becomes a single-engine region.
Lesson: No single spatial matching technique works across all the ways engines disagree. A layered approach with a "consumed set" (preventing double-matching) is more robust than tuning a single threshold.
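The layered structure with a consumed set can be sketched generically; here simple predicates stand in for the six stages (hypothetical code, not the actual `alignment.py`):

```python
def layered_match(regions_a, regions_b, stages):
    """Run matching stages in priority order with a consumed set.

    stages: list of predicate(a, b) -> bool, ordered from strictest
    (e.g. IoU > 0.5) to loosest (e.g. clustering). Each region can be
    matched at most once, so loose stages never steal a strict match.
    """
    matches, consumed_a, consumed_b = [], set(), set()
    for stage_idx, accept in enumerate(stages):
        for i, a in enumerate(regions_a):
            if i in consumed_a:
                continue
            for j, b in enumerate(regions_b):
                if j in consumed_b:
                    continue
                if accept(a, b):
                    matches.append((i, j, stage_idx))
                    consumed_a.add(i)
                    consumed_b.add(j)
                    break  # region i is consumed, move on
    orphans = [i for i in range(len(regions_a)) if i not in consumed_a]
    return matches, orphans

# Toy example with scalar "regions": exact match first, then near-match
stages = [lambda a, b: a == b, lambda a, b: abs(a - b) <= 1]
matches, orphans = layered_match([1, 5], [1, 4], stages)
```

The `stage_idx` recorded with each match preserves provenance, so downstream consensus can weight strict matches differently from rescued ones.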
NW gap character collision: The Needleman-Wunsch text alignment used _ as the gap character. This crashed on OCR text containing underscores (which is common in legal documents). Changed to null byte \x00.
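A toy illustration of why the gap sentinel matters (simplified; the real alignment goes through the sequence-align library):

```python
GAP = "\x00"  # sentinel that cannot appear in OCR text, unlike "_"

def collapse(aligned_a: str, aligned_b: str) -> str:
    """Merge two gap-padded aligned strings, preferring non-gap chars."""
    return "".join(a if a != GAP else b for a, b in zip(aligned_a, aligned_b))

# With "_" as the gap char, a literal underscore in OCR text (common in
# legal form fields) would be indistinguishable from an alignment gap
# and get overwritten; "\x00" cannot collide with real document text.
merged = collapse("SIGN\x00HERE", "SIGNAHERE")
```

Any sentinel works as long as it is guaranteed absent from the input alphabet; the null byte is a safe choice for text extracted from documents.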
Pairwise alignment length assumption: With 3 engines, NW alignment is computed pairwise (A-B, A-C, B-C). Each pair can produce a different-length aligned output. Both consensus entropy and majority voting initially assumed equal lengths, causing IndexError on real documents. Required explicit padding/truncation handling.
Docling word-level granularity: Docling produces word-level bounding boxes where PaddleOCR detects full lines. Without pre-merging, Docling fragments created massive orphan rates during IoU alignment. The conservative pre-merge step (union-find with vertical overlap + horizontal gap <5% page width) was required to make Docling regions usable in the ensemble.
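The conservative pre-merge can be sketched with union-find (illustrative names and a simplified gap rule; see `docling_premerge.py` for the real thresholds):

```python
def premerge_lines(boxes, page_width, max_gap_frac=0.05):
    """Group word-level boxes that share a text line via union-find.

    boxes: list of (x0, y0, x1, y1) in page coordinates. Two boxes
    merge if they overlap vertically and their horizontal gap is under
    max_gap_frac of the page width; union-find makes merging transitive,
    so a chain of words joins into one line region.
    """
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    max_gap = max_gap_frac * page_width
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes[i + 1:], start=i + 1):
            v_overlap = min(a[3], b[3]) - max(a[1], b[1])
            h_gap = max(a[0], b[0]) - min(a[2], b[2])
            if v_overlap > 0 and h_gap < max_gap:
                union(i, j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(boxes[i])
    # Collapse each group into a single enclosing box
    return [
        (min(b[0] for b in g), min(b[1] for b in g),
         max(b[2] for b in g), max(b[3] for b in g))
        for g in groups.values()
    ]

# Three word boxes on one line merge; a box on another line stays separate
merged = premerge_lines(
    [(10, 100, 50, 120), (55, 102, 90, 118), (95, 101, 140, 119),
     (10, 200, 60, 220)],
    page_width=600,
)
```

The O(n²) pair scan is fine at per-page region counts; union-find keeps the transitive closure cheap.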
Modal serialization quirk: Docling results were silently discarded for weeks because Modal serialized dictionary keys as integers but the pipeline looked them up as strings. No error was raised — just empty Docling output on every page.
Docling on image-only PDFs: Docling's get_bitmap_rects crashes on pages that are pure images (common in scanned documents). Fixed by enabling force_full_page_ocr=True, which also makes Docling an independent OCR engine rather than relying on embedded text.
PaddleOCR dependency chain: PP-StructureV3 requires paddlex[ocr] extra, not just paddleocr. Installing only paddleocr pulls paddlex without OCR sub-dependencies, causing silent failures. Additionally, paddleocr pulls opencv-contrib-python (non-headless) which requires libGL — solved by force-reinstalling opencv-python-headless after PaddleOCR.
PaddleOCR startup check: PaddleOCR/PaddleX runs a connectivity check against Chinese servers on startup. This adds ~10s to cold starts in non-China regions. Disabled via PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK=1.
Scale-to-zero everything (min_containers=0): All GPU engines scale to zero when idle. This means zero cost when not processing but ~30-60s cold start on first request. The warm_engines() function pre-spawns containers before batch jobs. An early version kept PaddleOCR at min_containers=1 for faster response, but this costs $1.10/hr idle.
Bake models into container images: First versions downloaded models from HuggingFace on every cold start (~60s overhead). Moving model downloads into the image build step (via huggingface_hub.snapshot_download) eliminates this. Models refresh automatically on redeploy; bump _PADDLE_MODEL_REV or _DOCLING_MODEL_REV to force a cache bust.
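The pattern, sketched against Modal's image API (simplified; the real definitions with revision pins live in `app.py`, and the model ID here is only an example):

```python
import modal

def download_models():
    # Runs at image BUILD time, so the weights are baked into an image
    # layer and cold starts skip the HuggingFace download entirely.
    from huggingface_hub import snapshot_download
    snapshot_download("microsoft/trocr-large-handwritten")

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch", "transformers", "huggingface_hub")
    .run_function(download_models)  # cached until this function changes
)
```

Because Modal caches image layers by content, bumping a revision constant inside `download_models` (as the `_PADDLE_MODEL_REV` trick does) is enough to invalidate the cache and refresh the weights.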
Per-engine container images: Each engine gets its own Docker image with exact dependencies. PaddleOCR needs PaddlePaddle-GPU + libGL; Docling needs PyTorch + onnxruntime-gpu; Dots.ocr needs vLLM. Sharing images would create version conflicts and bloated containers.
Dependency injection for testing: The pipeline accepts an engines= parameter for mock injection. All 498 tests run locally without GPU or Modal credentials. GPU engines are tested via mock objects that return realistic EngineOutput dicts. Modal configuration is verified via source inspection (reading app.py as Python AST).
Three-layer LLM hallucination defense: The LLM arbiter (Qwen3-VL-8B) resolves high-entropy disagreements between engines, but LLMs can hallucinate content that looks plausible but was never in the document. Three independent checks catch this: (1) edit distance — reject if output is far from all engine candidates, (2) consecutive insertion detection — reject if >5 consecutive characters appear that weren't in any candidate, (3) legal-field regex — flag if novel dollar amounts, dates, citations, or statute numbers appear. If the LLM output is rejected, the region falls back to majority vote and gets flagged for human review.
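A simplified sketch of the first two layers (a character-pool scan stands in for the real insertion check, and the legal-field regex layer is omitted; hypothetical code, not the actual `llm_arbiter.py`):

```python
import difflib

def vet_llm_output(llm_text, candidates, min_ratio=0.5, max_run=5):
    """Screen an arbiter answer against the engine candidates.

    Layer 1: reject if the output is far from every candidate.
    Layer 2: reject if more than max_run consecutive characters appear
    that occur in no candidate (a likely hallucinated insertion).
    Returns True if the output passes, False to fall back to majority vote.
    """
    # Layer 1: similarity to the closest engine candidate
    best = max(difflib.SequenceMatcher(None, llm_text, c).ratio()
               for c in candidates)
    if best < min_ratio:
        return False
    # Layer 2: longest run of characters absent from all candidates
    pool = set("".join(candidates))
    run = longest = 0
    for ch in llm_text:
        run = run + 1 if ch not in pool else 0
        longest = max(longest, run)
    return longest <= max_run
```

A rejected output is never silently dropped in the pipeline described above: the region reverts to the majority vote and is flagged for human review.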
Pre-alignment noise filtering: Court documents have line numbers in the left margin, page numbers at the bottom, and footer fragments. These are real text that engines correctly detect, but they inflate single-engine orphan counts and confuse spatial matching. Filtering them before alignment (not after) was critical — otherwise they'd create spurious cross-engine "disagreements."
Ground truth detection: When pdfplumber and PaddleOCR agree >90% (measured by character-level Levenshtein), the PDF has clean embedded text and pdfplumber is treated as ground truth with 2x voting weight. This avoids unnecessary LLM arbitration on born-digital documents.
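A sketch of the agreement test (illustrative; the real check in `quality_check.py` may normalize whitespace or weight regions differently):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row DP edit distance at character level."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_born_digital(pdfplumber_text, paddle_text, threshold=0.9):
    """Treat pdfplumber as ground truth when the engines agree > threshold."""
    dist = levenshtein(pdfplumber_text, paddle_text)
    agreement = 1 - dist / max(len(pdfplumber_text), len(paddle_text), 1)
    return agreement > threshold
```

On a clean born-digital page the agreement is near 1.0 and LLM arbitration is skipped entirely; heavy OCR noise pushes it below the cutoff and the full ensemble runs.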
Measured on synthetic test scenarios as of the latest development phase:
| Scenario | Multi-engine resolution rate | Notes |
|---|---|---|
| Born-digital legal | 100% | pdfplumber ground truth, no arbitration needed |
| Type disagreement | 100% | Cross-type compatibility groups resolve header/printed_text |
| Scanned legal | ~62% | Improved from 28.4% through alignment layering |
| Degraded scanned | ~29% | Many orphans due to severe degradation |
| Mixed (printed + handwriting) | varies | Specialist dispatch to TrOCR/Dots.ocr |
Target accuracy thresholds:
- CER <1% on clean digital legal
- Table TEDS >95% on degraded scanned
- CER <5% on handwriting samples
- LLM invocation rate <5% on born-digital, <15% on scanned
Apache 2.0