- 🚧 THIS IS A WORK IN PROGRESS! More will be added soon!
- Feel free to contribute by submitting a pull request 🙏
- Cells marked with ✅ or ❌ have been independently tested. Blank cells indicate that the feature has not yet been independently tested.
- See the `results` folder for the outputs from each model.
The parsers below usually output raw text or Markdown.

| Models | Source | Output | Needs prompt? | Table | Equation | Figure | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|---|---|
| PyMuPDF | | Raw text | N | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| PDFPlumber | | Raw text | N | ✅ (separate from text) | ❌ | ❌ | ❌ | ❌ | ❌ |


| Models | Source | Output | Needs prompt? | Table | Equation | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|---|
| Marker | | Markdown | N | ✅ (markdown) | ✅ | ✅ | ✅ | ❌ |
| MonkeyOCR | | Markdown | Y | ✅ (html) | ✅ | ✅ | ✅ | ✅ |
| Nougat | | Markdown | N | ❌ | ✅ | ✅ | ✅ | ❌ |
| MinerU | | Markdown | N | ✅ (html) | ✅ | ❌ | ✅ | ❌ |
| Llamaparse (balanced mode) | - | Markdown | Y | ✅ (markdown) | ❌ | ❌ | ✅ | ❌ |
| Llamaparse (premium mode) | - | Markdown | Y | ✅ (markdown) | ❌ | ❌ | ✅ | ❌ |
| Docling | | Markdown | N | ✅ (markdown) | ❌ | ❌ | ✅ | ✅ |
| RolmOCR | | Markdown | Y | ✅ (markdown) | ✅ | ✅ | ✅ | † |
| olmOCR | | Markdown | Y | ✅ (markdown) | ✅ | ✅ | ✅ | † |
| Unstructured | | Raw text | N | ❌ | ❌ | ❌ | ❌ | ✅ |
| Pytesseract | | Raw text | N | ❌ | ❌ | ❌ | ✅ | ✅ |
| MarkItDown | | Markdown | N | ❌ | ❌ | ❌ | ✅ | ✅ |
| Amazon textract | - | |||||||
| Azure AI Document Intelligence | - | |||||||
| Google Cloud OCR | - | |||||||
| Mathpix | - | |||||||
| MistralOCR | - | |||||||
| Upstage | - | |||||||
| OmniAI | - | |||||||
| ChatDoc PDF parser | - | |||||||
| Reducto | - | |||||||
| OCRFlux | ||||||||
| Nanonets | ||||||||
| PaddleOCR | ||||||||
| ClovaOCR | - | |||||||
| ParseExtract | - | |||||||
| Tensorlake | - | |||||||
| Vectorize | - | |||||||
| MassivePix | - | |||||||
| Dolphin | ||||||||
| GOT | ||||||||
| Manga OCR | ||||||||
| EasyOCR | ||||||||
| PDFeditify | - | |||||||

† Process took too long.

The parsers below usually output JSON containing bounding box coordinates, content (as raw text or Markdown), and sometimes an element type (header, figure, paragraph, etc.).

🚧 WORK IN PROGRESS

| Models | Source | Output | Table | Equation | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|
| Chunkr | |||||||
| GroundX | - | ||||||
| ChatDOC | - | ||||||
| Unstract |
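
As a rough illustration of the JSON shape described before this table, the sketch below parses a made-up layout-aware output and orders its blocks from top to bottom. The field names (`bbox`, `content`, `type`), the coordinate convention, and the sample values are all assumptions for illustration; they are not the schema of any tool listed here, and each parser defines its own format.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sample output. Real tools use their own field names and units.
SAMPLE_OUTPUT = """
[
  {"bbox": [72, 310, 540, 420], "content": "Body paragraph text.", "type": "paragraph"},
  {"bbox": [72, 60, 540, 95], "content": "# Document title", "type": "header"},
  {"bbox": [72, 120, 540, 290], "content": "Figure 1: an example plot"}
]
"""


@dataclass
class Block:
    bbox: List[float]           # assumed [x0, y0, x1, y1] with the origin at the top-left
    content: str                # raw text or Markdown
    type: Optional[str] = None  # optional, since some outputs omit it


def load_blocks(raw: str) -> List[Block]:
    """Parse a JSON array of blocks into Block objects."""
    items = json.loads(raw)
    return [
        Block(bbox=item["bbox"], content=item["content"], type=item.get("type"))
        for item in items
    ]


def reading_order(blocks: List[Block]) -> List[Block]:
    """Sort by top edge, then left edge (naive, single-column pages only)."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


blocks = reading_order(load_blocks(SAMPLE_OUTPUT))
for block in blocks:
    print(f"{block.type or 'unknown':<10} {block.content}")
```

Sorting by the top-left corner is only reasonable for single-column pages. Two-column and multi-column layouts, which the last two columns of the table track, would need some form of column detection before ordering.
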

If you would like to contribute, please read CONTRIBUTING.md before opening a pull request. Thank you!

