PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
Updated Apr 16, 2026 - Java
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.
Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs.
Use TradeRepublic in the terminal and mass-download all documents.
JavaScript bindings for MuPDF
Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.
Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.
Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.
Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.
Translate many large PDF Reports for free using Python.
A professional CLI workflow for PhD students to extract, analyze, and visualize academic papers as structured Markdown and Obsidian Canvas.
Web content extraction engine backed by Qt WebEngine.
Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.
Find what gets buried in the data room. Open-source integrated M&A due diligence — legal, financial, commercial, and technical analysis across every contract, cross-referenced with exact citations.
🚀 Simplify your research workflow with Claude Scholar, the complete configuration for Claude Code in data science, AI, and academic writing.
GDocz by Gramosoft is an open-source Intelligent Document Processing platform that turns raw PDFs and images into clean, structured JSON — powered by multi-engine OCR and AI-driven schema extraction.
Extract presentation slides from videos with accurate timestamps
This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.