pdf-extraction

Here are 191 public repositories matching this topic...

opendataloader-project / opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

Updated Apr 24, 2026
Java

kreuzberg-dev / kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

Updated Apr 24, 2026
Rust

firecrawl / pdf-inspector

Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.

nodejs python markdown rust pdf text-extraction pdf-parser pdf-extraction ocr-routing pdf-classification

Updated Apr 22, 2026
Rust

24eme / signaturepdf

Free, open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs.

php pdf js signature pdf-manipulation pdf-merge pdf-format pdf-rotate pdf-merger pdf-meta-editor pdf-tools pdf-signature pdf-compression pdf-editor pdf-sign pdf-extraction pdf-signer pdf-metadata pdf-compressor

Updated Apr 21, 2026
JavaScript

pytr-org / pytr

Use TradeRepublic in terminal and mass download all documents

portfolio finance terminal-app portfolio-performance pdf-extraction traderepublic-statements traderepublic

Updated Apr 22, 2026
Python

ArtifexSoftware / mupdf.js

JavaScript bindings for MuPDF

javascript pdf typescript wasm mupdf pdf-viewer pdf-extraction

Updated Apr 8, 2026

mateogon / pdf-narrator

Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

pdf text-to-speech audiobook tts epub low-resource pdf-extraction pdf-to-audiobook immersive-reading kokoro-tts audiobook-generator pdf-audiobook

Updated Feb 26, 2026
Python

aiptimizer / TurboOCR

Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.

ocr grpc nvidia text-recognition text-detection inference-server fp16 tensorrt rag fastapi pdf-extraction paddleocr easyocr document-ai document-parsing qwen-vl gpu-ocr

Updated Apr 24, 2026
C++

ExtractPDF4J / ExtractPDF4J

Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.

java cli ocr maven pdf-document pdf-extractor ocr-recognition document-processing pdf-processor pdf-document-processor pdf-extraction java17

Updated Mar 15, 2026
Java

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

NameetP / pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

python pdf ocr mcp self-healing structured-extraction rag pdf-to-json pdf-extraction ai-agent llm document-parsing pdf-to-markdown docling opendataloader

Updated Apr 16, 2026
Python

pcschreiber1 / PDF_Extraction-Translation

Translate many large PDF reports for free using Python.

python pdf-extraction pdf-translation

Updated Dec 31, 2022
Jupyter Notebook

heleninsights-dot / phd-deepread-workflow

A professional CLI workflow for PhD students to extract, analyze, and visualize academic papers as structured Markdown and Obsidian Canvas.

python pdf workflow research academic obsidian literature-review pdf-extraction

Updated Mar 6, 2026
Python

wszqkzqk / qt-web-extractor

Web content extraction engine backed by Qt WebEngine.

mcp chromium web-scraping qtwebengine content-extraction headless-browser pdf-extraction pyside6 open-webui mcp-server

Updated Apr 21, 2026
Python

aidalinfo / extract-kit

Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

pdf document-processing ai-sdk pdf-extraction vision-llm

Updated Sep 14, 2025
TypeScript

zoharbabin / due-diligence-agents

Find what gets buried in the data room. Open-source integrated M&A due diligence — legal, financial, commercial, and technical analysis across every contract, cross-referenced with exact citations.

Updated Apr 16, 2026
Python

jessevanwyk1 / claude-scholar

🚀 Simplify your research workflow with Claude Scholar, the complete configuration for Claude Code in data science, AI, and academic writing.

search mcp academic pubmed summarization research-tool reading-list arxiv ai-safety literature-review scientific-literature semantic-scholar pdf-extraction streamlit academic-papers academic-research research-tools mcp-server claude-code

Updated Apr 24, 2026
TeX

GramosoftAI / GdoczAI

GDocz by Gramosoft is an open-source Intelligent Document Processing platform that turns raw PDFs and images into clean, structured JSON — powered by multi-engine OCR and AI-driven schema extraction.

open-source ocr gemini data-extraction document-processing pdf-extraction document-ai intelligent-document-processing qwen enterprise-ai olmocr2 multi-model-ocr

Updated Mar 30, 2026
Python

MarkShawn2020 / video2ppt

Extract presentation slides from videos with accurate timestamps

python opencv video-processing cli-tool frame-extraction pdf-extraction video-to-slides presentation-extraction

Updated Apr 12, 2026
TypeScript

clark-labs-inc / pdfsink-rs

Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.

rust pdf text-extraction rust-library pdf-to-text rust-crate table-extraction pdf-parser document-processing layout-analysis pdf-to-json pdf-extraction pdfplumber document-ai clark-labs

Updated Apr 11, 2026
Rust
