Building RAG or AI knowledge systems?
PDFStract is a unified PDF extraction and document ingestion layer for RAG (Retrieval-Augmented Generation) and LLM pipelines.
It standardizes PDF extraction, OCR handling, and text chunking across multiple libraries — so you can build reliable AI knowledge systems without fighting document parsing issues.
No single PDF extractor works best for every document. PDFStract lets you switch, compare, and automate extraction strategies through a single interface.
Modern RAG and LLM systems depend on clean document ingestion.
But PDF extraction is fragmented:
- Some libraries work better for structured reports
- Others perform better on scanned or OCR-heavy documents
- Output formats vary widely
- Chunking strategies significantly impact retrieval performance
Teams often waste hours testing combinations manually.
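The chunking point above matters more than it first appears: retrieval quality depends heavily on how text is split. As a rough, PDFStract-independent illustration, a fixed-size word-window chunker with overlap (a crude stand-in for real token-based chunking, which counts subword tokens rather than whitespace words) looks like this:

```python
def chunk_by_words(text, chunk_size=128, overlap=16):
    """Split text into fixed-size word windows with overlap.

    A simplified stand-in for token chunking: real tokenizers
    (e.g. tiktoken) count subword tokens, not whitespace words.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        # Each window re-reads the last `overlap` words of the previous one
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 300 words -> 3 overlapping chunks
chunks = chunk_by_words("word " * 300, chunk_size=128, overlap=16)
```

The overlap keeps sentences that straddle a boundary retrievable from both sides, which is one reason chunker choice moves retrieval metrics.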
PDFStract provides:
- A unified abstraction over multiple PDF extractors
- Standardized output formats (markdown, json, text)
- Built-in chunking strategies for RAG pipelines
- Easy benchmarking and comparison between libraries
It becomes the standardized ingestion layer of your AI data pipeline.
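The benchmarking idea can be pictured as a small harness: run each extractor on the same document and record timing and output size. This sketch uses stand-in extractor callables, not PDFStract's real comparison internals:

```python
import time

def benchmark(pdf_path, extractors):
    """Run each (name, extract) pair on the same PDF and record
    wall-clock time and output length for side-by-side comparison."""
    results = {}
    for name, extract in extractors:
        start = time.perf_counter()
        text = extract(pdf_path)
        results[name] = {
            "seconds": time.perf_counter() - start,
            "chars": len(text),
        }
    return results

# Stand-in extractors for illustration only
results = benchmark("sample.pdf", [
    ("fast_stub", lambda p: "short output"),
    ("thorough_stub", lambda p: "much longer markdown output"),
])
```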
Get started in two lines
```python
from pdfstract import PDFStract, convert_pdf, chunk_text

text = convert_pdf("report.pdf", library="auto")
chunks = chunk_text(text, chunker="semantic", chunk_size=512)
```

Instead of committing to a single PDF extraction library, PDFStract lets you:
- Swap extractors with one parameter
- Benchmark multiple libraries on the same document
- Automate library selection
- Standardize downstream processing
- Keep your ingestion layer future-proof
As new extraction libraries emerge, PDFStract lets you integrate them without rewriting your pipeline.
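The "automate library selection" idea can be pictured as a fallback chain: try extractors in preference order and keep the first usable result. This is a sketch of the concept using stand-in extractor functions, not PDFStract's actual `library="auto"` implementation, which I have not inspected:

```python
def extract_with_fallback(pdf_path, extractors):
    """Try (name, extract) pairs in order; return the first
    non-empty result, raising only if every extractor fails."""
    errors = {}
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
            if text and text.strip():
                return name, text
        except Exception as exc:  # a real implementation would narrow this
            errors[name] = exc
    raise RuntimeError(f"all extractors failed: {errors}")

# Stand-in extractors for illustration
def flaky(path):
    raise ValueError("cannot parse")

def works(path):
    return "# Extracted markdown"

name, text = extract_with_fallback("report.pdf", [("flaky", flaky), ("works", works)])
```

Because callers only see the unified interface, swapping the preference order (or adding a brand-new extractor) never touches downstream code.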
PDFStract decouples your ingestion layer from any single extraction library.
Choose an install option based on which extraction libraries you need.
```shell
pip install pdfstract
pip install pdfstract[standard]
pip install pdfstract[advanced]
pip install pdfstract[all]
```

```shell
# List available libraries
pdfstract libs

# List available chunkers
pdfstract chunkers

# Convert a single PDF
pdfstract convert document.pdf --library pymupdf4llm --output result.md

# Convert and chunk in one command
pdfstract convert-chunk document.pdf --library pymupdf4llm --chunker semantic --output chunks.json

# Chunk an existing text file
pdfstract chunk document.md --chunker token --chunk-size 512 --output chunks.json

# Compare multiple libraries on one PDF
pdfstract compare sample.pdf -l pymupdf4llm -l marker -l docling --output ./comparison

# Batch convert 100+ PDFs in parallel
pdfstract batch ./documents --library pymupdf4llm --output ./converted --parallel 4

# Download models for a specific library
pdfstract download marker
```

You don't need to use the CLI! PDFStract can be easily integrated into your Python applications as a library.
```python
from pdfstract import convert_pdf

# Quick conversion with default settings
result = convert_pdf('sample.pdf', library='marker')
print(result)  # Markdown content
```

```python
from pdfstract import PDFStract

pdfstract = PDFStract()

# Get list of available libraries
available = pdfstract.list_available_libraries()
print(available)  # ['pymupdf4llm', 'marker', 'docling', ...]
```

```python
from pdfstract import PDFStract

pdfstract = PDFStract()

# Convert with options
result = pdfstract.convert(
    pdf_path='document.pdf',
    library='marker',
    output_format='markdown'  # or 'json', 'text'
)
```

```python
from pdfstract import PDFStract

pdfstract = PDFStract()

# Convert all PDFs in a directory in parallel
results = pdfstract.batch_convert(
    pdf_directory='./pdfs',
    library='pymupdf4llm',
    output_format='markdown',
    parallel_workers=4
)
print(f"✓ Success: {results['success']}")
print(f"✗ Failed: {results['failed']}")
```

```python
import asyncio
from pdfstract import PDFStract

async def process_pdfs():
    pdfstract = PDFStract()
    result = await pdfstract.convert_async(
        'document.pdf',
        library='docling',
        output_format='json'
    )
    return result

# Use in FastAPI, asyncio, etc.
asyncio.run(process_pdfs())
```

```python
from pdfstract import PDFStract

pdfstract = PDFStract()

# 1. Extract PDF
text = pdfstract.convert('document.pdf', library='docling')

# 2. Chunk the text
chunks = pdfstract.chunk(
    text=text,
    chunker='semantic',  # or 'token', 'sentence', 'code', etc.
    chunk_size=512
)
print(f"Created {chunks['total_chunks']} chunks")

# 3. Process chunks for embedding/indexing
for chunk in chunks['chunks']:
    print(f"- {chunk['text'][:50]}... ({chunk['token_count']} tokens)")
```

```shell
# Clone the repository
git clone https://github.com/aksarav/pdfstract.git
cd pdfstract

# Download models and start services (first time)
make up

# Or step by step:
make models  # Download HuggingFace/MinerU models (~10GB)
make build   # Build Docker images
make up      # Start services
```

Typical use cases:
- RAG systems
- Document intelligence pipelines
- Knowledge base ingestion
- LLM fine-tuning datasets
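For the RAG use case, the chunks produced by the extract-and-chunk workflow typically feed an embedding and indexing step. Here is a minimal, self-contained sketch using a toy bag-of-words "embedding" purely for illustration; a real pipeline would use an actual embedding model and vector store:

```python
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts. Stands in for a real model."""
    return Counter(text.lower().split())

def score(query_vec, doc_vec):
    """Word-overlap score between two count vectors."""
    return sum(min(query_vec[w], doc_vec[w]) for w in query_vec)

# Index chunk texts (e.g. the output of PDFStract's chunker)
chunks = ["PDF extraction with OCR", "semantic chunking for retrieval"]
index = [(c, embed(c)) for c in chunks]

# Retrieve the best-matching chunk for a query
q = embed("how does chunking help retrieval")
best = max(index, key=lambda item: score(q, item[1]))[0]
```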
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
If you encounter issues or have questions, please create an issue.
Made with ❤️ for AI RAG pipelines



