| layout | title | parent | nav_order |
|---|---|---|---|
| default | Chapter 2: Document Processing | Quivr Tutorial | 2 |
Welcome to Chapter 2: Document Processing. In this part of Quivr Tutorial: Open-Source RAG Framework for Document Ingestion, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
In Chapter 1, you installed Quivr and uploaded your first document. But what actually happens once a file lands in the system? This chapter dives deep into the document processing pipeline -- the engine that transforms raw PDFs, HTML pages, and plain-text files into clean, structured chunks ready for embedding.
Understanding this pipeline is critical because the quality of your RAG responses depends directly on the quality of your ingested text. Garbage in, garbage out. By the end of this chapter you will know how to extract text from every supported format, clean and normalize it, split it into semantically meaningful chunks, and troubleshoot the most common ingestion problems.
flowchart TD
A[Raw File Upload] --> B{Format Detection}
B -->|PDF| C[PDF Extractor]
B -->|HTML| D[HTML Extractor]
B -->|Plain Text| E[Text Loader]
B -->|DOCX/PPTX| F[Office Extractor]
B -->|Image| G[OCR Engine]
B -->|CSV/XLSX| H[Tabular Parser]
C --> I[Raw Text Output]
D --> I
E --> I
F --> I
G --> I
H --> I
I --> J[Text Cleaning]
J --> K[Normalization]
K --> L[Chunking Engine]
L --> M[Chunk Metadata]
M --> N[Ready for Embedding]
classDef input fill:#e1f5fe,stroke:#01579b
classDef extractor fill:#f3e5f5,stroke:#4a148c
classDef processing fill:#fff3e0,stroke:#ef6c00
classDef output fill:#e8f5e8,stroke:#1b5e20
class A input
class B,C,D,E,F,G,H extractor
class I,J,K,L,M processing
class N output
| Format | Extensions | Extractor | OCR Support | Table Extraction | Metadata Preserved |
|---|---|---|---|---|---|
| PDF | .pdf | PyMuPDF / pdfplumber | Yes | Yes | Title, author, pages |
| HTML | .html, .htm | BeautifulSoup | No | Yes | Title, headings |
| Plain Text | .txt, .md, .rst | Built-in | No | No | Filename |
| Word | .docx | python-docx | No | Yes | Title, author, styles |
| PowerPoint | .pptx | python-pptx | No | Limited | Slide titles |
| Spreadsheet | .csv, .xlsx | pandas / openpyxl | No | Full | Sheet names, headers |
| Images | .png, .jpg | Tesseract OCR | Full | No | Dimensions, EXIF |
| Code | .py, .js, .ts | Built-in | No | No | Language, filename |
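
The routing step in the flowchart can be pictured as a lookup from file extension to extractor. The sketch below is a minimal illustration using only the standard library; the `EXTRACTOR_BY_EXTENSION` table, its labels, and `detect_extractor` are hypothetical names chosen for this example, not Quivr's actual implementation, and a production system would likely also inspect file contents rather than trust the extension alone.

```python
from pathlib import Path

# Hypothetical routing table: extension -> extractor label (names are assumptions).
EXTRACTOR_BY_EXTENSION = {
    ".pdf": "pdf",
    ".html": "html", ".htm": "html",
    ".txt": "text", ".md": "text", ".rst": "text",
    ".docx": "office", ".pptx": "office",
    ".csv": "tabular", ".xlsx": "tabular",
    ".png": "ocr", ".jpg": "ocr",
    ".py": "text", ".js": "text", ".ts": "text",
}


def detect_extractor(file_path: str) -> str:
    """Return the extractor label for a file, based only on its extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return EXTRACTOR_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}") from None


print(detect_extractor("docs/Annual-Report.PDF"))
print(detect_extractor("notes/readme.md"))
```

Failing loudly on unknown extensions keeps unsupported files from silently producing empty chunks further down the pipeline.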
PDFs are the most common document type in enterprise knowledge bases, and they are also the trickiest. Quivr uses a layered extraction strategy: it first attempts direct text extraction, then falls back to OCR for scanned pages.
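
The layered strategy can be sketched without any PDF library: use a page's text layer when it contains enough characters, and otherwise hand the page to OCR. Everything below is an assumption made for illustration (the function names, the `ocr_page` callback, and the stand-in `fake_ocr`); it shows the decision logic only, not how Quivr implements it.

```python
from typing import Callable, List, Tuple


def extract_pages(
    text_layers: List[str],
    ocr_page: Callable[[int], str],
    min_text_threshold: int = 50,
) -> List[Tuple[str, str]]:
    """Return (method, text) per page, running OCR only on sparse pages."""
    results = []
    for index, layer in enumerate(text_layers):
        if len(layer.strip()) >= min_text_threshold:
            results.append(("text", layer))          # text layer is good enough
        else:
            results.append(("ocr", ocr_page(index)))  # likely a scanned page
    return results


# Stand-in OCR function so the sketch runs without an OCR engine installed.
def fake_ocr(page_index: int) -> str:
    return f"[ocr text for page {page_index}]"


pages = ["A" * 120, "   ", "short"]
for method, text in extract_pages(pages, fake_ocr):
    print(method, len(text))
```

Deciding per page rather than per document matters for mixed PDFs, where a few scanned pages sit among pages with a usable text layer.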
from quivr.processing.extractors import PDFExtractor

extractor = PDFExtractor(
    ocr_enabled=True,
    ocr_language="eng",
    extract_tables=True,
    extract_images=False,
    page_range=None  # None means all pages
)

# Extract text from a PDF file
result = extractor.extract("docs/annual-report.pdf")

print(f"Pages processed: {result.page_count}")
print(f"Extraction method: {result.method}")  # 'text' or 'ocr'
print(f"Tables found: {len(result.tables)}")
print(f"Characters extracted: {len(result.text)}")

# Access per-page content
for page in result.pages:
    print(f"Page {page.number}: {len(page.text)} chars, "
          f"method={page.extraction_method}")

from quivr.processing.extractors import PDFExtractor
from quivr.processing.ocr import TesseractOCR

# Configure OCR with preprocessing
ocr_engine = TesseractOCR(
    language="eng+fra",  # Multi-language support
    dpi=300,             # Resolution for image conversion
    preprocessing={
        "deskew": True,    # Correct page rotation
        "denoise": True,   # Remove scanning noise
        "binarize": True,  # Convert to black/white
        "contrast": 1.5    # Boost contrast
    }
)

extractor = PDFExtractor(
    ocr_engine=ocr_engine,
    ocr_fallback=True,     # Use OCR only when the text layer is (nearly) empty
    min_text_threshold=50  # Pages yielding fewer chars than this trigger OCR
)

result = extractor.extract("docs/scanned-contract.pdf")

# Check quality metrics
for page in result.pages:
    print(f"Page {page.number}: confidence={page.ocr_confidence:.2f}")
    if page.ocr_confidence < 0.7:
        print(f" WARNING: Low OCR confidence on page {page.number}")

Web content requires careful extraction to separate meaningful text from navigation, footers, and boilerplate.
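
To make the idea concrete, here is a minimal sketch of boilerplate-aware HTML text extraction built on Python's standard `html.parser` module. It is not how Quivr or BeautifulSoup does it; the `MainTextParser` class, `SKIPPED_TAGS`, and `extract_main_text` are hypothetical names for this example. The parser keeps a depth counter so that any text nested inside a skipped element is ignored.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"nav", "footer", "header", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect text while ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <main><h1>Guide</h1><p>Install the package.</p></main>
  <footer>Copyright</footer>
  <script>track();</script>
</body></html>
"""
print(extract_main_text(sample))
```

This sketch assumes well-formed markup; an unclosed `<nav>` would suppress everything after it, which is one reason real extractors rely on a tolerant parser.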
from quivr.processing.extractors import HTMLExtractor

extractor = HTMLExtractor(
    remove_elements=["nav", "footer", "header", "script", "style"],
    preserve_headings=True,
    preserve_lists=True,
    preserve_tables=True,
    extract_links=True,
    main_content_selector="article, main, .content"  # CSS selector
)

# Extract from a local HTML file
result = extractor.extract("docs/documentation.html")
print(f"Title: {result.title}")
print(f"Headings found: {len(result.headings)}")
print(result.text[:500])

# Extract from a URL
url_result = extractor.extract_from_url(
    "https://docs.example.com/guide",
    follow_links=False,
    timeout=30
)

from quivr.processing.extractors import DocxExtractor, PptxExtractor
# Word documents
docx_extractor = DocxExtractor(
    preserve_formatting=True,
    extract_comments=True,
    extract_headers_footers=False
)

docx_result = docx_extractor.extract("docs/proposal.docx")
print(f"Sections: {len(docx_result.sections)}")
print(f"Tables: {len(docx_result.tables)}")
print(f"Comments: {len(docx_result.comments)}")

# PowerPoint presentations
pptx_extractor = PptxExtractor(
    extract_notes=True,
    extract_slide_text=True,
    image_ocr=False
)

pptx_result = pptx_extractor.extract("docs/presentation.pptx")
for slide in pptx_result.slides:
    print(f"Slide {slide.number}: {slide.title}")
    print(f" Text: {slide.text[:100]}...")
    if slide.notes:
        print(f" Notes: {slide.notes[:100]}...")

Raw extracted text is rarely ready for embedding. The cleaning pipeline handles normalization, noise removal, and structural preservation.
flowchart LR
A[Raw Text] --> B[Unicode Normalization]
B --> C[Whitespace Cleanup]
C --> D[Boilerplate Removal]
D --> E[Header/Footer Removal]
E --> F[Special Character Handling]
F --> G[Language Detection]
G --> H[Clean Text]
classDef step fill:#e8f5e8,stroke:#1b5e20
class A,B,C,D,E,F,G,H step
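
Several of these stages can be approximated with the standard library alone. The sketch below is a simplified illustration covering Unicode normalization, boilerplate removal, and whitespace cleanup, plus rejoining words hyphenated across line breaks; the `clean_text` function and `BOILERPLATE_PATTERNS` list are assumptions for this example and do not reflect Quivr's internals.

```python
import re
import unicodedata

BOILERPLATE_PATTERNS = [
    re.compile(r"^Page \d+ of \d+$", re.MULTILINE),
    re.compile(r"^[ \t]*\d+[ \t]*$", re.MULTILINE),  # page number alone on a line
]


def clean_text(raw: str) -> str:
    # 1. Unicode normalization: NFKC folds ligatures and full-width forms.
    text = unicodedata.normalize("NFKC", raw)
    # 2. Boilerplate removal: blank out lines that match known patterns.
    for pattern in BOILERPLATE_PATTERNS:
        text = pattern.sub("", text)
    # 3. Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(?<=\w)-\n(?=\w)", "", text)
    # 4. Whitespace cleanup: trim line ends, allow at most one blank line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# "\ufb01" is the single-character "fi" ligature often found in PDF text.
raw = "The \ufb01rst para-\ngraph.   \n\n\n\nPage 3 of 10\n\n12\n\nSecond paragraph."
print(clean_text(raw))
```

Order matters here: boilerplate lines are removed before blank lines are collapsed, so the gaps they leave behind are cleaned up in the same pass.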
from quivr.processing.cleaners import (
    TextCleaner,
    UnicodeNormalizer,
    WhitespaceCleaner,
    BoilerplateRemover,
    HeaderFooterRemover,
    SpecialCharHandler
)

# Compose a cleaning pipeline
cleaner = TextCleaner(steps=[
    UnicodeNormalizer(form="NFKC"),
    WhitespaceCleaner(
        collapse_newlines=True,
        max_consecutive_newlines=2,
        strip_trailing=True
    ),
    BoilerplateRemover(
        patterns=[
            r"Page \d+ of \d+",
            r"Confidential.*?Do not distribute",
            r"^\s*\d+\s*$",  # Page numbers alone on a line
        ]
    ),
    HeaderFooterRemover(
        header_lines=2,
        footer_lines=2,
        per_page=True
    ),
    SpecialCharHandler(
        replace_ligatures=True,   # ﬁ -> fi, ﬂ -> fl
        normalize_quotes=True,    # smart quotes -> straight
        normalize_dashes=True,    # em/en dash -> hyphen
        remove_control_chars=True
    )
])

# `extractor` is the PDFExtractor configured earlier in this chapter
raw_text = extractor.extract("docs/report.pdf").text
clean_text = cleaner.clean(raw_text)

print(f"Before: {len(raw_text)} chars")
print(f"After: {len(clean_text)} chars")
if raw_text:  # avoid dividing by zero when extraction returned nothing
    print(f"Reduction: {(1 - len(clean_text)/len(raw_text))*100:.1f}%")

import re
from quivr.processing.cleaners import CleanerStep


class LegalDocCleaner(CleanerStep):
    """Custom cleaner for legal documents."""

    def __init__(self):
        # Each entry is (regex pattern, replacement string or callable)
        self.patterns = [
            (r"\bSECTION\s+(\d+)", r"Section \1"),  # Normalize headings
            (r"\(\s*([a-z])\s*\)", r"(\1)"),        # Fix spacing in lists
            (r"(?<=\w)-\n(?=\w)", ""),              # Rejoin hyphenated words
            (r"\b(WHEREAS|HEREBY|THEREOF)\b", lambda m: m.group().title()),
        ]

    def process(self, text: str) -> str:
        for pattern, replacement in self.patterns:
            text = re.sub(pattern, replacement, text)
        return text


# Add the custom cleaner to the pipeline
cleaner.add_step(LegalDocCleaner())

# raw_legal_text: text previously extracted from a legal document
clean_legal = cleaner.clean(raw_legal_text)

Chunking is where most RAG pipelines succeed or fail. Too large and the chunks dilute relevant information; too small and you lose context. Quivr provides multiple chunking strategies for different use cases.
| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Fixed Size | General purpose | Simple, predictable | Cuts mid-sentence |
| Recursive Character | Most documents | Respects boundaries | Needs separator tuning |
| Sentence-Based | Articles, prose | Natural boundaries | Variable chunk sizes |
| Semantic | Technical docs | Meaning-preserving | Slower, model-dependent |
| Heading-Based | Structured docs | Section-aware | Requires headings |
| Sliding Window | Dense text | Overlapping context | More chunks, more cost |
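
The simplest of these strategies, fixed-size chunking with overlap, fits in a few lines. This is a minimal stand-alone sketch (the `fixed_size_chunks` function is a hypothetical helper, not a Quivr API), and unlike a production chunker it cuts purely by character position and ignores any preferred separator.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))
chunks = fixed_size_chunks(text, chunk_size=1000, chunk_overlap=200)
print(len(chunks), [len(c) for c in chunks])
```

Each window advances by `chunk_size - chunk_overlap` characters, which is why the overlap must be strictly smaller than the chunk size: otherwise the window would never move forward.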
from quivr.processing.chunkers import FixedSizeChunker

chunker = FixedSizeChunker(
    chunk_size=1000,    # Characters per chunk
    chunk_overlap=200,  # Overlap between chunks
    separator="\n"      # Preferred split point
)

chunks = chunker.split(clean_text)

print(f"Total chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i}: {len(chunk.text)} chars")
    print(f" Start: {chunk.text[:80]}...")
    print(f" Metadata: page={chunk.metadata.get('page')}, "
          f"position={chunk.metadata.get('position')}")

Recursive character splitting is the recommended default strategy. It tries to split at paragraph boundaries first, then sentences, then words.
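
The fallback order can be sketched as a recursive function that tries the coarsest separator first and only descends to finer ones for pieces that are still too long. This is a simplified illustration under stated assumptions: `recursive_split` and `DEFAULT_SEPARATORS` are hypothetical names, overlap is omitted, and the separator is dropped wherever a split occurs.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = DEFAULT_SEPARATORS) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones if needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nEnd."
for chunk in recursive_split(doc, chunk_size=40):
    print(repr(chunk))
```

Small neighboring pieces are merged back together up to `chunk_size`, so short paragraphs do not each become their own tiny chunk.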
from quivr.processing.chunkers import RecursiveChunker

chunker = RecursiveChunker(
    chunk_size=1000,
    chunk_overlap=200,
    separators=[
        "\n\n",  # Paragraph breaks (highest priority)
        "\n",    # Line breaks
        ". ",    # Sentence boundaries
        ", ",    # Clause boundaries
        " ",     # Word boundaries (last resort)
    ],
    keep_separator=True
)

chunks = chunker.split(clean_text)
print(f"Chunks: {len(chunks)}")

Semantic chunking uses embeddings to find natural topic boundaries within the text.
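
The core idea is to compare each sentence with the one before it and start a new chunk when similarity drops below a threshold. The sketch below swaps real embeddings for a toy bag-of-words vector so that it runs with the standard library only; the function names are hypothetical, and the low default threshold reflects that word-count similarity scores run much lower than those from a trained embedding model.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(sentence: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would use a model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: List[str], similarity_threshold: float = 0.2) -> List[str]:
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < similarity_threshold:
            groups.append([cur])    # topic shift: open a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "The invoice total includes tax.",
    "Tax is added to the invoice total at checkout.",
    "Our office dog is named Biscuit.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With a real embedding model the structure stays the same; only `bag_of_words` and the threshold would change.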
from quivr.processing.chunkers import SemanticChunker
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("all-MiniLM-L6-v2")

chunker = SemanticChunker(
    embedding_model=embed_model,
    min_chunk_size=200,
    max_chunk_size=1500,
    similarity_threshold=0.75,  # Split when similarity drops below
    buffer_size=1               # Sentences to look ahead/behind
)

chunks = chunker.split(clean_text)
for i, chunk in enumerate(chunks[:5]):
    print(f"Chunk {i}: {len(chunk.text)} chars, "
          f"coherence={chunk.metadata['coherence_score']:.2f}")

from quivr.processing.chunkers import HeadingChunker
chunker = HeadingChunker(
    heading_patterns=[
        r"^#{1,3}\s+",       # Markdown headings
        r"^[A-Z][^.!?]*$",   # Capitalized lines with no sentence punctuation
        r"^\d+\.\d*\s+\w",   # Numbered sections like "1.2 Overview"
    ],
    max_chunk_size=2000,
    include_heading_in_chunk=True,
    hierarchical_metadata=True  # Track heading nesting
)

chunks = chunker.split(clean_text)
for chunk in chunks[:3]:
    meta = chunk.metadata
    print(f"Section: {meta.get('heading', 'N/A')}")
    print(f" Level: {meta.get('heading_level', 'N/A')}")
    print(f" Parent: {meta.get('parent_heading', 'N/A')}")
    print(f" Length: {len(chunk.text)} chars")

Every chunk carries metadata that helps with retrieval filtering, citation, and debugging.
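
As a rough picture of what such metadata involves, the sketch below wraps plain text pieces in a small dataclass and records position fields alongside shared provenance. The `Chunk` dataclass and `attach_metadata` helper are assumptions for illustration only, and the offsets are correct only for consecutive, non-overlapping pieces.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def attach_metadata(pieces: List[str], source: Dict[str, object]) -> List[Chunk]:
    """Wrap consecutive, non-overlapping pieces and record where each one sits."""
    chunks = []
    offset = 0
    for index, piece in enumerate(pieces):
        chunks.append(Chunk(
            text=piece,
            metadata={
                **source,  # shared provenance fields (filename, type, ...)
                "chunk_index": index,
                "total_chunks": len(pieces),
                "start_char": offset,
                "end_char": offset + len(piece),
                "word_count": len(piece.split()),
            },
        ))
        offset += len(piece)
    return chunks


chunks = attach_metadata(
    ["Revenue grew 12%. ", "Costs were flat."],
    {"filename": "report.pdf", "document_type": "report"},
)
for chunk in chunks:
    print(chunk.metadata)
```

Character offsets make it possible to map a retrieved chunk back to its exact location in the source text, which is what citations and debugging depend on.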
from quivr.processing.chunkers import RecursiveChunker
from quivr.processing.metadata import MetadataEnricher

chunker = RecursiveChunker(chunk_size=1000, chunk_overlap=200)
enricher = MetadataEnricher()

chunks = chunker.split(clean_text)

# Enrich chunks with metadata
enriched_chunks = enricher.enrich(
    chunks=chunks,
    source_metadata={
        "filename": "annual-report.pdf",
        "source": "finance-team",
        "uploaded_by": "jane@company.com",
        "upload_date": "2024-06-15",
        "document_type": "report",
        "language": "en"
    }
)

for chunk in enriched_chunks[:2]:
    print(f"Chunk ID: {chunk.id}")
    print(f" Text preview: {chunk.text[:80]}...")
    print(" Metadata:")
    for key, value in chunk.metadata.items():
        print(f" {key}: {value}")

# Each chunk contains the following metadata structure
chunk_metadata = {
    # Source information
    "filename": "report.pdf",
    "source": "finance-team",
    "document_type": "report",

    # Position information
    "chunk_index": 5,
    "total_chunks": 42,
    "page_number": 3,
    "start_char": 4200,
    "end_char": 5180,

    # Content information
    "heading": "Revenue Analysis",
    "parent_heading": "Financial Summary",
    "language": "en",
    "word_count": 182,
    "has_table": False,
    "has_code": False,

    # Processing information
    "chunker": "recursive",
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "cleaning_steps": ["unicode", "whitespace", "boilerplate"],
    "processed_at": "2024-06-15T10:30:00Z"
}

Here is a complete pipeline that ties extraction, cleaning, and chunking together.
import glob

from quivr.processing.pipeline import DocumentPipeline
from quivr.processing.extractors import PDFExtractor, HTMLExtractor
from quivr.processing.cleaners import TextCleaner, UnicodeNormalizer, WhitespaceCleaner
from quivr.processing.chunkers import RecursiveChunker
from quivr.processing.metadata import MetadataEnricher

# Build the pipeline
pipeline = DocumentPipeline(
    extractors={
        ".pdf": PDFExtractor(ocr_enabled=True),
        ".html": HTMLExtractor(remove_elements=["nav", "footer"]),
    },
    cleaner=TextCleaner(steps=[
        UnicodeNormalizer(form="NFKC"),
        WhitespaceCleaner(collapse_newlines=True),
    ]),
    chunker=RecursiveChunker(
        chunk_size=1000,
        chunk_overlap=200
    ),
    enricher=MetadataEnricher()
)

# Process a batch of files
files = glob.glob("docs/**/*.*", recursive=True)
all_chunks = []

for file_path in files:
    try:
        chunks = pipeline.process(
            file_path=file_path,
            metadata={"source": "docs-folder"}
        )
        all_chunks.extend(chunks)
        print(f"Processed {file_path}: {len(chunks)} chunks")
    except Exception as e:
        # Keep going so one bad file does not abort the whole batch
        print(f"ERROR processing {file_path}: {e}")

print(f"\nTotal chunks: {len(all_chunks)}")
print(f"Total files: {len(files)}")
if files:  # avoid dividing by zero when the folder is empty
    print(f"Avg chunks/file: {len(all_chunks)/len(files):.1f}")

from quivr.processing.pipeline import DocumentPipeline
from quivr.processing.callbacks import ProgressCallback


class IngestionMonitor(ProgressCallback):
    """Track ingestion progress and quality metrics."""

    def __init__(self):
        self.stats = {
            "files_processed": 0,
            "files_failed": 0,
            "total_chunks": 0,
            "total_chars": 0,
            "ocr_pages": 0,
            "low_quality_pages": 0
        }

    def on_file_start(self, file_path: str):
        print(f"Processing: {file_path}")

    def on_file_complete(self, file_path: str, chunks: list):
        self.stats["files_processed"] += 1
        self.stats["total_chunks"] += len(chunks)
        self.stats["total_chars"] += sum(len(c.text) for c in chunks)

    def on_file_error(self, file_path: str, error: Exception):
        self.stats["files_failed"] += 1
        print(f" FAILED: {error}")

    def on_ocr_page(self, page_num: int, confidence: float):
        self.stats["ocr_pages"] += 1
        if confidence < 0.7:
            self.stats["low_quality_pages"] += 1

    def report(self):
        s = self.stats
        print(f"\n{'='*50}")
        print("Ingestion Report")
        print(f"{'='*50}")
        print(f"Files processed: {s['files_processed']}")
        print(f"Files failed: {s['files_failed']}")
        print(f"Total chunks: {s['total_chunks']}")
        print(f"Total chars: {s['total_chars']:,}")
        print(f"OCR pages: {s['ocr_pages']}")
        print(f"Low quality: {s['low_quality_pages']}")


monitor = IngestionMonitor()
pipeline = DocumentPipeline(callbacks=[monitor])

# `files` is the list of paths collected with glob in the previous example
for file_path in files:
    pipeline.process(file_path)

monitor.report()

| Problem | Cause | Solution |
|---|---|---|
| Garbled text from PDF | Scanned pages without OCR | Enable ocr_fallback=True |
| Empty chunks generated | Short pages or heavy boilerplate | Set min_chunk_size=100 and filter |
| Duplicate content in chunks | Excessive overlap | Reduce chunk_overlap to 10-20% of chunk_size |
| Missing tables | Extractor not configured | Set extract_tables=True on extractor |
| Encoding errors | Non-UTF-8 source files | Use chardet to detect and convert encoding |
| Slow processing on large PDFs | Page-by-page OCR | Process pages in parallel with n_workers |
| Heading structure lost | Plain text extraction | Use preserve_headings=True on extractor |
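
Two of these fixes, dropping near-empty chunks and removing duplicates, can be applied as a post-processing filter regardless of which chunker produced the chunks. The sketch below is a minimal stand-alone version; `filter_chunks` is a hypothetical helper, and it only catches duplicates that are identical after whitespace and case normalization, not near-duplicates caused by overlap.

```python
import hashlib
from typing import Iterable, List


def filter_chunks(chunks: Iterable[str], min_chunk_size: int = 100) -> List[str]:
    """Drop chunks that are too short or that repeat an earlier chunk."""
    seen = set()
    kept = []
    for chunk in chunks:
        # Normalize whitespace and case so trivial differences do not hide duplicates.
        normalized = " ".join(chunk.split()).lower()
        if len(normalized) < min_chunk_size:
            continue  # empty or near-empty chunk
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(chunk)
    return kept


sample_chunks = ["alpha " * 30, "", "  short ", "ALPHA " * 30, "beta " * 30]
print(len(filter_chunks(sample_chunks)))
```

Storing digests instead of full strings keeps memory use flat when filtering a large corpus.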
from quivr.processing.diagnostics import ExtractionAnalyzer

analyzer = ExtractionAnalyzer()

# Analyze extraction quality for a document
report = analyzer.analyze(
    file_path="docs/report.pdf",
    extracted_text=result.text,
    chunks=chunks
)

print(f"Extraction coverage: {report.coverage_pct:.1f}%")
print(f"Empty pages: {report.empty_pages}")
print(f"Average chunk quality: {report.avg_chunk_quality:.2f}")
print(f"Duplicate chunks: {report.duplicate_count}")

# Visualize chunk size distribution
report.plot_chunk_distribution()

Document processing is the foundation of any RAG pipeline. In this chapter you learned how Quivr's processing engine works from end to end:
- Format Detection automatically routes files to the correct extractor
- Text Extraction handles PDF (with OCR fallback), HTML, Office, and image formats
- Cleaning Pipelines normalize unicode, remove boilerplate, and fix formatting
- Chunking Strategies range from simple fixed-size to semantic-aware splitting
- Metadata Enrichment tags every chunk with provenance and position information
- Quality Monitoring tracks extraction coverage and chunk quality metrics

Key takeaways:

- Choose the right chunking strategy -- recursive character splitting is a safe default, but semantic chunking yields better results for technical documents.
- Always clean before chunking -- boilerplate and formatting noise degrade retrieval quality.
- Preserve metadata -- chunk provenance is essential for citations and debugging.
- Monitor extraction quality -- use the diagnostics analyzer to catch OCR failures and empty chunks early.
- Tune chunk size and overlap -- start with 1000 characters and 200 overlap, then adjust based on your retrieval performance.
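
When tuning these two numbers it helps to see what overlap costs. The sketch below estimates how many chunks a sliding window produces for a given text length; `estimate_chunk_count` is a hypothetical helper for back-of-the-envelope sizing and assumes purely character-based windows with no separator awareness.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Count the sliding windows needed to cover total_chars characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # characters of new text per extra chunk
    return 1 + math.ceil((total_chars - chunk_size) / step)


for overlap in (0, 100, 200, 400):
    count = estimate_chunk_count(100_000, 1000, overlap)
    print(f"overlap={overlap:>3} ({overlap / 1000:.0%}): {count} chunks")
```

For a 100,000-character document with 1000-character chunks, moving from no overlap to 20% overlap raises the count from 100 to 125 chunks, so embedding and storage costs grow by roughly a quarter.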
Your documents are now extracted, cleaned, and chunked. In Chapter 3: Vector Embeddings, you will learn how to convert these chunks into vector embeddings and store them in a vector database for fast similarity search.
Built with insights from the Quivr project.
Most teams struggle here because the hard part is not writing more code but deciding on clear boundaries for `print`, `chunks`, and `processing`, so that behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 2: Document Processing as an operating subsystem inside Quivr Tutorial: Open-Source RAG Framework for Document Ingestion, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `quivr`, `text`, and `chunk` as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 2: Document Processing usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for `print`.
- Input normalization: shape incoming data so `chunks` receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through `processing`.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- View Repo
  Why it matters: authoritative reference on `View Repo` (github.com).
- AI Codebase Knowledge Builder
  Why it matters: authoritative reference on `AI Codebase Knowledge Builder` (github.com).
Suggested trace strategy:
- search upstream code for `print` and `chunks` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production