Skip to content

InterPARES-Vision is an advanced OCR (Optical Character Recognition) and layout analysis tool designed specifically for archival documents. It combines state-of-the-art AI vision models to extract text, preserve document structure, and generate machine-readable outputs from scanned documents and images.

Notifications You must be signed in to change notification settings

UBC-NLP/InterPARES_vision

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

3 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ” InterPARES-Vision

Advanced Document Layout Analysis & Structured Output Generation

Live Demo Python License

A powerful AI-driven tool for archivists to digitize, structure, and interact with documents using advanced OCR and vision models.

🌐 Live Demo

Try InterPARES-Vision online: demos.dlnlp.ai/InterPARES/

No installation required - access the full functionality through your web browser!


πŸ“– Overview

InterPARES-Vision is an advanced OCR (Optical Character Recognition) and layout analysis tool designed specifically for archival documents. It combines state-of-the-art AI vision models to extract text, preserve document structure, and generate machine-readable outputs from scanned documents and images.

πŸ’‘ Key Capabilities

  • Document Structure Understanding: Identifies headings, paragraphs, tables, lists, and maintains proper reading order
  • Interactive AI Chat: Ask questions about parsed documents using natural language
  • Metadata Extraction: Request translations, summaries, and structured metadata through conversational queries
  • Multi-format Output: Generate Markdown, JSON, and annotated visualizations
  • Batch Processing: Handle multi-page PDFs with consistent quality

✨ Features

🎯 Core Features

  • Layout Detection: Identifies document regions including text blocks, tables, images, headings, and captions
  • OCR Text Extraction: Extracts text with high accuracy, even from degraded or complex documents
  • Structure Preservation: Maintains document hierarchy and reading order
  • Multi-page PDF Support: Process entire PDF documents with page-by-page analysis
  • Interactive Visualization: View detected layout regions overlaid on original documents

πŸ’¬ AI-Powered Chat

  • Natural Language Queries: Ask questions about document content in plain language
  • Metadata Generation: Extract structured metadata in JSON format for archival systems
  • Translation Support: Request translations of document sections or entire documents
  • Summarization: Get concise summaries and key information extraction
  • Classification Assistance: Identify document types and suggest archival categories

πŸ“Š Output Formats

  1. Markdown: Formatted text with preserved structure and hierarchy
  2. JSON: Structured data with bounding boxes, element types, and coordinates
  3. Annotated Images: Visual overlay showing detected layout regions with color-coded boxes
  4. Downloadable Results: ZIP package with all output formats and original files

πŸ“ Supported File Types

Format Extensions Description
PDF Documents .pdf Multi-page or single-page PDF files (processed page-by-page)
Images .jpg, .jpeg, .png Scanned images or photographs of documents

πŸ’‘ Best Results: Use high-resolution scans (200+ DPI) with good contrast and minimal skew.


πŸš€ Quick Start

Prerequisites

  • Python 3.8 or higher
  • CUDA-compatible GPU (recommended for optimal performance)
  • 8GB+ RAM (16GB+ recommended for large documents)

Installation

# Clone the repository
git clone https://github.com/your-org/InterPARES-vision.git
cd InterPARES-vision

# Install dependencies
pip install -r requirements.txt

# Install DotsOCR parser
pip install dots-ocr

Running the Application

# Start the application (default port: 7860)
python app.py 7860

# Or specify a custom port
python app.py 8080

The application will be available at http://localhost:7860 (or your specified port).


πŸ“˜ Usage Guide

1️⃣ Select or Upload a Document

Option A: Use Example Documents

  • Click on any thumbnail in the "πŸ“₯ Select Example Document" gallery
  • Browse through available examples using Previous/Next buttons

Option B: Upload Your Own

  • Click "πŸ“ Upload PDF or Image" button
  • Select a file from your computer (PDF, JPG, PNG)

2️⃣ Navigate Multi-Page Documents

For PDF files:

  • Use β¬… Previous and Next ➑ buttons to browse pages
  • View current position with page counter (e.g., "2 / 10")

3️⃣ Choose a Prompt Mode

Mode Description Best For
prompt_layout_all_en Full analysis: layout + OCR + reading order Complex documents with mixed content
prompt_layout_only_en Layout detection without text extraction Understanding document organization
prompt_ocr OCR-focused with minimal layout Simple text documents

πŸ’‘ Recommendation: Start with prompt_layout_all_en for comprehensive analysis.

4️⃣ Parse the Document

Click πŸ” Parse to begin processing. The system will:

  • Analyze document layout
  • Extract text from detected regions
  • Generate structured output in multiple formats

5️⃣ View Results

Results appear in three tabs:

  • Markdown Render Preview: Human-readable formatted view
  • Markdown Raw Text: Plain Markdown with formatting codes
  • Current Page JSON: Structured data with coordinates and element types

6️⃣ Ask Questions (πŸ’¬ Interactive Chat)

After parsing, use the AI chat feature:

Example Questions:
- "Extract the main keywords for archival indexing"
- "What is the document type and subject matter?"
- "Extract metadata in JSON format"
- "Translate the summary section into French"
- "List all dates, names, and locations mentioned"

7️⃣ Download Results

Click ⬇️ Download Results to get a ZIP file containing:

  • Layout images with annotations
  • JSON files with structured data
  • Markdown files with formatted text
  • Original input file

πŸ›οΈ Archival Applications

Use Cases for Archivists

  1. Digitization Projects: Convert scanned documents to searchable, structured text
  2. Metadata Extraction: Automatically generate catalog records and finding aids
  3. Collection Assessment: Rapidly evaluate document content and significance
  4. Multilingual Access: Translate documents for broader accessibility
  5. Data Extraction: Pull structured information from historical records
  6. Classification Support: AI-assisted document type and subject identification

Best Practices

  • βœ… Use consistent scan settings (200+ DPI) for optimal results
  • βœ… Process similar document types together with the same prompt mode
  • βœ… Review sample outputs (5-10%) from each batch for quality assurance
  • βœ… Keep original scans alongside OCR outputs in your digital repository
  • βœ… Document processing settings (tool version, prompt mode, date) in metadata
  • βœ… Verify AI-generated metadata against professional archival standards

πŸ”§ Configuration

Server Configuration

Default settings in app.py:

DEFAULT_CONFIG = {
    'ip': "127.0.0.1",
    'port_vllm': 8001,
    'min_pixels': MIN_PIXELS,
    'max_pixels': MAX_PIXELS,
    'test_images_dir': "./assets/showcase_origin",
}

Chat Model Configuration

The chat feature uses vLLM with OpenAI-compatible API:

chat_client = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model="Qwen3-4B-Instruct-2507-FP8",
    temperature=0.1,
    max_tokens=16000,
    streaming=True
)

πŸ“š Documentation

For detailed documentation, see:


Development Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -r requirements-dev.txt


πŸ“ž Support


About

InterPARES-Vision is an advanced OCR (Optical Character Recognition) and layout analysis tool designed specifically for archival documents. It combines state-of-the-art AI vision models to extract text, preserve document structure, and generate machine-readable outputs from scanned documents and images.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages