Advanced Document Layout Analysis & Structured Output Generation
A powerful AI-driven tool for archivists to digitize, structure, and interact with documents using advanced OCR and vision models.
Try InterPARES-Vision online: demos.dlnlp.ai/InterPARES/
No installation required - access the full functionality through your web browser!
InterPARES-Vision is an advanced OCR (Optical Character Recognition) and layout analysis tool designed specifically for archival documents. It combines state-of-the-art AI vision models to extract text, preserve document structure, and generate machine-readable outputs from scanned documents and images.
- Document Structure Understanding: Identifies headings, paragraphs, tables, lists, and maintains proper reading order
- Interactive AI Chat: Ask questions about parsed documents using natural language
- Metadata Extraction: Request translations, summaries, and structured metadata through conversational queries
- Multi-format Output: Generate Markdown, JSON, and annotated visualizations
- Batch Processing: Handle multi-page PDFs with consistent quality
- Layout Detection: Identifies document regions including text blocks, tables, images, headings, and captions
- OCR Text Extraction: Extracts text with high accuracy, even from degraded or complex documents
- Structure Preservation: Maintains document hierarchy and reading order
- Multi-page PDF Support: Process entire PDF documents with page-by-page analysis
- Interactive Visualization: View detected layout regions overlaid on original documents
- Natural Language Queries: Ask questions about document content in plain language
- Metadata Generation: Extract structured metadata in JSON format for archival systems
- Translation Support: Request translations of document sections or entire documents
- Summarization: Get concise summaries and key information extraction
- Classification Assistance: Identify document types and suggest archival categories
- Markdown: Formatted text with preserved structure and hierarchy
- JSON: Structured data with bounding boxes, element types, and coordinates
- Annotated Images: Visual overlay showing detected layout regions with color-coded boxes
- Downloadable Results: ZIP package with all output formats and original files
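The JSON output pairs each detected element with its type and coordinates, which makes downstream scripting straightforward. As a rough sketch (the field names below are invented for illustration; inspect a real export for the exact schema), a script might tally element types per page:

```python
import json
from collections import Counter

# Hypothetical page-level JSON: a list of layout elements, each with a
# category, bounding box, and extracted text. Field names are illustrative.
sample = json.loads("""
[
  {"category": "Title",   "bbox": [72, 40, 540, 90],   "text": "Annual Report"},
  {"category": "Text",    "bbox": [72, 110, 540, 300], "text": "..."},
  {"category": "Table",   "bbox": [72, 320, 540, 560], "text": ""},
  {"category": "Caption", "bbox": [72, 570, 540, 600], "text": "Table 1."}
]
""")

# Tally element types, e.g. to gauge how table-heavy a collection is.
counts = Counter(el["category"] for el in sample)
print(dict(counts))  # {'Title': 1, 'Text': 1, 'Table': 1, 'Caption': 1}
```

A tally like this can help decide which prompt mode to use for a batch before committing to full processing.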
| Format | Extensions | Description |
|---|---|---|
| PDF Documents | .pdf | Multi-page or single-page PDF files (processed page-by-page) |
| Images | .jpg, .jpeg, .png | Scanned images or photographs of documents |
💡 Best Results: Use high-resolution scans (200+ DPI) with good contrast and minimal skew.
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for optimal performance)
- 8GB+ RAM (16GB+ recommended for large documents)
# Clone the repository
git clone https://github.com/your-org/InterPARES-vision.git
cd InterPARES-vision
# Install dependencies
pip install -r requirements.txt
# Install DotsOCR parser
pip install dots-ocr

# Start the application (default port: 7860)
python app.py 7860

# Or specify a custom port
python app.py 8080

The application will be available at http://localhost:7860 (or your specified port).
Option A: Use Example Documents
- Click on any thumbnail in the "📥 Select Example Document" gallery
- Browse through available examples using Previous/Next buttons
Option B: Upload Your Own
- Click the "Upload PDF or Image" button
- Select a file from your computer (PDF, JPG, PNG)
For PDF files:
- Use the ⬅ Previous and Next ➡ buttons to browse pages
- View current position with page counter (e.g., "2 / 10")
| Mode | Description | Best For |
|---|---|---|
| prompt_layout_all_en | Full analysis: layout + OCR + reading order | Complex documents with mixed content |
| prompt_layout_only_en | Layout detection without text extraction | Understanding document organization |
| prompt_ocr | OCR-focused with minimal layout | Simple text documents |
💡 Recommendation: Start with prompt_layout_all_en for comprehensive analysis.
Click Parse to begin processing. The system will:
- Analyze document layout
- Extract text from detected regions
- Generate structured output in multiple formats
Results appear in three tabs:
- Markdown Render Preview: Human-readable formatted view
- Markdown Raw Text: Plain Markdown with formatting codes
- Current Page JSON: Structured data with coordinates and element types
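The Current Page JSON tab exposes each element's coordinates, which downstream scripts can use directly. As a minimal sketch (the `bbox` field name and the `[x1, y1, x2, y2]` convention are assumptions for illustration), detected regions could be sorted into a simple top-to-bottom, left-to-right reading order:

```python
# Illustrative: order detected regions by their top edge, then left edge.
# Real multi-column layouts may need column detection first.
elements = [
    {"text": "right column", "bbox": [300, 100, 560, 400]},
    {"text": "header",       "bbox": [40, 20, 560, 60]},
    {"text": "left column",  "bbox": [40, 100, 280, 400]},
]

ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
print([e["text"] for e in ordered])
# ['header', 'left column', 'right column']
```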
After parsing, use the AI chat feature:
Example Questions:
- "Extract the main keywords for archival indexing"
- "What is the document type and subject matter?"
- "Extract metadata in JSON format"
- "Translate the summary section into French"
- "List all dates, names, and locations mentioned"
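Because the chat backend speaks an OpenAI-compatible API, queries like these can also be scripted rather than typed into the UI. A minimal sketch of a request payload, assuming the model name from the configuration shown later in this README (adjust to your deployment):

```python
import json

# Build an OpenAI-style chat-completions payload for a metadata query.
# The system prompt here is illustrative; the app supplies its own context.
payload = {
    "model": "Qwen3-4B-Instruct-2507-FP8",
    "temperature": 0.1,
    "messages": [
        {"role": "system",
         "content": "You answer questions about the parsed document below."},
        {"role": "user",
         "content": "Extract metadata in JSON format"},
    ],
}
print(json.dumps(payload, indent=2))
```

The same payload can be POSTed to the server's /v1/chat/completions endpoint with any HTTP client.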
Click ⬇️ Download Results to get a ZIP file containing:
- Layout images with annotations
- JSON files with structured data
- Markdown files with formatted text
- Original input file
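For automated ingest into a repository, the downloaded archive can be unpacked programmatically. A minimal sketch (member names inside the ZIP are examples, not a guaranteed layout):

```python
import zipfile
from pathlib import Path

def unpack_results(zip_path, dest):
    """Extract a results ZIP into dest and return the names of its members."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

A batch script could call this per download, then route the JSON files to a catalog pipeline and the Markdown files to full-text search.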
- Digitization Projects: Convert scanned documents to searchable, structured text
- Metadata Extraction: Automatically generate catalog records and finding aids
- Collection Assessment: Rapidly evaluate document content and significance
- Multilingual Access: Translate documents for broader accessibility
- Data Extraction: Pull structured information from historical records
- Classification Support: AI-assisted document type and subject identification
- ✅ Use consistent scan settings (200+ DPI) for optimal results
- ✅ Process similar document types together with the same prompt mode
- ✅ Review sample outputs (5-10%) from each batch for quality assurance
- ✅ Keep original scans alongside OCR outputs in your digital repository
- ✅ Document processing settings (tool version, prompt mode, date) in metadata
- ✅ Verify AI-generated metadata against professional archival standards
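The settings-documentation practice above can be sketched as a small manifest written alongside each batch's outputs (field names are illustrative, not an archival standard):

```python
import json
from datetime import date

def make_manifest(prompt_mode, tool_version, files):
    """Serialize the processing settings for a batch as a JSON manifest."""
    manifest = {
        "tool": "InterPARES-Vision",
        "tool_version": tool_version,
        "prompt_mode": prompt_mode,
        "processed_on": date.today().isoformat(),
        "files": files,
    }
    return json.dumps(manifest, indent=2)
```

Storing one such file per batch makes it possible to trace any OCR output back to the tool version and prompt mode that produced it.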
Default settings in app.py:
DEFAULT_CONFIG = {
'ip': "127.0.0.1",
'port_vllm': 8001,
'min_pixels': MIN_PIXELS,
'max_pixels': MAX_PIXELS,
'test_images_dir': "./assets/showcase_origin",
}

The chat feature uses vLLM with an OpenAI-compatible API:
from langchain_openai import ChatOpenAI

chat_client = ChatOpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
model="Qwen3-4B-Instruct-2507-FP8",
temperature=0.1,
max_tokens=16000,
streaming=True
)

For detailed documentation, see:
- User Guide - Comprehensive feature walkthrough
- API Documentation - Developer reference (coming soon)
- Archival Workflows - Best practices guide (coming soon)
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -r requirements-dev.txt
- Live Demo: demos.dlnlp.ai/InterPARES/
- Issues: GitHub Issues