DocIntel Pro is a high-throughput, agentic document processing system designed to transform unstructured documents (PDFs, images) into structured, actionable intelligence. It leverages state-of-the-art vision-language models (VLMs) such as Qwen2-VL (via Fireworks AI) to perform unified layout analysis, OCR, and semantic understanding in a single pass.
"Visual First, Text Second." Unlike traditional OCR pipelines that lose context, DocIntel Pro understands the visual structure of your document—charts, tables, seals, and handwriting—preserving the semantic meaning of your data.
The system follows a Cloud-Native Microservices Architecture, orchestrated by a central workflow engine.
```mermaid
flowchart TB
    User((User)) -->|Uploads Doc| UI["Frontend (Next.js)"]
    UI -->|POST /analyze| Orch{"Orchestrator Service"}
    subgraph Core_Services [Backend Services]
        direction TB
        Orch -->|1. Convert/Normalize| Preprocess[Preprocessing Service]
        Orch -->|2. Parallel Analysis| Visual[Visual Intelligence Service]
        subgraph Visual_Logic [Visual Logic]
            Visual -->|Extract| VLM["Fireworks AI (Qwen2-VL)"]
            VLM -->|JSON| Parser[Layout Parser]
        end
    end
    Preprocess -->|Clean Images| Orch
    Visual -->|Structured Blocks| Orch
    Orch -->|3. Aggregate & Order| Response[Final JSON Response]
    Response --> UI
```
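Step 3 ("Aggregate & Order") can be illustrated with a small helper that merges per-page blocks into reading order. This is a minimal sketch; the block shape and field names (`bbox`, `text`, `page`) are assumptions for illustration, not the service's actual schema:

```python
def aggregate_blocks(pages: list[list[dict]]) -> list[dict]:
    """Merge per-page block lists into a single, ordered result.

    Each block is assumed to carry a bounding box as [x0, y0, x1, y1].
    Within a page, blocks are sorted top-to-bottom, then left-to-right,
    and tagged with their 1-based page number.
    """
    ordered = []
    for page_num, blocks in enumerate(pages, start=1):
        for block in sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0])):
            ordered.append({**block, "page": page_num})
    return ordered
```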
| Service | Port | Tech Stack | Responsibilities |
|---|---|---|---|
| Frontend | :3000 | Next.js, React, Tailwind, Shadcn/UI | Modern, responsive UI for uploading documents and visualizing results with bounding boxes. |
| Orchestrator | :8000 | FastAPI, Python, AsyncIO | Workflow engine: file intake, routing to workers, error handling, and result aggregation. |
| Preprocessing | :8001 | FastAPI, OpenCV, pdf2image | CPU-bound image operations: PDF-to-image conversion, denoising, deskewing, normalization. |
| Visual Intelligence | :8002 | FastAPI, Fireworks AI SDK | GPU-accelerated inference. Interfaces with VLMs to detect layout, extract text, and recognize tables/figures in one step. |
- Unified Analysis: Uses Vision-Language Models to perform Layout Analysis and OCR simultaneously.
- Multi-Page Support: Handles multi-page PDFs with parallel processing for high performance.
- Visual Grounding: Returns precise bounding boxes for every detected element (Text, Tables, Figures).
- Resilient Architecture: Exponential backoff and retries for external API calls; service isolation prevents cascading failures.
- Modern UI: A beautiful, responsive interface with real-time feedback and visual overlays.
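The retry behaviour listed above can be sketched as a small async helper. The names and parameters here are illustrative only, not the services' actual API:

```python
import asyncio
import random

async def call_with_backoff(fn, *, retries: int = 3, base_delay: float = 0.5):
    """Retry an async callable with exponential backoff plus jitter.

    Intended for external calls (e.g. the Fireworks AI API) so that
    transient failures don't fail the whole document analysis.
    """
    for attempt in range(retries + 1):
        try:
            return await fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            # Wait base_delay * 2^attempt, with small jitter, then retry.
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In the real services a narrower exception type (e.g. a timeout or HTTP error) would be caught instead of bare `Exception`.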
- Python 3.10+ (Recommended: use Conda)
- Node.js 18+ & npm/bun
- Poppler utils (required for PDF processing)
  - Ubuntu: `sudo apt-get install poppler-utils`
  - macOS: `brew install poppler`
- Fireworks AI API Key: get one at fireworks.ai
Clone the repository and create a virtual environment:
```bash
git clone https://github.com/yourusername/docintel-pro.git
cd docintel-pro

# Create Conda environment
conda create -n doc_analysis_env python=3.10 -y
conda activate doc_analysis_env

# Install backend dependencies
pip install -r preprocessing_service/requirements.txt
pip install -r visual_service/requirements.txt
pip install -r orchestrator/requirements.txt
```

If a service has no `requirements.txt`, install the commonly used libraries directly: `fastapi`, `uvicorn`, `python-multipart`, `httpx`, `opencv-python-headless`, `pdf2image`, `fireworks-ai`.

Create a `.env` file in the root directory (or ensure services indicate where to look).
For the Visual Service, export your API key:

```bash
export FIREWORKS_API_KEY="your_api_key_here"
```

It is recommended to run each service in a separate terminal window, or to use a process manager such as Supervisord or Docker Compose (coming soon).
Terminal 1: Preprocessing Service

```bash
uvicorn preprocessing_service.main:app --port 8001 --reload
```

Terminal 2: Visual Intelligence Service

```bash
uvicorn visual_service.main:app --port 8002 --reload
```

Terminal 3: Orchestrator

```bash
uvicorn orchestrator.main:app --port 8000 --reload
```

Terminal 4: Frontend

```bash
cd frontend
npm install
npm run dev
```

The UI will be available at http://localhost:3000.
- Docker Compose: One-click startup for the entire stack.
- Database Integration: Persist analysis results (PostgreSQL/MongoDB).
- Streaming Responses: Real-time progress updates for long documents.
- Local Inference: Support for local VLMs (e.g., LLaVA, BakLLaVA) via Ollama.
Contributions are welcome! Please fork the repository and submit a Pull Request.
This project is licensed under the MIT License.