DocuTrace is a specialized auditing tool designed for high-stakes domains (Legal, Finance, Compliance) where AI hallucinations are unacceptable.
Unlike standard RAG (Retrieval Augmented Generation) pipelines that summarize text, DocuTrace utilizes Google's LangExtract library to perform Source Grounding. Every extracted data point is cryptographically linked to its specific coordinates in the source PDF, generating an interactive HTML audit trail.
Building a verifiable extraction engine involves overcoming significant "dependency hell" and rate-limiting barriers.
- The Challenge: We deployed using the standard `python:3.10-slim` Docker image to minimize boot time. However, the `LangExtract` library requires installation directly from GitHub source, which depends on `git`. The build failed with `ExecutableNotFound: git`.
- The Solution: Architected a custom multi-stage Dockerfile that injects system-level dependencies (`apt-get install git`) before the Python environment initializes, ensuring a successful build without bloating the final image.
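A minimal sketch of this multi-stage pattern (illustrative only; the image tags, paths, and commands here are assumptions, not the project's actual Dockerfile):

```dockerfile
# Stage 1: builder image with git available, so pip can install from GitHub source.
FROM python:3.10-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: slim runtime image without git, keeping the final image small.
FROM python:3.10-slim
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*
WORKDIR /app
COPY . /app
CMD ["streamlit", "run", "app.py"]
```

Because `git` lives only in the builder stage, the runtime image stays slim while GitHub-sourced packages still install cleanly.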
- The Challenge: Financial 10-K reports often contain scanned pages or complex layouts that break standard OCR tools.
- The Solution: Implemented a robust `pypdf` pre-processing layer with a "Fail-Fast" mechanism. The system validates text density before passing data to the LLM, preventing wasted API tokens on unreadable files.
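The fail-fast gate can be sketched as a plain-Python check over per-page text already extracted by `pypdf` (the threshold and function name below are illustrative assumptions, not the project's actual code):

```python
# Illustrative "fail-fast" density gate. `page_texts` is assumed to be the list of
# strings produced upstream by pypdf's page.extract_text().

MIN_CHARS_PER_PAGE = 200  # assumed cutoff below which a page is treated as scanned/unreadable

def validate_text_density(page_texts, min_chars=MIN_CHARS_PER_PAGE):
    """Return 1-indexed pages with enough extracted text; fail fast if none qualify."""
    readable = [
        i for i, text in enumerate(page_texts, start=1)
        if len(text.strip()) >= min_chars
    ]
    if not readable:
        # Abort before any LLM call, so no API tokens are spent on an unreadable file.
        raise ValueError("No machine-readable text found; the PDF may be a pure scan.")
    return readable
```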
- The Challenge: LangExtract is designed for enterprise usage (Vertex AI) and attempts parallel chunk processing. On the Gemini Free Tier, this triggered `429 Resource Exhausted` errors immediately.
- The Solution: Implemented "Extraction Window" logic in the UI. Users select specific page ranges (e.g., "Risk Factors, Pages 15-20") rather than processing the entire 100-page document at once, keeping the request volume within the 15 RPM limit.
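The window-plus-pacing idea can be sketched as follows (function names and the sequential pacing helper are illustrative assumptions; only the 15 RPM figure comes from the text):

```python
import time

RPM_LIMIT = 15                    # Gemini Free Tier requests-per-minute ceiling
MIN_INTERVAL = 60.0 / RPM_LIMIT   # ~4 s between sequential requests

def select_window(pages, first, last):
    """Keep only pages inside the user-selected window (1-indexed, inclusive)."""
    return [page for i, page in enumerate(pages, start=1) if first <= i <= last]

def paced_calls(chunks, call, min_interval=MIN_INTERVAL):
    """Run `call` on each chunk sequentially, sleeping as needed to stay under the RPM cap."""
    results, last_sent = [], float("-inf")
    for chunk in chunks:
        wait = min_interval - (time.monotonic() - last_sent)
        if wait > 0:
            time.sleep(wait)
        last_sent = time.monotonic()
        results.append(call(chunk))
    return results
```

Selecting a 5-page window first, then pacing the per-chunk calls, bounds both total volume and burst rate.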
```mermaid
graph TD
    User[Financial Analyst] -->|Upload 10-K PDF| UI[Streamlit Interface]
    User -->|Define Schema| UI
    subgraph Extraction Engine
        UI -->|Raw Text| Preprocessor[PyPDF Chunker]
        Preprocessor -->|Context Blocks| LangExtract[LangExtract Library]
        LangExtract -->|Inference Req| Gemini[Gemini 2.5 Flash]
        Gemini -->|Structured Data| LangExtract
    end
    LangExtract -->|HTML Generation| Visualizer[Interactive Highlighter]
    Visualizer -->|Iframe Render| UI
```
- Frontend (Streamlit): Handles file ingestion and renders the output within a secure container.
- Intelligence (Gemini 2.5 Flash): Selected for its 1M token context window, allowing it to hold large document sections in working memory without RAG retrieval loss.
- Grounding (LangExtract): Maps the LLM's JSON output back to the original PDF text spans using fuzzy matching algorithms.
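To illustrate the grounding idea, here is a simplified stand-in built on Python's stdlib `difflib` (this is not LangExtract's actual algorithm, just a sketch of fuzzy span mapping under that assumption):

```python
from difflib import SequenceMatcher

def locate_span(source, snippet, min_ratio=0.85):
    """Fuzzily locate `snippet` in `source`; return (start, end) offsets or None."""
    matcher = SequenceMatcher(None, source, snippet, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(snippet))
    if match.size == 0:
        return None
    # Align a candidate window of snippet length around the longest common run.
    start = max(match.a - match.b, 0)
    end = min(start + len(snippet), len(source))
    candidate = source[start:end]
    ratio = SequenceMatcher(None, candidate, snippet).ratio()
    return (start, end) if ratio >= min_ratio else None
```

The returned character offsets are what an interactive highlighter needs to paint the matching span in the source text.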
This is a Proof of Concept (PoC) built with production-grade practices, subject to specific constraints:
- Text-Only Extraction: The current pipeline extracts text. It does not parse charts, graphs, or tables (requires multimodal vision upgrade).
- API Quotas: The Live Demo operates on the Google Gemini Free Tier. Heavy usage may trigger temporary cooldowns (429 Errors).
- Session State: For privacy, all files are processed in ephemeral memory and discarded immediately after the session ends. No data is persisted.
- Python 3.10+
- Google Gemini API Key
```shell
# 1. Clone the repository
git clone https://github.com/eatosin/DocuTrace-AI-Auditor.git
cd DocuTrace-AI-Auditor

# 2. Install dependencies (requires Git installed)
pip install -r requirements.txt

# 3. Configure Environment
export GEMINI_API_KEY="your_key_here"

# 4. Run the App
streamlit run app.py
```

Owadokun Tosin Tobi
Senior AI Engineer | Specialist in MLOps & LLM Evaluation
- Portfolio: ReasonBench, Sentinel
- Connect: LinkedIn
Built with Python, Google Cloud AI, and Engineering Rigor.
