A FastAPI service for document processing and workflow automation in accounting.
This service provides a FastAPI backend to:
- Upload documents (PDF, JPG, PNG)
- Classify documents using a hybrid AI approach
- Extract structured data using a vision-language model (VLM)
- Manage document intake workflows with checklists
Built with FastAPI, Docker, and Hugging Face Transformers.
The goal of this project is to build a core component of accounting workflows: document understanding. The service implements a complete pipeline that ingests, classifies, and extracts data from common accounting-related documents, automating a traditionally manual process.
The application follows a structured workflow designed to mirror a real-world accounting process:
- Create Client: A client is registered with the system. Each client has a `complexity` level (`simple`, `average`, `complex`) which determines the number of documents expected for a given tax year.
- Create Intake: An `Intake` is created for a client for a specific `fiscal_year`. This acts as a container for all documents related to that work session. Upon creation, a checklist is automatically generated based on the client's complexity.
- Upload Documents: Documents (PDF, JPG, PNG) are uploaded to a specific intake. The system calculates a SHA-256 hash to prevent duplicate files within the same intake.
- Classify Documents: All unclassified documents in an intake are processed by the hybrid classification engine, which determines the `doc_kind` (e.g., `T4`, `id`, `receipt`).
- Extract Data: Once classified, the vision-language model extracts key fields from the documents (e.g., `employer_name` from a T4, `total_amount` from a receipt).
- Update Checklist: As data is successfully extracted from documents, the corresponding items on the intake's checklist are marked as `received`. Once all required documents are received, the intake's status is automatically set to `done`.
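The per-intake duplicate check can be sketched with Python's standard `hashlib` (the function and variable names here are illustrative, not the service's actual code):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of an uploaded file's bytes."""
    return hashlib.sha256(data).hexdigest()

def is_duplicate(data: bytes, existing_hashes: set[str]) -> bool:
    """True if a byte-identical file was already uploaded to this intake."""
    return sha256_of(data) in existing_hashes

# Identical bytes always produce the same digest, so a re-upload is caught
# even if the filename differs.
seen = {sha256_of(b"%PDF-1.4 fake bytes")}
```

Hashing content rather than filenames means renaming a file does not bypass the check.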
- Hybrid Document Classification: Combines filename analysis, regex pattern matching, and a BART-based zero-shot classification model (`facebook/bart-large-mnli`) for robust and explainable document identification.
- VLM Data Extraction: Utilizes the Qwen3-VL-4B vision-language model to extract structured data directly from document images.
- GPU Acceleration: Supports NVIDIA GPUs for both classification and extraction, significantly speeding up processing times.
- Efficient Memory Management: Dynamically loads and unloads AI models to prevent GPU out-of-memory errors when processing multiple large documents.
- PDF & Image Support: Automatically converts PDFs to images for VLM processing.
- RESTful API: A clean, modern API built with FastAPI, including automatically generated documentation.
- Modern Web GUI: A simple, intuitive web interface for interacting with the entire workflow, from client creation to data extraction.
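The regex stage of the hybrid classifier can be sketched as follows; the patterns, and the rule that a regex hit takes precedence over the zero-shot model, are illustrative assumptions rather than the service's actual rules:

```python
import re

# Illustrative patterns per document kind; a real rule set would be richer.
PATTERNS = {
    "T4": re.compile(r"statement of remuneration|employment income|box 14", re.I),
    "receipt": re.compile(r"\bsubtotal\b|\btotal\b|\btax\b", re.I),
    "id": re.compile(r"driver'?s licen[cs]e|date of birth|passport", re.I),
}

def classify_by_regex(ocr_text: str) -> str:
    """Return the first doc_kind whose pattern matches the OCR text, else 'unknown'."""
    for kind, pattern in PATTERNS.items():
        if pattern.search(ocr_text):
            return kind
    return "unknown"
```

In the full pipeline, a confident match here can short-circuit the heavier zero-shot model, so only ambiguous documents pay the model's cost.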
The service includes several production-ready optimizations and can be easily deployed on cloud infrastructure:
- Performance Optimizations: Intelligent model caching reduces processing time by 12x, automatic image resizing handles large files efficiently, and optimized token limits ensure complete data extraction
- Production Architecture: Docker containerization with persistent volumes, built-in health checks, and PostgreSQL database with automatic connection management
- Cloud Deployment Ready: Stateless design enables horizontal scaling, native GPU support for accelerated processing, and detailed logging for monitoring
- Developer Experience: Hot reload development, comprehensive error handling, and clean modular architecture with full type safety
- Scalable Design: Microservice architecture supports easy service extraction, API-first design enables external integrations, and plugin architecture allows adding new document types
```
┌─────────────────────────────────────────────────────────┐
│                     WORKFLOW STAGES                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. CLASSIFICATION (Hybrid: Regex + BART on GPU)        │
│     ├─ Load BART model (~1.5 GB RAM)                    │
│     ├─ Classify all documents                           │
│     └─ **UNLOAD MODEL**                                 │
│                                                         │
│  2. GPU MEMORY CLEARED                                  │
│     └─ torch.cuda.empty_cache()                         │
│                                                         │
│  3. EXTRACTION (Qwen3-VL on GPU)                        │
│     ├─ Load Qwen3-VL-4B (~10 GB VRAM)                   │
│     ├─ Convert PDF→Image if needed                      │
│     ├─ Extract data from each document                  │
│     └─ Unload model between documents if needed         │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
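The load/use/unload discipline in the diagram can be expressed as a context manager. This is a dependency-free sketch: `load_model` stands in for any loader callable, and the cleanup line marks where the real service would drop the model and call `torch.cuda.empty_cache()`:

```python
from contextlib import contextmanager

@contextmanager
def model_session(load_model, cleanup_events):
    """Load a model for one workflow stage and guarantee it is released.

    In the real pipeline, the finally-block would also call
    torch.cuda.empty_cache() so the next stage starts with free VRAM.
    """
    model = load_model()
    try:
        yield model
    finally:
        del model                          # drop the reference so it can be freed
        cleanup_events.append("unloaded")  # stand-in for the CUDA cache flush

# Classification loads, runs, and is unloaded before extraction begins.
events = []
with model_session(lambda: {"name": "zero-shot-classifier"}, events) as clf:
    label = clf["name"]
```

The `finally` block guarantees the model is released even if a document raises mid-stage, which is what prevents the out-of-memory errors described above.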
This project is designed to be run with Docker, which simplifies dependency management. The following are required on your host machine:
- Docker & Docker Compose
  - Purpose: To build and run the application services (FastAPI app, PostgreSQL database) in isolated containers.
  - Setup: Follow the official installation instructions for your operating system.
- NVIDIA GPU & NVIDIA Container Toolkit
  - Purpose: To provide GPU acceleration for the AI models (both classification and extraction). While the application can run on CPU, performance will be significantly slower (e.g., extractions may take minutes instead of seconds).
  - GPU: An NVIDIA GPU with at least 12 GB of VRAM is recommended to run the 4B vision model.
  - Setup: You must install the NVIDIA Container Toolkit to allow Docker containers to access the GPU.
- Make
  - Purpose: The `Makefile` provides convenient shortcuts for common commands like `make up` and `make down`.
  - Setup: This is typically pre-installed on Linux and macOS. On Windows, you can use `docker-compose` commands directly.
You do not need to install these on your local machine if you are using Docker, as they are provisioned automatically within the container:
- Tesseract OCR: An OCR engine used to extract plain text from documents. This text is then used by the classification models (Regex and BERT).
- Poppler Utilities: Provides PDF rendering capabilities, allowing the application to convert PDF files into images for processing by the vision models.
- Python 3.11: The core programming language for the application.
- PyTorch & Transformers: The primary libraries for running the AI models.
1. Clone the repository.

2. Start the services using Docker Compose:

   ```bash
   make up
   ```

   This command will build the Docker image (which may take a few minutes on the first run as it downloads dependencies and AI models) and start the application.

3. Access the services:

   - Web GUI: http://localhost:8000/
   - API Docs (Swagger): http://localhost:8000/docs
Use the web GUI at http://localhost:8000/ to follow the step-by-step workflow:
- Create a Client: Define a new client and their complexity level.
- Create an Intake: Start a new document intake process for a client.
- Upload Documents: Upload PDFs, JPGs, or PNGs.
- Process Documents:
  - Classify: Run the hybrid classification to identify document types.
  - Extract: Run VLM data extraction on the classified documents.
- View Checklist: Check the status of the intake process.
The application uses a set of Pydantic and SQLAlchemy models to represent the core business logic:
- `Client`: Represents an individual whose documents are being processed.
  - `name`: string
  - `email`: string
  - `complexity`: enum (`simple` | `average` | `complex`)
- `Intake`: Represents a work session for a client for a specific fiscal year.
  - `client_id`: foreign key to `Client`
  - `fiscal_year`: int
  - `status`: enum (`open` | `done`)
- `Document`: Represents a file that has been uploaded.
  - `intake_id`: foreign key to `Intake`
  - `filename`: string
  - `sha256`: string (for duplicate detection)
  - `doc_kind`: enum (`T4` | `receipt` | `id` | `unknown`)
- `ChecklistItem`: Represents a required document for an intake.
  - `intake_id`: foreign key to `Intake`
  - `doc_kind`: enum (the required document type)
  - `status`: enum (`missing` | `received`)
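As a rough, stdlib-only sketch of these shapes (the real service uses SQLAlchemy and Pydantic; this is not its actual code):

```python
from dataclasses import dataclass
from enum import Enum

class Complexity(str, Enum):
    simple = "simple"
    average = "average"
    complex = "complex"

class DocKind(str, Enum):
    T4 = "T4"
    receipt = "receipt"
    id = "id"
    unknown = "unknown"

@dataclass
class Client:
    name: str
    email: str
    complexity: Complexity

@dataclass
class ChecklistItem:
    intake_id: int
    doc_kind: DocKind
    status: str = "missing"  # flips to "received" after extraction succeeds

client = Client("John Doe", "john.doe@example.com", Complexity.simple)
```

`Intake` and `Document` follow the same pattern, carrying the foreign keys listed above.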
The service exposes a RESTful API for interacting with the workflow.
- `POST /clients`: Creates a new client.
- `POST /intakes`: Creates a new intake for a client and initializes its checklist.
- `POST /intakes/{id}/documents`: Uploads a document to a specific intake.
- `POST /documents/{document_id}/classify`: Classifies a single document.
- `POST /intakes/{id}/classify`: Classifies all unclassified documents in an intake.
- `POST /documents/{document_id}/extract`: Extracts data from a single, classified document.
- `POST /intakes/{id}/extract`: Extracts data from all classified documents in an intake.
- `GET /intakes/{id}/checklist`: Retrieves the current status of an intake's checklist.
- `GET /documents/{document_id}/raw_text`: Retrieves the raw OCR text from a document.
Here is a complete workflow example using curl commands.
First, create a client. The API will return the client's details, including the id you'll need for the next step.
```bash
curl -X POST "http://localhost:8000/clients" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "John Doe",
    "email": "john.doe@example.com",
    "complexity": "simple"
  }'
```

Using the `client_id` from the previous step, create an intake for a specific fiscal year.
```bash
# Replace {client_id} with the actual ID from step 1
curl -X POST "http://localhost:8000/intakes" \
  -H "Content-Type: application/json" \
  -d '{
    "client_id": 1,
    "fiscal_year": 2024
  }'
```

Upload the necessary documents to the intake. You can upload multiple files.
```bash
# Replace {intake_id} with the actual ID from step 2
curl -X POST "http://localhost:8000/intakes/1/documents" \
  -F "file=@sample_docs/T4_sample.pdf"

curl -X POST "http://localhost:8000/intakes/1/documents" \
  -F "file=@sample_docs/drivers_license.jpg"
```

Run the hybrid classification process on all unclassified documents in the intake. This applies the regex rules and the BART-based zero-shot model (on GPU when available, otherwise CPU).
```bash
# Replace {intake_id} with the actual ID from step 2
curl -X POST "http://localhost:8000/intakes/1/classify"
```

Run the data extraction process. This will load the Qwen3-VL model on the GPU and extract structured data from the classified documents.
```bash
# Replace {intake_id} with the actual ID from step 2
curl -X POST "http://localhost:8000/intakes/1/extract"
```

Finally, check the intake's checklist to see the final status.
```bash
# Replace {intake_id} with the actual ID from step 2
curl -X GET "http://localhost:8000/intakes/1/checklist"
```

- Backend: FastAPI, Python 3.11
- AI Models: `facebook/bart-large-mnli` for classification; `Qwen/Qwen3-VL-4B-Instruct-FP8` for extraction
- AI Frameworks: Hugging Face Transformers, PyTorch
- OCR: Tesseract
- Database: PostgreSQL
- Containerization: Docker, Docker Compose
- GUI: HTML, CSS, JavaScript (no frameworks)
- GPU Support: CUDA, NVIDIA Container Toolkit
The project includes a test suite to verify the core workflow. The tests cover:
- A full pipeline test: `upload -> classify -> extract -> checklist update`.
- Duplicate upload detection to ensure the same file cannot be uploaded twice to the same intake.
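The checklist-update behavior that the pipeline test verifies can be sketched as follows (a stdlib-only stand-in; the function name and dict-based checklist are illustrative, not the service's actual code):

```python
from enum import Enum

class Status(str, Enum):
    missing = "missing"
    received = "received"

def update_checklist(checklist: dict[str, Status], extracted_kind: str) -> str:
    """Mark the matching checklist item received, then return the intake status.

    Mirrors the described behavior: once every required item is received,
    the intake's status flips from 'open' to 'done'.
    """
    if extracted_kind in checklist:
        checklist[extracted_kind] = Status.received
    all_received = all(v is Status.received for v in checklist.values())
    return "done" if all_received else "open"

# A 'simple' client might require just a T4 and an ID document.
items = {"T4": Status.missing, "id": Status.missing}
status_after_t4 = update_checklist(items, "T4")  # -> "open"
status_after_id = update_checklist(items, "id")  # -> "done"
```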
To run the tests, use the following make command:
```bash
make test
```

With more time, here are a few improvements I would consider:
- Asynchronous Processing: For classification and extraction, which can be long-running tasks, I would move them to a background worker queue (like Celery with Redis) to prevent blocking the API and provide a better user experience.
- Enhanced VLM Extraction: Fine-tune a smaller, more specialized vision model for each document type to improve extraction accuracy and speed.
- Better Configuration Management: Externalize settings (like model names, thresholds, and weights) into a configuration file or environment variables instead of hardcoding them.
- More Sophisticated GUI: Develop the front-end into a full single-page application (SPA) with a framework like React or Vue.js for a more dynamic and interactive experience.
- Expanded Test Coverage: Add more unit tests for individual services and components, in addition to the existing integration tests.
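For the configuration point above, a minimal environment-variable sketch (the variable names and the `0.5` threshold default are assumptions for illustration; today these values are hardcoded):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Settings resolved from the environment at import time, with defaults."""
    classifier_model: str = os.getenv("CLASSIFIER_MODEL", "facebook/bart-large-mnli")
    vlm_model: str = os.getenv("VLM_MODEL", "Qwen/Qwen3-VL-4B-Instruct-FP8")
    classify_threshold: float = float(os.getenv("CLASSIFY_THRESHOLD", "0.5"))

settings = Settings()
```

This keeps model swaps and threshold tuning out of code changes; `docker-compose` can then set the variables per environment.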