A FastAPI service for document processing and workflow automation in accounting.
This service provides a FastAPI backend to:
- Upload documents (PDF, JPG, PNG)
- Classify documents using a hybrid AI approach
- Extract structured data using a vision-language model (VLM)
- Manage document intake workflows with checklists
Built with FastAPI, Docker, and Hugging Face Transformers.
The goal of this project is to build a core component of accounting workflows: document understanding. The service implements a complete pipeline that ingests, classifies, and extracts data from common accounting-related documents, automating a traditionally manual process.
The application follows a structured workflow designed to mirror a real-world accounting process:
- Create Client: A client is registered with the system. Each client has a `complexity` level (`simple`, `average`, `complex`) which determines the number of documents expected for a given tax year.
- Create Intake: An `Intake` is created for a client for a specific `fiscal_year`. This acts as a container for all documents related to that work session. Upon creation, a checklist is automatically generated based on the client's complexity.
- Upload Documents: Documents (PDF, JPG, PNG) are uploaded to a specific intake. The system calculates a SHA-256 hash to prevent duplicate files within the same intake.
- Classify Documents: All unclassified documents in an intake are processed by the hybrid classification engine, which determines the `doc_kind` (e.g., `T4`, `id`, `receipt`).
- Extract Data: Once classified, the vision-language model extracts key fields from the documents (e.g., `employer_name` from a T4, `total_amount` from a receipt).
- Update Checklist: As data is successfully extracted from documents, the corresponding items on the intake's checklist are marked as `received`. Once all required documents are received, the intake's status is automatically set to `done`.
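The per-intake duplicate check can be sketched with Python's standard `hashlib` (the function and variable names here are illustrative, not the service's actual code):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of an uploaded file's bytes."""
    return hashlib.sha256(data).hexdigest()

def is_duplicate(data: bytes, existing_hashes: set[str]) -> bool:
    """True if a byte-identical file was already uploaded to this intake."""
    return sha256_of(data) in existing_hashes

# Identical bytes always produce the same digest, so a re-upload is caught
# even if the filename differs.
seen = {sha256_of(b"%PDF-1.4 fake bytes")}
```

Hashing content rather than filenames means renaming a file does not bypass the check.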
- Hybrid Document Classification: Combines filename analysis, regex pattern matching, and a BART-based zero-shot classification model (`facebook/bart-large-mnli`) for robust and explainable document identification.
- VLM Data Extraction: Utilizes the Qwen3-VL-4B vision-language model to extract structured data directly from document images.
- GPU Acceleration: Supports NVIDIA GPUs for both classification and extraction, significantly speeding up processing times.
- Efficient Memory Management: Dynamically loads and unloads AI models to prevent GPU out-of-memory errors when processing multiple large documents.
- PDF & Image Support: Automatically converts PDFs to images for VLM processing.
- RESTful API: A clean, modern API built with FastAPI, including automatically generated documentation.
- Modern Web GUI: A simple, intuitive web interface for interacting with the entire workflow, from client creation to data extraction.
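The regex stage of the hybrid classifier can be sketched as follows; the patterns, and the rule that a regex hit takes precedence over the zero-shot model, are illustrative assumptions rather than the service's actual rules:

```python
import re

# Illustrative patterns per document kind; a real rule set would be richer.
PATTERNS = {
    "T4": re.compile(r"statement of remuneration|employment income|box 14", re.I),
    "receipt": re.compile(r"\bsubtotal\b|\btotal\b|\btax\b", re.I),
    "id": re.compile(r"driver'?s licen[cs]e|date of birth|passport", re.I),
}

def classify_by_regex(ocr_text: str) -> str:
    """Return the first doc_kind whose pattern matches the OCR text, else 'unknown'."""
    for kind, pattern in PATTERNS.items():
        if pattern.search(ocr_text):
            return kind
    return "unknown"
```

In the full pipeline, a confident match here can short-circuit the heavier zero-shot model, so only ambiguous documents pay the model's cost.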
The service includes several production-ready optimizations and can be easily deployed on cloud infrastructure:
- Performance Optimizations: Intelligent model caching reduces processing time by 12x, automatic image resizing handles large files efficiently, and optimized token limits ensure complete data extraction
- Production Architecture: Docker containerization with persistent volumes, built-in health checks, and PostgreSQL database with automatic connection management
- Cloud Deployment Ready: Stateless design enables horizontal scaling, native GPU support for accelerated processing, and detailed logging for monitoring
- Developer Experience: Hot reload development, comprehensive error handling, and clean modular architecture with full type safety
- Scalable Design: Microservice architecture supports easy service extraction, API-first design enables external integrations, and plugin architecture allows adding new document types
```
┌─────────────────────────────────────────────────────────┐
│                     WORKFLOW STAGES                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. CLASSIFICATION (Hybrid: Regex + BART on GPU)        │
│     ├─ Load BART model (~1.5 GB RAM)                    │
│     ├─ Classify all documents                           │
│     └─ **UNLOAD MODEL**                                 │
│                                                         │
│  2. GPU MEMORY CLEARED                                  │
│     └─ torch.cuda.empty_cache()                         │
│                                                         │
│  3. EXTRACTION (Qwen3-VL on GPU)                        │
│     ├─ Load Qwen3-VL-4B (~10 GB VRAM)                   │
│     ├─ Convert PDF→Image if needed                      │
│     ├─ Extract data from each document                  │
│     └─ Unload model between documents if needed         │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
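The load/use/unload discipline in the diagram can be expressed as a context manager. This is a dependency-free sketch: `load_model` stands in for any loader callable, and the cleanup line marks where the real service would drop the model and call `torch.cuda.empty_cache()`:

```python
from contextlib import contextmanager

@contextmanager
def model_session(load_model, cleanup_events):
    """Load a model for one workflow stage and guarantee it is released.

    In the real pipeline, the finally-block would also call
    torch.cuda.empty_cache() so the next stage starts with free VRAM.
    """
    model = load_model()
    try:
        yield model
    finally:
        del model                          # drop the reference so it can be freed
        cleanup_events.append("unloaded")  # stand-in for the CUDA cache flush

# Classification loads, runs, and is unloaded before extraction begins.
events = []
with model_session(lambda: {"name": "zero-shot-classifier"}, events) as clf:
    label = clf["name"]
```

The `finally` block guarantees the model is released even if a document raises mid-stage, which is what prevents the out-of-memory errors described above.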
This project is designed to be run with Docker, which simplifies dependency management. The following are required on your host machine:
- Docker & Docker Compose
  - Purpose: To build and run the application services (FastAPI app, PostgreSQL database) in isolated containers.
  - Setup: Follow the official installation instructions for your operating system.
- NVIDIA GPU & NVIDIA Container Toolkit
  - Purpose: To provide GPU acceleration for the AI models (both classification and extraction). While the application can run on CPU, performance will be significantly slower (e.g., extractions may take minutes instead of seconds).
  - GPU: An NVIDIA GPU with at least 12 GB of VRAM is recommended to run the 4B vision model.
  - Setup: You must install the NVIDIA Container Toolkit to allow Docker containers to access the GPU.
- Make
  - Purpose: The `Makefile` provides convenient shortcuts for common commands like `make up` and `make down`.
  - Setup: This is typically pre-installed on Linux and macOS. On Windows, you can use `docker-compose` commands directly.
You do not need to install these on your local machine if you are using Docker, as they are provisioned automatically within the container:
- Tesseract OCR: An OCR engine used to extract plain text from documents. This text is then used by the classification models (Regex and BERT).
- Poppler Utilities: Provides PDF rendering capabilities, allowing the application to convert PDF files into images for processing by the vision models.
- Python 3.11: The core programming language for the application.
- PyTorch & Transformers: The primary libraries for running the AI models.
1. Clone the repository.

2. Start the services using Docker Compose:

   ```bash
   make up
   ```

   This command will build the Docker image (which may take a few minutes on the first run as it downloads dependencies and AI models) and start the application.

3. Access the services:

   - Web GUI: http://localhost:8000/
   - API Docs (Swagger): http://localhost:8000/docs
Use the web GUI at http://localhost:8000/ to follow the step-by-step workflow:
- Create a Client: Define a new client and their complexity level.
- Create an Intake: Start a new document intake process for a client.
- Upload Documents: Upload PDFs, JPGs, or PNGs.
- Process Documents:
  - Classify: Run the hybrid classification to identify document types.
  - Extract: Run VLM data extraction on the classified documents.
- View Checklist: Check the status of the intake process.
The application uses a set of Pydantic and SQLAlchemy models to represent the core business logic:
- `Client`: Represents an individual whose documents are being processed.
  - `name`: string
  - `email`: string
  - `complexity`: enum (`simple` | `average` | `complex`)
- `Intake`: Represents a work session for a client for a specific fiscal year.
  - `client_id`: foreign key to `Client`
  - `fiscal_year`: int
  - `status`: enum (`open` | `done`)
- `Document`: Represents a file that has been uploaded.
  - `intake_id`: foreign key to `Intake`
  - `filename`: string
  - `sha256`: string (for duplicate detection)
  - `doc_kind`: enum (`T4` | `receipt` | `id` | `unknown`)
- `ChecklistItem`: Represents a required document for an intake.
  - `intake_id`: foreign key to `Intake`
  - `doc_kind`: enum (the required document type)
  - `status`: enum (`missing` | `received`)
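As a rough, stdlib-only sketch of these shapes (the real service uses SQLAlchemy and Pydantic; this is not its actual code):

```python
from dataclasses import dataclass
from enum import Enum

class Complexity(str, Enum):
    simple = "simple"
    average = "average"
    complex = "complex"

class DocKind(str, Enum):
    T4 = "T4"
    receipt = "receipt"
    id = "id"
    unknown = "unknown"

@dataclass
class Client:
    name: str
    email: str
    complexity: Complexity

@dataclass
class ChecklistItem:
    intake_id: int
    doc_kind: DocKind
    status: str = "missing"  # flips to "received" after extraction succeeds

client = Client("John Doe", "john.doe@example.com", Complexity.simple)
```

`Intake` and `Document` follow the same pattern, carrying the foreign keys listed above.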
The service exposes a RESTful API for interacting with the workflow.
- `POST /clients`: Creates a new client.
- `POST /intakes`: Creates a new intake for a client and initializes its checklist.
- `POST /intakes/{id}/documents`: Uploads a document to a specific intake.
- `POST /documents/{document_id}/classify`: Classifies a single document.
- `POST /intakes/{id}/classify`: Classifies all unclassified documents in an intake.
- `POST /documents/{document_id}/extract`: Extracts data from a single, classified document.
- `POST /intakes/{id}/extract`: Extracts data from all classified documents in an intake.
- `GET /intakes/{id}/checklist`: Retrieves the current status of an intake's checklist.
- `GET /documents/{document_id}/raw_text`: Retrieves the raw OCR text from a document.
Here is a complete workflow example using curl commands.
First, create a client. The API will return the client's details, including the id you'll need for the next step.
```bash
curl -X POST "http://localhost:8000/clients" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "John Doe",
    "email": "john.doe@example.com",
    "complexity": "simple"
  }'
```

Using the `client_id` from the previous step, create an intake for a specific fiscal year.
```bash
# Replace {client_id} with the actual ID from step 1
curl -X POST "http://localhost:8000/intakes" \
  -H "Content-Type: application/json" \
  -d '{
    "client_id": 1,
    "fiscal_year": 2024
  }'
```

Upload the necessary documents to the intake. You can upload multiple files.
```bash
# Replace {intake_id} with the actual ID from step 2
curl -X POST "http://localhost:8000/intakes/1/documents" \
  -F "file=@sample_docs/T4_sample.pdf"

curl -X POST "http://localhost:8000/intakes/1/documents" \
  -F "file=@sample_docs/drivers_license.jpg"
```

Run the hybrid classification process on all unclassified documents in the intake. This applies the regex rules and the BART-based zero-shot model (on GPU when available, otherwise CPU).
```bash
# Replace {intake_id} with the actual ID from step 2
curl -X POST "http://localhost:8000/intakes/1/classify"
```

Run the data extraction process. This will load the Qwen3-VL model on the GPU and extract structured data from the classified documents.
```bash
# Replace {intake_id} with the actual ID from step 2
curl -X POST "http://localhost:8000/intakes/1/extract"
```

Finally, check the intake's checklist to see the final status.
```bash
# Replace {intake_id} with the actual ID from step 2
curl -X GET "http://localhost:8000/intakes/1/checklist"
```

- Backend: FastAPI, Python 3.11
- AI Models: `facebook/bart-large-mnli` for classification; `Qwen/Qwen3-VL-4B-Instruct-FP8` for extraction
- AI Frameworks: Hugging Face Transformers, PyTorch
- OCR: Tesseract
- Database: PostgreSQL
- Containerization: Docker, Docker Compose
- GUI: HTML, CSS, JavaScript (no frameworks)
- GPU Support: CUDA, NVIDIA Container Toolkit
The project includes a test suite to verify the core workflow. The tests cover:
- A full pipeline test: `upload -> classify -> extract -> checklist update`.
- Duplicate upload detection to ensure the same file cannot be uploaded twice to the same intake.
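The checklist-update behavior that the pipeline test verifies can be sketched as follows (a stdlib-only stand-in; the function name and dict-based checklist are illustrative, not the service's actual code):

```python
from enum import Enum

class Status(str, Enum):
    missing = "missing"
    received = "received"

def update_checklist(checklist: dict[str, Status], extracted_kind: str) -> str:
    """Mark the matching checklist item received, then return the intake status.

    Mirrors the described behavior: once every required item is received,
    the intake's status flips from 'open' to 'done'.
    """
    if extracted_kind in checklist:
        checklist[extracted_kind] = Status.received
    all_received = all(v is Status.received for v in checklist.values())
    return "done" if all_received else "open"

# A 'simple' client might require just a T4 and an ID document.
items = {"T4": Status.missing, "id": Status.missing}
status_after_t4 = update_checklist(items, "T4")  # -> "open"
status_after_id = update_checklist(items, "id")  # -> "done"
```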
To run the tests, use the following make command:
```bash
make test
```

With more time, here are a few improvements I would consider:
- Asynchronous Processing: For classification and extraction, which can be long-running tasks, I would move them to a background worker queue (like Celery with Redis) to prevent blocking the API and provide a better user experience.
- Enhanced VLM Extraction: Fine-tune a smaller, more specialized vision model for each document type to improve extraction accuracy and speed.
- Better Configuration Management: Externalize settings (like model names, thresholds, and weights) into a configuration file or environment variables instead of hardcoding them.
- More Sophisticated GUI: Develop the front-end into a full single-page application (SPA) with a framework like React or Vue.js for a more dynamic and interactive experience.
- Expanded Test Coverage: Add more unit tests for individual services and components, in addition to the existing integration tests.
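For the configuration point above, a minimal environment-variable sketch (the variable names and the `0.5` threshold default are assumptions for illustration; today these values are hardcoded):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Settings resolved from the environment at import time, with defaults."""
    classifier_model: str = os.getenv("CLASSIFIER_MODEL", "facebook/bart-large-mnli")
    vlm_model: str = os.getenv("VLM_MODEL", "Qwen/Qwen3-VL-4B-Instruct-FP8")
    classify_threshold: float = float(os.getenv("CLASSIFY_THRESHOLD", "0.5"))

settings = Settings()
```

This keeps model swaps and threshold tuning out of code changes; `docker-compose` can then set the variables per environment.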