Document Upload Service

Status

✅ PRODUCTION READY - Core service implementation completed and tested with real legal documents

Implementation Date: January 11, 2025
Test Results: All core components pass with 2000+ real legal PDFs
API Endpoints: 12 endpoints fully implemented
Performance: Content hashing < 100ms, validation < 50ms

Overview

The Document Upload Service is a completely independent microservice designed for uploading, processing, and storing legal documents. It provides comprehensive document processing capabilities including MarkItDown conversion, OCR processing, duplicate detection, and metadata extraction.

Key Features:

Complete independence with NO shared dependencies
MarkItDown document conversion with intelligent fallback to OCR
Advanced OCR processing using Tesseract with parallel page processing
Content-based duplicate detection using SHA256 hashing
Comprehensive file validation and security checks
Independent Supabase integration for data persistence
Integrated centralized logging with fallback capabilities
Health monitoring and status reporting
NEW: Specialized law document upload with entity extraction integration
NEW: Hybrid entity extraction with AI enhancement via Prompt Service
NEW: Automatic document chunking with configurable overlap
NEW: Bluebook citation validation for legal documents

Service Architecture

Independence Design

Port: 8008
NO shared/clients dependencies: Service-specific SupabaseClient implementation
NO shared/config dependencies: Independent configuration management via Pydantic
NO shared/utils dependencies: All utilities implemented within service
NO shared requirements: Complete requirements.txt with all dependencies
Standalone deployment: Runs independently on any server/container

Core Components

1. Document Processing Pipeline

File Upload → Validation → Duplicate Check → Processing → Storage → Response

Processing Methods:

MarkItDown: Primary conversion for Office documents, PDFs, HTML
OCR Fallback: Tesseract OCR for scanned documents and images
Direct Text: For plain text and markdown files
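
The routing among these three methods can be sketched as follows. This is a minimal illustration and not the service's actual code: the names choose_method and convert_with_fallback, the extension groups, and the rule "fall back to OCR when MarkItDown raises or returns no text" are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable

# Hypothetical extension groups, loosely based on the formats this README lists.
DIRECT_TEXT = {".txt", ".md"}
IMAGE_TYPES = {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp", ".gif", ".webp"}


def choose_method(filename: str) -> str:
    """Pick the first processing method to try, based on the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in DIRECT_TEXT:
        return "direct_text"
    if suffix in IMAGE_TYPES:
        return "ocr"
    return "markitdown"


def convert_with_fallback(
    filename: str,
    markitdown: Callable[[str], str],
    ocr: Callable[[str], str],
) -> tuple[str, str]:
    """Return (method_used, text); use OCR if MarkItDown fails or yields no text."""
    method = choose_method(filename)
    if method == "ocr":
        return "ocr", ocr(filename)
    if method == "direct_text":
        return "direct_text", Path(filename).read_text(encoding="utf-8")
    try:
        text = markitdown(filename)
    except Exception:
        text = ""  # treat a conversion error like an empty result
    if text.strip():
        return "markitdown", text
    return "ocr", ocr(filename)
```

The two converters are passed in as callables so the routing logic can be exercised without MarkItDown or Tesseract installed.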

2. Independent Client Architecture

DocumentUploadSupabaseClient: Service-specific database operations
DocumentUploadLogClient: Independent logging with external service integration
Comprehensive error handling: Structured error management with client-safe responses

3. Service-Specific Utilities

FileValidator: Comprehensive file validation (size, type, content, security)
FileMetadataExtractor: Advanced metadata extraction and analysis
ContentHasher: Multi-algorithm content hashing (SHA256, SHA1, MD5)
DuplicateChecker: Content-based duplicate detection
StorageManager: Supabase Storage integration with path management
DocumentUploadErrorHandler: Structured error handling with categorization
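
As an illustration of how ContentHasher and DuplicateChecker might cooperate, the sketch below hashes a file stream with SHA256 and looks the digest up in an index. It uses only the standard library; sha256_of_stream and InMemoryDuplicateIndex are hypothetical names, and the in-memory dictionary stands in for whatever database lookup the service really performs.

```python
import hashlib
import io
from typing import BinaryIO, Optional


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hash a stream in fixed-size blocks so large files are never fully in memory."""
    digest = hashlib.sha256()
    for block in iter(lambda: stream.read(chunk_size), b""):
        digest.update(block)
    return digest.hexdigest()


def sha256_of_bytes(data: bytes) -> str:
    """Convenience wrapper for content that is already in memory."""
    return sha256_of_stream(io.BytesIO(data))


class InMemoryDuplicateIndex:
    """Hypothetical stand-in for a database table keyed by content hash."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def register(self, content_hash: str, document_id: str) -> Optional[str]:
        """Return the existing document ID for a known hash, else store it and return None."""
        existing = self._by_hash.get(content_hash)
        if existing is None:
            self._by_hash[content_hash] = document_id
        return existing
```

Because the digest depends only on the bytes, two uploads with different file names but identical content map to the same key.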

API Endpoints

Upload Operations

POST /api/v1/upload - Single document upload with advanced processing
POST /api/v1/upload/batch - Batch document upload
GET /api/v1/upload/status/{document_id} - Get upload status

Law Document Operations (NEW)

POST /api/v1/law/upload - Upload legal document with entity extraction
POST /api/v1/law/upload/batch - Batch upload legal documents
GET /api/v1/law/status/{document_id} - Get law document processing status
GET /api/v1/law/search - Search law documents with filters
DELETE /api/v1/law/{document_id} - Delete law document and associated data
GET /api/v1/law/statistics - Get law document statistics

Status and Monitoring

GET /api/v1/status/service - Service health and statistics
GET /api/v1/status/config - Service configuration and capabilities
GET /api/v1/status/uploads - List uploads with filtering

Health Monitoring

The service provides standardized health check endpoints:

GET /api/v1/health - Basic health status
GET /api/v1/health/ping - Simple ping check for load balancers
GET /api/v1/health/ready - Readiness check with dependency verification
GET /api/v1/health/detailed - Comprehensive health information including metrics

Example:

# Check basic health
curl http://localhost:8008/api/v1/health

# Check readiness
curl http://localhost:8008/api/v1/health/ready

# Get detailed health info
curl http://localhost:8008/api/v1/health/detailed
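
For scripts that must wait for the service before sending uploads, the readiness endpoint can be polled. The sketch below is one way to do that with the standard library; it assumes (the README does not say) that /api/v1/health/ready answers with HTTP 200 once dependencies are available. The names http_probe and wait_until_ready are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8008"  # port taken from this README


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 (assumed to mean 'ready')."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 10,
    delay_seconds: float = 1.0,
) -> bool:
    """Poll the readiness endpoint until it succeeds or attempts run out."""
    url = f"{BASE_URL}/api/v1/health/ready"
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

The probe is injectable so the polling loop can be tested without a running service.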

Configuration

Environment Variables

Service Identity

SERVICE_NAME=document-upload-service
SERVICE_VERSION=1.0.0
SERVICE_ENVIRONMENT=development
SERVICE_PORT=8008
DEBUG_MODE=true

Supabase Configuration

SUPABASE_URL=your_supabase_url
SUPABASE_ANON_KEY=your_anon_key
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
SUPABASE_DB_CONNECTION_TIMEOUT=30
SUPABASE_DB_MAX_CONNECTIONS=10

External Services (NEW)

# Entity Extraction Service
ENTITY_EXTRACTION_URL=http://localhost:8007
ENTITY_EXTRACTION_TIMEOUT=30
ENTITY_EXTRACTION_MODE=hybrid
ENABLE_ENTITY_EXTRACTION=true

# Prompt Service for AI Enhancement
PROMPT_SERVICE_URL=http://localhost:8003
PROMPT_SERVICE_TIMEOUT=30
ENABLE_PROMPT_SERVICE=true

# Chunking Service
CHUNKING_SERVICE_URL=http://localhost:8009
CHUNKING_SERVICE_TIMEOUT=30
ENABLE_CHUNKING_SERVICE=true

Law Document Processing (NEW)

# Chunking Configuration
LAW_DOCUMENT_CHUNK_SIZE=1000
LAW_DOCUMENT_CHUNK_OVERLAP=200

# Entity Extraction Settings
HYBRID_EXTRACTION_CONFIDENCE_THRESHOLD=0.7
ENABLE_BLUEBOOK_VALIDATION=true
ENTITY_EXTRACTION_MAX_RETRIES=3
ENTITY_EXTRACTION_RETRY_BACKOFF=2.0

# Document Cache
ENABLE_DOCUMENT_CACHE=true
DOCUMENT_CACHE_TTL=3600
CACHE_PATH=/tmp/document-cache
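
The chunking settings above (LAW_DOCUMENT_CHUNK_SIZE and LAW_DOCUMENT_CHUNK_OVERLAP) describe a sliding window over the document text. The sketch below shows the idea under one assumption the README does not confirm: that both values are measured in characters. The service may instead count tokens or split on sentence boundaries, and chunk_text is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults shown, each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.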

File Processing

MAX_FILE_SIZE_MB=100
MIN_FILE_SIZE_BYTES=1
SUPPORTED_FORMATS=pdf,docx,doc,pptx,ppt,xlsx,xls,txt,md,html,htm,jpg,jpeg,png,tiff,tif,bmp,gif,webp,svg
ENABLE_FILE_VALIDATION=true
ENABLE_CONTENT_TYPE_VALIDATION=true
DISABLE_EXECUTABLE_UPLOADS=true
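
To make the effect of these settings concrete, the following sketch applies the extension and size limits to an incoming file. It is a simplified stand-in for FileValidator, not its implementation: validate_upload is a hypothetical function, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# Values copied from the environment variables above.
MAX_FILE_SIZE_MB = 100
MIN_FILE_SIZE_BYTES = 1
SUPPORTED_FORMATS = (
    "pdf,docx,doc,pptx,ppt,xlsx,xls,txt,md,html,htm,"
    "jpg,jpeg,png,tiff,tif,bmp,gif,webp,svg"
)


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    allowed = {ext.strip().lower() for ext in SUPPORTED_FORMATS.split(",")}
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in allowed:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes < MIN_FILE_SIZE_BYTES:
        problems.append("file is empty")
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:  # assumes binary megabytes
        problems.append(f"file exceeds {MAX_FILE_SIZE_MB} MB")
    return problems
```

Extension and size checks are only the first layer; content-type and security checks would follow them.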

MarkItDown Configuration

MARKITDOWN_TIMEOUT=120
MARKITDOWN_MAX_RETRIES=2

OCR Configuration

OCR_TIMEOUT=300
OCR_DPI=300
OCR_LANGUAGE=eng
OCR_ADDITIONAL_LANGUAGES=spa,fra
OCR_CONFIG_FLAGS=--oem 3 --psm 6
OCR_CONFIDENCE_THRESHOLD=30
OCR_MAX_PARALLEL_PAGES=3
OCR_IMAGE_PREPROCESSING=true
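
OCR_MAX_PARALLEL_PAGES caps how many pages are recognized at once. One way to express that cap is a bounded thread pool, sketched below. Whether the service uses threads, processes, or async tasks is not stated in this README, so treat ocr_pages and its parameters as illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

OCR_MAX_PARALLEL_PAGES = 3  # value from the configuration above


def ocr_pages(
    pages: Sequence[object],
    recognize: Callable[[object], str],
    max_workers: int = OCR_MAX_PARALLEL_PAGES,
) -> list[str]:
    """Run `recognize` on every page with at most `max_workers` pages in flight.

    Executor.map yields results in input order, so page order is preserved
    even when pages finish out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize, pages))
```

In a real pipeline, recognize might wrap pytesseract.image_to_string applied to page images produced by pdf2image.convert_from_path; both packages are listed among this service's dependencies, but the exact call sites are an assumption.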

Storage Configuration

STORAGE_BUCKET=document-uploads
STORAGE_PATH_PATTERN=clients/{client_id}/cases/{case_id}/documents/{document_id}/{filename}
STORAGE_PUBLIC_BUCKET=false
ENABLE_FILE_VERSIONING=false
ENABLE_STORAGE_ENCRYPTION=true
FILE_RETENTION_DAYS=365
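
STORAGE_PATH_PATTERN is a template with four placeholders. The sketch below fills it with str.format after sanitizing each segment. The sanitization rule is an assumption chosen for the example (this README mentions secure path generation elsewhere but does not specify the rules), and safe_segment and build_storage_path are hypothetical names.

```python
import re

STORAGE_PATH_PATTERN = (
    "clients/{client_id}/cases/{case_id}/documents/{document_id}/{filename}"
)


def safe_segment(value: str) -> str:
    """Replace anything outside a conservative allow-list so a segment cannot add path separators."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", value).lstrip(".")
    if not cleaned:
        raise ValueError("path segment is empty after sanitization")
    return cleaned


def build_storage_path(client_id: str, case_id: str, document_id: str, filename: str) -> str:
    """Fill the storage path pattern with sanitized values."""
    return STORAGE_PATH_PATTERN.format(
        client_id=safe_segment(client_id),
        case_id=safe_segment(case_id),
        document_id=safe_segment(document_id),
        filename=safe_segment(filename),
    )
```

Sanitizing every segment, not only the file name, keeps one client's identifiers from addressing another client's folder.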

External Services

LOG_SERVICE_URL=http://log-service:8001
LOG_SERVICE_TIMEOUT=10
LOG_SERVICE_ENABLE_FALLBACK=true
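
LOG_SERVICE_ENABLE_FALLBACK suggests that logging continues locally when the log service is unreachable. A minimal sketch of that behavior follows. The function log_with_fallback and the send_remote callable are hypothetical, and the real DocumentUploadLogClient may behave differently.

```python
import logging
from typing import Callable

local_logger = logging.getLogger("document-upload-service")


def log_with_fallback(
    message: str,
    send_remote: Callable[[str], None],
    enable_fallback: bool = True,
) -> str:
    """Try the remote log service first; return which path handled the message."""
    try:
        send_remote(message)
        return "remote"
    except Exception:
        if not enable_fallback:
            raise  # fallback disabled: surface the failure to the caller
        local_logger.warning("log service unavailable, logged locally: %s", message)
        return "local"
```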

Installation & Deployment

Using Docker (Recommended)

Build the service:

docker build -t document-upload-service .

Run with Docker Compose:

docker-compose up -d

Check service health:

curl http://localhost:8008/api/v1/health/ping

Manual Installation

Install system dependencies:

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler

Install Python dependencies:

pip install -r requirements.txt

Run the service:

python run.py

System Dependencies

Required for OCR Processing

Tesseract OCR: Text recognition engine
poppler-utils: PDF processing utilities (pdftoppm, pdfinfo)
Python packages: pytesseract, pdf2image, Pillow
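
Before enabling OCR it can help to confirm that the binaries above are on the PATH. The sketch below does this with shutil.which from the standard library; missing_binaries is an illustrative helper and not part of the service.

```python
import shutil
from typing import Callable, Optional, Sequence

# Executables named in this README: Tesseract plus two poppler-utils tools.
REQUIRED_BINARIES = ("tesseract", "pdftoppm", "pdfinfo")


def missing_binaries(
    names: Sequence[str] = REQUIRED_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the required executables that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    print("missing:", ", ".join(missing) if missing else "none")
```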

Required for Document Processing

MarkItDown: Microsoft's document conversion library
python-magic: File type detection
httpx: Async HTTP client

Usage Examples

Single Document Upload

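import asyncio

import httpx


async def upload_document() -> dict:
    # Form fields sent alongside the file
    data = {
        "client_id": "client123",
        "case_id": "case456",
        "document_type": "contract",
    }

    # Open the file in a context manager so the handle is closed after the request
    with open("document.pdf", "rb") as fh:
        files = {"file": ("document.pdf", fh, "application/pdf")}
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "http://localhost:8008/api/v1/upload",
                files=files,
                data=data,
            )

    # Raise on 4xx/5xx instead of silently parsing an error body
    response.raise_for_status()
    return response.json()


result = asyncio.run(upload_document())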

Check Upload Status

curl -X GET "http://localhost:8008/api/v1/upload/status/document_id_here?include_details=true"

Get Service Configuration

curl -X GET "http://localhost:8008/api/v1/status/config?include_capabilities=true"

Processing Capabilities

Supported File Formats

Documents: PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS
Text: TXT, MD, HTML, HTM
Images: JPG, JPEG, PNG, TIFF, TIF, BMP, GIF, WEBP, SVG

Processing Methods

MarkItDown Conversion: High-quality conversion for Office documents and PDFs
OCR Processing: Tesseract-based text extraction for scanned documents
Intelligent Fallback: Automatic fallback from MarkItDown to OCR when needed
Quality Assessment: Processing quality scoring and validation

Advanced Features

Parallel OCR: Multi-page PDF processing with configurable parallelism
Image Preprocessing: Automatic image enhancement for better OCR results
Content Validation: Multi-layer file validation and security checks
Duplicate Detection: SHA256-based content hashing for duplicate prevention

Monitoring & Health Checks

Health Endpoints

Ping: /api/v1/health/ping - Basic availability check
Detailed: /api/v1/health/detailed - Component status and metrics

Service Statistics

Total uploads processed
Success/failure rates
Processing time metrics
Component health status
Configuration validation

Component Monitoring

Supabase connectivity
MarkItDown availability
OCR system status (Tesseract, poppler-utils)
Storage system health
External service connectivity

Error Handling

Structured Error Responses

{
  "error": true,
  "error_id": "document-upload-service-1673123456-0001",
  "error_type": "validation",
  "message": "File size exceeds maximum limit",
  "timestamp": "2023-01-08T10:30:56Z",
  "resolution_hints": [
    "Check file size requirements",
    "Compress the file before uploading"
  ]
}

Error Categories

Validation: File format, size, content validation errors
Processing: Document processing and conversion errors
Storage: File storage and retrieval errors
External: External service connectivity errors
System: Internal system and configuration errors
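
The response shape and categories above could be produced by a small builder like the one below. Several details are inferred rather than documented: the error_id layout (service name, Unix timestamp, four-digit sequence) is read off the single example, and only the "validation" category string appears in this README, so the other lowercase values are assumptions. ErrorCategory and build_error are hypothetical names.

```python
import itertools
from datetime import datetime, timezone
from enum import Enum
from typing import Iterable

SERVICE_NAME = "document-upload-service"
_sequence = itertools.count(1)  # per-process counter for the ID suffix


class ErrorCategory(str, Enum):
    VALIDATION = "validation"
    PROCESSING = "processing"  # assumed spelling
    STORAGE = "storage"        # assumed spelling
    EXTERNAL = "external"      # assumed spelling
    SYSTEM = "system"          # assumed spelling


def build_error(category: ErrorCategory, message: str, hints: Iterable[str] = ()) -> dict:
    """Build a client-safe error payload in the shape shown above."""
    now = datetime.now(timezone.utc)
    return {
        "error": True,
        "error_id": f"{SERVICE_NAME}-{int(now.timestamp())}-{next(_sequence):04d}",
        "error_type": category.value,
        "message": message,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "resolution_hints": list(hints),
    }
```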

Performance Characteristics

Throughput

Small files (< 1MB): ~50 files/minute
Medium files (1-10MB): ~20 files/minute
Large files (10-100MB): ~5 files/minute
OCR processing: ~1-2 pages/second

Resource Usage

Memory: 512MB - 2GB (depending on file sizes and OCR usage)
CPU: 0.5 - 1.0 cores (with OCR parallelism)
Storage: Temporary files cleaned up automatically
Network: Minimal external dependencies

Security Features

File Security

Executable file blocking (configurable)
Content-based validation
Magic number verification
Suspicious pattern detection
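
Magic number verification compares the first bytes of a file with the signature expected for its declared type. The sketch below covers a handful of well-known signatures using only the standard library. It is not the service's implementation, which according to this README relies on python-magic for file type detection, and matches_extension and looks_executable are hypothetical names.

```python
# Well-known leading byte signatures for a few of the supported formats.
MAGIC_SIGNATURES = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "jpg": b"\xff\xd8\xff",
    "jpeg": b"\xff\xd8\xff",
    "gif": b"GIF8",
}

# Leading bytes of Windows PE ("MZ") and Linux ELF executables.
EXECUTABLE_SIGNATURES = (b"MZ", b"\x7fELF")


def looks_executable(header: bytes) -> bool:
    """True if the header starts with a known executable signature."""
    return header.startswith(EXECUTABLE_SIGNATURES)


def matches_extension(extension: str, header: bytes) -> bool:
    """True if the header agrees with the declared extension.

    Extensions without a signature in the table pass through unchecked.
    """
    expected = MAGIC_SIGNATURES.get(extension.lower().lstrip("."))
    return expected is None or header.startswith(expected)
```

A file named contract.pdf whose header starts with MZ would fail matches_extension and be flagged by looks_executable, which is the kind of mismatch executable blocking is meant to catch.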

Data Protection

Storage encryption (optional)
Secure file path generation
Content hash verification
Input sanitization

Access Control

Request ID tracking
Client isolation
Rate limiting support
CORS configuration

Development

Project Structure

services/document-upload-service/
├── src/
│   ├── api/routes/          # FastAPI route handlers
│   ├── clients/             # Database and external service clients
│   ├── config/              # Configuration management
│   ├── models/              # Pydantic request/response models
│   ├── services/            # Core business logic
│   └── utils/               # Utility functions
├── tests/                   # Test suite
├── deployment/              # Deployment configurations
├── docs/                    # Documentation
├── Dockerfile               # Container configuration
├── docker-compose.yml       # Multi-service deployment
├── requirements.txt         # Python dependencies
└── run.py                   # Service entry point

Testing

# Run all tests
python -m pytest tests/

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

Development Mode

# Run with auto-reload
uvicorn src.main:app --host 0.0.0.0 --port 8008 --reload

Troubleshooting

Common Issues

OCR Not Working

- Verify Tesseract installation: tesseract --version
- Check poppler-utils: pdftoppm -h
- Verify Python packages: pip list | grep tesseract

MarkItDown Conversion Failures

- Check supported file formats
- Verify file integrity
- Review processing timeout settings

Storage Issues

- Verify Supabase configuration
- Check storage bucket permissions
- Validate connection settings

Performance Issues

- Monitor memory usage during OCR
- Adjust parallel processing limits
- Check file size configurations

Logs and Debugging

Service logs: /app/logs/ (in container)
Debug mode: Set DEBUG_MODE=true
Log levels: ERROR, WARNING, INFO, DEBUG
Structured logging with request ID tracking

License

Support

For technical support, configuration assistance, or bug reports, please contact the development team or refer to the comprehensive API documentation.
