This is a complete batch document processing system that can extract information from large numbers of documents using AI. It's designed to handle hundreds or thousands of documents efficiently by processing them in batches, using Gemini's API to extract structured data, and storing the results in a database.
Perfect for:
- Processing large document collections
- Extracting structured data from unstructured documents
- Automating document analysis workflows
- Converting documents to searchable, structured formats
To run the system you need:
- Python 3.8 or higher
- A Gemini API key
- A MongoDB URI
- An Exosphere account and API key
```shell
# Navigate to the project directory
cd batch-process-docs

# Copy the environment template
cp env.example .env

# Edit .env with your credentials
# You'll need: EXOSPHERE_API_KEY, GEMINI_API_KEY, DATABASE_URL

# Setup env
uv init

# Register your runtime with Exosphere
uv run register.py

# Create the processing workflow template
uv run create_graph.py

# Process sample documents
uv run trigger_graph.py
```

The system processes documents in 7 main steps:
- CSV Input: Reads a list of document file paths from a CSV file
- Chunking: Groups documents into batches (e.g., 10 documents per batch)
- AI Processing: Sends each batch to Gemini for information extraction
- Polling: Waits for Gemini to complete processing
- Validation: Checks that the extracted data is valid JSON
- Database Storage: Saves the results to your database
- Error Handling: Creates retry files for any failed documents
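The first two steps, reading the CSV and chunking, can be sketched in plain Python. This is a minimal illustration, not the actual node code; the function names are hypothetical:

```python
import csv


def read_document_paths(csv_path):
    """Step 1: read the list of document file paths from the input CSV."""
    with open(csv_path, newline="") as f:
        return [row["file_path"] for row in csv.DictReader(f)]


def chunk_documents(paths, batch_size=10):
    """Step 2: group document paths into fixed-size batches."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
```

With the default batch size of 10, a 25-document CSV becomes three batches of 10, 10, and 5.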
Create a CSV file with your document paths:
```
file_path
/path/to/document1.pdf
/path/to/document2.docx
/path/to/document3.txt
```

Edit the processing prompt in `processing_prompt_template.txt` to tell the AI what information to extract from your documents.
```shell
# Process your documents
python trigger_graph.py
```

- View processed data in your database
- Check `logs/app.log` for processing details
- Look in the `failures/` folder for any documents that need retry
- Extract key information from contracts, agreements, and legal documents
- Process financial documents and reports
- Extract structured data from academic papers
- Automate invoice and receipt data extraction
- Extract patient information from medical documents
The system is built using modular components (called "nodes") that work together:
- CSV Input: Reads your list of document file paths
- Chunking: Groups documents into manageable batches
- Batch Processing: Sends document batches to Gemini for analysis
- Polling: Waits for AI processing to complete
- Validation: Ensures extracted data is properly formatted
- Error Handling: Manages failures and creates retry files
- Database Write: Saves all extracted information to your database
More detail: Workflow
- Process hundreds of documents at once
- Configurable batch sizes (default: 10 documents per batch)
- Parallel processing for maximum speed
- Uses Gemini's advanced language models
- Customizable prompts for different document types
- Handles various file formats (PDF, DOCX, TXT, etc.)
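Customizing the extraction prompt can be as simple as a template with placeholders. The template text and placeholder name below are illustrative; the real prompt lives in `processing_prompt_template.txt`:

```python
from string import Template

# Hypothetical prompt; the actual one is in processing_prompt_template.txt
PROMPT = Template(
    "Extract the following fields from each document as JSON: $fields.\n"
    "Return one JSON object per document, keyed by file path."
)


def build_prompt(fields):
    """Fill the extraction prompt with the fields you care about."""
    return PROMPT.substitute(fields=", ".join(fields))
```

Swapping the field list is how the same pipeline adapts to contracts, invoices, or medical records.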
- MongoDB database for secure storage
- Flexible JSON format for any data structure
- Built-in indexing for fast queries
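A hedged sketch of how extracted results might be shaped before writing to MongoDB. The field names here are assumptions, not the system's actual schema; the insert itself would use pymongo's `insert_many`:

```python
from datetime import datetime, timezone


def build_records(batch_results):
    """Shape extracted data into MongoDB-ready documents.

    batch_results: dict mapping file path -> extracted JSON dict.
    Field names are illustrative, not the system's actual schema.
    """
    now = datetime.now(timezone.utc)
    return [
        {"file_path": path, "extracted": data, "processed_at": now}
        for path, data in batch_results.items()
    ]

# With pymongo, the write would look roughly like:
#   collection.insert_many(build_records(results))
#   collection.create_index("file_path")  # fast lookups by path
```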
- Automatic retry for failed documents
- Detailed logging for troubleshooting
- Failure reports for manual review
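Retry files can be plain CSVs in the same `file_path` format as the input, so a failed run feeds straight back into the pipeline. A minimal sketch (the real system writes these under `failures/`):

```python
import csv


def write_failure_csv(failed_paths, out_path):
    """Write failed document paths to a retry CSV in the input format."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_path"])
        writer.writerows([p] for p in failed_paths)
    return out_path
```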
- Web API for external systems
- Simple CSV input format
- RESTful endpoints for automation
`logs/app.log`: Main application logs
- Database connectivity
- Gemini API availability
- File system access
- Documents with invalid data are logged
- Failure CSV is created for retry
- Detailed error messages and context
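Validation can be a strict JSON parse that returns either the data or an error message to log. A minimal sketch, assuming the model is expected to return a JSON object:

```python
import json


def validate_extraction(raw_text):
    """Return (data, None) for valid JSON, or (None, error message) otherwise."""
    try:
        data = json.loads(raw_text)
        if not isinstance(data, dict):
            return None, f"expected a JSON object, got {type(data).__name__}"
        return data, None
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc}"
```

Documents whose output fails this check are the ones that end up in the retry CSV.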
- Gemini API failures are handled gracefully
- Tasks are requeued with exponential backoff
- Timeout handling for long-running tasks
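Exponential backoff for requeued tasks can be sketched as follows. The base delay and cap are assumptions for illustration, not the system's actual settings:

```python
import random


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Delay in seconds before retry `attempt` (0-based): doubles each time,
    capped so long outages don't produce hour-long waits."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)  # jitter avoids thundering herds
```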
- Connection issues are logged
- Partial writes are handled appropriately
- Transaction rollback on critical failures
- Stored as secrets in Exosphere
- No hardcoded credentials
- Environment variable management
- Document paths are logged (ensure no sensitive data)
- Extracted data stored securely
- Access controls on database
- CSV file format validation
- File path sanitization
- Prompt injection prevention
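Path sanitization can reject traversal attempts and unexpected file types before anything is read. A sketch with an illustrative allow-list and root directory, not the system's actual checks:

```python
from pathlib import Path

ALLOWED = {".pdf", ".docx", ".txt"}  # illustrative allow-list


def sanitize_path(raw, root="/data/docs"):
    """Resolve `raw` against `root` and ensure it stays inside `root`
    with an allowed extension. Raises ValueError otherwise."""
    path = Path(root, raw).resolve()
    if not str(path).startswith(str(Path(root).resolve())):
        raise ValueError(f"path escapes root: {raw}")
    if path.suffix.lower() not in ALLOWED:
        raise ValueError(f"unsupported file type: {path.suffix}")
    return path
```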
- Problem: The system can't find your document list file
- Solution:
- Check the file path in your CSV
- Make sure the file exists and you have read permissions
- Use absolute paths (e.g., `/full/path/to/file.pdf`) instead of relative paths
- Problem: Gemini API calls are failing
- Solutions:
- Verify your Gemini API key is correct
- Check if you have sufficient API credits
- Ensure you're not hitting rate limits (try smaller batch sizes)
- Check Gemini's service status
- Problem: Can't connect to your database
- Solutions:
- Verify your DATABASE_URL is correct
- Check if your database server is running
- Ensure your database allows connections from your IP
- Test connection with a database client
- Problem: AI extracted data doesn't match expected format
- Solutions:
- Review and improve your processing prompt
- Check if your documents are readable (not corrupted)
- Try processing a smaller batch first
- Look at the extracted data in logs to see what went wrong
- Check the Logs:

```shell
# View the main log file
tail -f logs/app.log

# Look for error messages
grep -i error logs/app.log
```

- Verify Your Setup:

```shell
# Test your environment variables
python -c "import os; print('API keys loaded:', bool(os.getenv('GEMINI_API_KEY')))"
```

- Test with Sample Data: always test with a small batch first. Create a test CSV with just 2-3 documents, run the processing, and check the results.
If you're still stuck:
- Check the Logs: Look in `logs/app.log` for detailed error messages
- Review Configuration: Double-check your `.env` file and API keys
- Test Components: Try each step individually to isolate the issue
- Start Small: Process just 1-2 documents first to verify everything works
- Check Service Status: Verify Gemini and Exosphere services are running
- Exosphere Documentation: Check the official Exosphere docs
- Gemini API Docs: Review Gemini's API documentation
- Database Issues: Consult your database provider's documentation
- Log Analysis: The log files contain detailed information about what went wrong
Once you have the system working:
- Customize for Your Use Case
  - Modify the processing prompt in `processing_prompt_template.txt`
  - Adjust batch sizes based on your document types
  - Set up custom validation rules
- Monitor and Optimize
  - Set up monitoring for your processing jobs
  - Optimize database queries for your data patterns
  - Track processing times and success rates
- Production Considerations
  - Implement proper authentication and authorization
  - Set up automated backups
  - Configure alerting for failures
  - Consider scaling options for large document volumes
- Testing and Quality
  - Create comprehensive test datasets
  - Implement automated testing
  - Set up quality assurance processes
  - Document your specific workflows
You now have a complete document processing system that can handle large volumes of documents efficiently. Start with small batches, monitor the results, and gradually scale up as you become comfortable with the system.