Sync Document Processing - Getting Started Guide

What is this project?

This is a complete sync document processing system that extracts information from documents using AI. Each document is handled individually through Gemini's real-time API, so results are available immediately after a document is processed.

Perfect for:

  • Processing documents that need immediate feedback
  • Real-time document analysis workflows
  • Converting documents to searchable, structured formats with instant results
  • Applications where you need to process documents as they arrive

Key Difference from Batch Processing

Unlike the batch processing version, this system:

  • Processes one document at a time using Gemini's real-time API
  • Returns results immediately for each document
  • Never waits for batch completion - results are available as soon as each document is processed
  • Is better suited to real-time applications that need instant feedback

Quick Start

Prerequisites

  • Python 3.8 or higher
  • Gemini API key
  • PostgreSQL database
  • Exosphere account and API key

1. Clone and Setup

# Navigate to the project directory
cd sync-process-docs

# Copy the environment template
cp env.example .env

# Edit .env with your credentials
# You'll need: EXOSPHERE_API_KEY, GEMINI_API_KEY, DATABASE_URL

Initialize the environment with uv:

uv init

2. Register with Exosphere

# Register your runtime with Exosphere
uv run register.py

3. Create Workflow Template

# Create the processing workflow template
uv run create_graph.py

4. Test with Sample Data

# Process sample documents
uv run trigger_graph.py

How It Works

The system processes documents in 6 main steps:

  1. CSV Input: Reads a list of document file paths from a CSV file
  2. File Distribution: Creates individual processing tasks for each document
  3. Sync Processing: Sends each document to Gemini's real-time API for immediate processing
  4. Validation: Checks that the extracted data is valid JSON
  5. Database Storage: Saves the results to your database immediately
  6. Error Handling: Creates retry files for any failed documents
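The six steps above can be sketched in Python roughly as follows. This is an illustrative outline, not the project's actual code; the function names and the callback style are assumptions made for the sketch:

```python
import csv
import json

def load_file_paths(csv_path):
    """Step 1: read document paths from the input CSV (expects a 'file_path' column)."""
    with open(csv_path, newline="") as f:
        return [row["file_path"] for row in csv.DictReader(f)]

def validate_extraction(raw_text):
    """Step 4: confirm the model output is valid JSON; return the parsed data or None."""
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        return None

def process_documents(csv_path, extract_fn, store_fn, on_failure):
    """Steps 2-6: fan out per document, extract, validate, store, and collect failures."""
    for path in load_file_paths(csv_path):
        raw = extract_fn(path)           # step 3: real-time Gemini call
        data = validate_extraction(raw)  # step 4: JSON validation
        if data is None:
            on_failure(path)             # step 6: record for retry
        else:
            store_fn(path, data)         # step 5: immediate DB write
```

Because each document flows through the full pipeline on its own, a failure in one document never delays the others.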

Your First Document Processing Job

Step 1: Prepare Your Documents

Create a CSV file with your document paths:

file_path
/path/to/document1.pdf
/path/to/document2.docx
/path/to/document3.txt
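If you generate this file programmatically, Python's standard csv module is enough; the output filename and paths below are placeholders:

```python
import csv

# Placeholder document paths; replace with your real files
paths = [
    "/path/to/document1.pdf",
    "/path/to/document2.docx",
    "/path/to/document3.txt",
]

with open("documents.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_path"])  # header row expected by the pipeline
    writer.writerows([p] for p in paths)
```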

Step 2: Configure Your Processing

Edit the processing prompt in processing_prompt_template.txt to tell the AI what information to extract from your documents.
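Loading the template and combining it with a document's text might look like the sketch below. The exact interpolation scheme depends on your processing_prompt_template.txt; simple concatenation with a separator is an assumption here:

```python
def build_prompt(template_path, document_text):
    """Load the extraction prompt and append the document content.

    Assumes the template is plain text; adjust if your template uses
    placeholders that must be substituted instead.
    """
    with open(template_path) as f:
        template = f.read()
    return f"{template}\n\n---\n\n{document_text}"
```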

Step 3: Run the Processing

# Process your documents
uv run trigger_graph.py

Step 4: Check Results

  • View processed data in your database immediately as each document completes
  • Check logs/app.log for processing details
  • Look in failures/ folder for any documents that need retry

Common Use Cases

Real-time Document Processing

Process documents as they arrive in your system.

Interactive Document Analysis

Get immediate feedback on document content.

Live Document Monitoring

Monitor and process documents in real time.

Immediate Data Extraction

Extract structured data from documents instantly.

Workflow Components

The system is built using modular components (called "nodes") that work together:

Input Processing

  • CSV Input: Reads your list of document file paths
  • File Distribution: Creates individual processing tasks for each document

AI Processing

  • Sync Processing: Sends each document to Gemini's real-time API for immediate analysis

Quality Control

  • Validation: Ensures extracted data is properly formatted
  • Error Handling: Manages failures and creates retry files
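The retry-file behavior might look like the sketch below. The timestamped filename and the layout of the failures/ folder are assumptions; only the folder itself is mentioned elsewhere in this guide:

```python
import csv
import os
from datetime import datetime, timezone

def write_failure_csv(failed_paths, failures_dir="failures"):
    """Write failed document paths to a timestamped CSV that can be
    fed back into the pipeline as a new input file."""
    os.makedirs(failures_dir, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    out_path = os.path.join(failures_dir, f"retry_{stamp}.csv")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_path"])  # same schema as the input CSV
        writer.writerows([p] for p in failed_paths)
    return out_path
```

Keeping the retry file in the same schema as the input CSV means reprocessing is just another run against that file.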

Data Storage

  • Database Write: Saves all extracted information to your database immediately
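A sketch of the immediate-write step is shown below, using sqlite3 as a stand-in so the example is self-contained; in the real system this would be a PostgreSQL connection built from DATABASE_URL, and the table name is hypothetical:

```python
import json
import sqlite3

def store_result(conn, file_path, extracted):
    """Insert one document's extracted JSON as soon as it is validated,
    so results are queryable without waiting for other documents."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_documents "
        "(file_path TEXT PRIMARY KEY, data TEXT)"
    )
    conn.execute(
        "INSERT INTO processed_documents (file_path, data) VALUES (?, ?)",
        (file_path, json.dumps(extracted)),
    )
    conn.commit()
```

Committing per document is what makes results visible immediately; a batch system would instead commit once at the end.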

More detail: Workflow

Key Features

Real-time Processing

  • Process documents one at a time for immediate results
  • No waiting for batch completion
  • Perfect for real-time applications

AI-Powered Extraction

  • Uses Gemini's real-time API for instant processing
  • Customizable prompts for different document types
  • Handles various file formats (PDF, DOCX, TXT, etc.)

Immediate Data Storage

  • PostgreSQL database for secure storage
  • Results saved as soon as each document is processed
  • Built-in indexing for fast queries

Robust Error Handling

  • Automatic retry for failed documents
  • Detailed logging for troubleshooting
  • Failure reports for manual review

Easy Integration

  • Web API for external systems
  • Simple CSV input format
  • RESTful endpoints for automation

Monitoring and Logging

Log Files

  • logs/app.log: Main application logs

Health Checks

  • Database connectivity
  • Gemini API availability
  • File system access
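A routine covering these three dependencies could be sketched as follows; the check callables are passed in so each environment wires up its own probes (the names here are illustrative):

```python
def run_health_checks(checks):
    """Run named check callables; return {name: (ok, detail)} without
    letting one failing check abort the others."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = (True, "ok")
        except Exception as exc:
            results[name] = (False, str(exc))
    return results
```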

Error Handling

Validation Failures

  • Documents with invalid data are logged
  • Failure CSV is created for retry
  • Detailed error messages and context

API Failures

  • Gemini API failures are handled gracefully
  • Tasks are requeued with exponential backoff
  • Timeout handling for long-running tasks
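The requeue-with-exponential-backoff behavior might be sketched like this; the delay values and retry count are illustrative, not the system's actual configuration:

```python
import time

def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky call with exponential backoff: 1s, 2s, 4s, ...
    Re-raises the last error once retries are exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
```

Doubling the delay on each attempt gives a transient API outage time to clear without hammering the service.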

Database Failures

  • Connection issues are logged
  • Partial writes are handled appropriately
  • Transaction rollback on critical failures

Security

API Keys

  • Stored as secrets in Exosphere
  • No hardcoded credentials
  • Environment variable management

Data Privacy

  • Document paths are logged (ensure no sensitive data)
  • Extracted data stored securely
  • Access controls on database

Input Validation

  • CSV file format validation
  • File path sanitization
  • Prompt injection prevention
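Path sanitization might look like the sketch below; the allowed root directory and the extension whitelist are assumptions chosen for the example:

```python
from pathlib import Path

# Hypothetical whitelist matching the formats this guide mentions
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}

def is_safe_path(raw_path, allowed_root="/data/documents"):
    """Reject paths that escape the allowed root (e.g. via '..')
    or that have an unexpected file type."""
    path = Path(raw_path).resolve()
    root = Path(allowed_root).resolve()
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return False
    try:
        path.relative_to(root)  # raises ValueError if outside the root
    except ValueError:
        return False
    return True
```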

Troubleshooting Guide

Common Issues and Solutions

"CSV file not found" Error

  • Problem: The system can't find your document list file
  • Solution:
    • Check the file path in your CSV
    • Make sure the file exists and you have read permissions
    • Use absolute paths (e.g., /full/path/to/file.pdf) instead of relative paths

"Gemini API Error" Messages

  • Problem: Gemini API calls are failing
  • Solutions:
    • Verify your Gemini API key is correct
    • Check if you have sufficient API credits
    • Ensure you're not hitting rate limits
    • Check Gemini's service status

"Database Connection Failed"

  • Problem: Can't connect to your database
  • Solutions:
    • Verify your DATABASE_URL is correct
    • Check if your database server is running
    • Ensure your database allows connections from your IP
    • Test connection with a database client

"Validation Failed" Errors

  • Problem: AI extracted data doesn't match expected format
  • Solutions:
    • Review and improve your processing prompt
    • Check if your documents are readable (not corrupted)
    • Try processing a single document first
    • Look at the extracted data in logs to see what went wrong

Debugging Steps

  1. Check the Logs

    # View the main log file
    tail -f logs/app.log
    
    # Look for error messages
    grep -i error logs/app.log
  2. Verify Your Setup

    # Test your environment variables
    python -c "import os; print('API keys loaded:', bool(os.getenv('GEMINI_API_KEY')))"
  3. Test with Sample Data

    # Always test with a single document first
    # Create a test CSV with just 1 document
    # Run the processing and check results

Getting Help

If you're still stuck:

  1. Check the Logs: Look in logs/app.log for detailed error messages
  2. Review Configuration: Double-check your .env file and API keys
  3. Test Components: Try each step individually to isolate the issue
  4. Start Small: Process just 1 document first to verify everything works
  5. Check Service Status: Verify Gemini and Exosphere services are running

Support Resources

  • Exosphere Documentation: Check the official Exosphere docs
  • Gemini API Docs: Review Gemini's API documentation
  • Database Issues: Consult your database provider's documentation
  • Log Analysis: The log files contain detailed information about what went wrong

Next Steps

Once you have the system working:

  1. Customize for Your Use Case

    • Modify the processing prompt in processing_prompt_template.txt
    • Set up custom validation rules
    • Configure database schema for your data
  2. Monitor and Optimize

    • Set up monitoring for your processing jobs
    • Optimize database queries for your data patterns
    • Track processing times and success rates
  3. Production Considerations

    • Implement proper authentication and authorization
    • Set up automated backups
    • Configure alerting for failures
    • Consider scaling options for large document volumes
  4. Testing and Quality

    • Create comprehensive test datasets
    • Implement automated testing
    • Set up quality assurance processes
    • Document your specific workflows

You're Ready!

You now have a complete sync document processing system that can handle documents in real-time. Start with a single document, monitor the results, and gradually scale up as you become comfortable with the system.