MrBozkay/langraph_extract_agent

🚀 LangGraph Extract Agent

License: MIT Python 3.11+ Docker

Production-ready extraction pipeline using LangExtract, LangGraph, and MinIO for structured German business data extraction from Impressum/About pages.

✨ Features

  • 🤖 LangExtract Integration - Google's structured extraction with German business prompts
  • 🔄 LangGraph Workflow - State-based orchestration
  • 📦 MinIO Storage - S3-compatible object storage
  • ⚡ Parallel Processing - 5 workers with thread pool
  • 🔁 Retry Logic - Exponential backoff (3 attempts)
  • 🚦 Rate Limiting - API quota protection (20 req/min)
  • 📊 Statistics - Real-time metrics & JSON reports
  • 🐳 Docker Ready - Full containerization
  • 🔌 Model Flexibility - Gemini API or Ollama (local)
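
The retry behaviour listed above (exponential backoff, 3 attempts) can be sketched roughly as follows. This is an illustration only, not the repository's code: the helper name `retry_with_backoff`, the 1-second base delay, and the injectable `sleep` parameter are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,        # mirrors the "3 attempts" in the feature list
    base_delay: float = 1.0,  # assumed starting delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, waiting base_delay * 2**n seconds after the n-th failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the defaults, a call that fails twice and then succeeds waits 1 second and then 2 seconds before returning.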

🚀 Quick Start

Docker (Recommended)

# Clone repository
git clone https://github.com/MrBozkay/langraph_extract_agent.git
cd langraph_extract_agent

# Configure environment
cp .env.example .env
# Edit .env with your GOOGLE_API_KEY and MinIO credentials

# Run with Docker
docker-compose up --build

Manual Setup

# Setup virtual environment
./setup_venv.sh
source venv/bin/activate

# Configure
cp .env.example .env
nano .env  # Add your credentials

# Run production batch
python src/agents/run_batch_production.py

📊 Agent Selection

| Agent | Use Case | Speed | Features |
|---|---|---|---|
| Production Batch | Production deployment | ⚡⚡⚡ | Parallel, Retry, Rate limit, Limited objects |
| LangGraph | State tracking | ⚡ | Workflow visualization, Limited objects |
| Simple Batch | Testing/Debug | ⚡ | Easy to understand |

💡 Recommendation: Use run_batch_production.py for production.
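
To show how parallel workers and a rate limit can work together, here is a minimal sketch. It does not reproduce run_batch_production.py; `RateLimiter`, `process_all`, and the `extract` callable are hypothetical names, and the defaults (5 workers, 20 requests per minute) are taken from the values quoted in this README.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Spaces calls at least 60 / per_minute seconds apart across threads."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def acquire(self):
        # Reserve the next free time slot under the lock, then wait outside it
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        wait = slot - now
        if wait > 0:
            self._sleep(wait)


def process_all(keys, extract, max_workers=5, per_minute=20):
    """Run extract(key) for every key in a thread pool, honouring the limit."""
    limiter = RateLimiter(per_minute)

    def task(key):
        limiter.acquire()
        return extract(key)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as keys
        return list(pool.map(task, keys))
```

Reserving a slot inside the lock and sleeping outside it keeps workers from blocking each other while they wait.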

🔧 New Features:

  • list_objects() with limit parameter (default: 50)
  • recursive=False for non-recursive listing (default: True)
  • Command-line parameters for test_minio.py
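
A limited listing like the one described above could look roughly like this. It is a sketch under assumptions: the wrapper signature is guessed from the bullet points, and `client` is expected to behave like a `minio.Minio` client, whose `list_objects(bucket_name, prefix=..., recursive=...)` returns an iterator of objects with an `object_name` attribute.

```python
from itertools import islice
from typing import Any, List, Optional


def list_objects(client: Any, bucket: str, prefix: Optional[str] = None,
                 limit: int = 50, recursive: bool = True) -> List[str]:
    """Return at most `limit` object names from `bucket`."""
    objects = client.list_objects(bucket, prefix=prefix, recursive=recursive)
    # islice stops consuming the iterator once `limit` objects have been read,
    # so a large bucket is never listed in full
    return [obj.object_name for obj in islice(objects, limit)]
```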

🎯 Example Output

{
  "owner_name": "Hans Müller",
  "position": "Geschäftsführer",
  "company_name": "Mustermann GmbH",
  "email": "h.mueller@mustermann.de",
  "phone": "+49 123 456789",
  "website": "www.mustermann.de",
  "sector": "Consulting"
}
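
For downstream code, a record shaped like the example output can be loaded into a typed structure. The sketch below is an assumption about how one might do that: `BusinessRecord` and `parse_record` are hypothetical names, and treating every field as optional is a guess rather than a documented schema.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BusinessRecord:
    # Field names follow the example output; all are assumed optional
    owner_name: Optional[str] = None
    position: Optional[str] = None
    company_name: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    website: Optional[str] = None
    sector: Optional[str] = None


def parse_record(raw: str) -> BusinessRecord:
    """Parse one JSON object, ignoring keys the dataclass does not know."""
    data = json.loads(raw)
    known = {f.name for f in fields(BusinessRecord)}
    return BusinessRecord(**{k: v for k, v in data.items() if k in known})
```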

📚 Documentation

🔧 Configuration

# MinIO (Remote or Local)
MINIO_ENDPOINT=your-minio-server:9000
MINIO_ACCESS_KEY=your-access-key
MINIO_SECRET_KEY=your-secret-key
MINIO_BUCKET_NAME=your-bucket

# LLM Model
GOOGLE_API_KEY=your-gemini-api-key
LANGEXTRACT_MODEL=gemini-2.0-flash-exp

# Production Settings
EXTRACTION_MAX_WORKERS=5
EXTRACTION_RETRY_COUNT=3
RATE_LIMIT_REQUESTS_PER_MINUTE=20
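
These variables could be read into a typed settings object along the following lines. This is a sketch, not the project's loader: `Settings` and `load_settings` are hypothetical, and using the example values above as fallback defaults is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    minio_endpoint: str
    minio_bucket: str
    model: str
    max_workers: int
    retry_count: int
    requests_per_minute: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables (defaults are assumptions)."""
    return Settings(
        minio_endpoint=env["MINIO_ENDPOINT"],    # required: KeyError if unset
        minio_bucket=env["MINIO_BUCKET_NAME"],   # required: KeyError if unset
        model=env.get("LANGEXTRACT_MODEL", "gemini-2.0-flash-exp"),
        max_workers=int(env.get("EXTRACTION_MAX_WORKERS", "5")),
        retry_count=int(env.get("EXTRACTION_RETRY_COUNT", "3")),
        requests_per_minute=int(env.get("RATE_LIMIT_REQUESTS_PER_MINUTE", "20")),
    )
```

Converting the numeric values once at startup means a typo such as EXTRACTION_MAX_WORKERS=five fails immediately with a ValueError instead of surfacing later in a worker.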

🐳 Docker Commands

# Build image
docker build -t langraph-extract-agent .

# Run with docker-compose
docker-compose up -d

# View logs
docker-compose logs -f extraction-app

# Stop
docker-compose down

🧪 Testing

# Test MinIO connection (with parameters)
python test_minio.py --help                    # Show help
python test_minio.py                           # Default: 5 files, non-recursive
python test_minio.py --recursive --limit 20     # Recursive with 20 files
python test_minio.py --prefix "folder/" --limit 10  # Custom prefix

# Test extraction
python test_extraction.py

# Test production features
python test_production_features.py
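
The test_minio.py flags shown above could be declared with argparse roughly as follows. The flag names and the default of 5 files with non-recursive listing come from the comments above; the function name `build_parser` and the help texts are assumptions, and the real script may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the test_minio.py flags listed above."""
    parser = argparse.ArgumentParser(description="Test the MinIO connection")
    parser.add_argument("--limit", type=int, default=5,
                        help="maximum number of files to list")
    parser.add_argument("--recursive", action="store_true",
                        help="descend into sub-prefixes")
    parser.add_argument("--prefix", default="",
                        help="only list objects under this prefix")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"limit={args.limit} recursive={args.recursive} prefix={args.prefix!r}")
```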

📈 Performance

  • Processing Speed: ~2-3 files/second (5 workers)
  • Success Rate: >95% (with retry logic)
  • Extraction Time: ~2-5 seconds per file
  • Memory Usage: Optimized with object limiting
  • Network Efficiency: Non-recursive listing by default

🔄 Ollama Support

Switch to local inference:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gpt-oss:20b

# Update .env
LANGEXTRACT_MODEL=ollama/gpt-oss:20b

🀝 Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built with ❤️ for German business data extraction
