Production-ready extraction pipeline using LangExtract, LangGraph, and MinIO for structured German business data extraction from Impressum/About pages.
- **LangExtract Integration** - Google's structured extraction library with German business prompts
- **LangGraph Workflow** - State-based orchestration
- **MinIO Storage** - S3-compatible object storage
- **Parallel Processing** - 5 workers via a thread pool
- **Retry Logic** - Exponential backoff (3 attempts)
- **Rate Limiting** - API quota protection (20 req/min)
- **Statistics** - Real-time metrics & JSON reports
- **Docker Ready** - Full containerization
- **Model Flexibility** - Gemini API or Ollama (local)
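The retry behavior listed above (3 attempts with exponential backoff) can be sketched in plain Python. This is an illustrative sketch only; the function names and delay values are not taken from the repo's actual implementation:

```python
import time

def with_retry(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo: a flaky call that succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(with_retry(flaky, base_delay=0.01))  # prints: ok
```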
```bash
# Clone repository
git clone https://github.com/MrBozkay/langraph_extract_agent.git
cd langraph_extract_agent

# Configure environment
cp .env.example .env
# Edit .env with your GOOGLE_API_KEY and MinIO credentials

# Run with Docker
docker-compose up --build
```

```bash
# Setup virtual environment
./setup_venv.sh
source venv/bin/activate

# Configure
cp .env.example .env
nano .env  # Add your credentials

# Run production batch
python src/agents/run_batch_production.py
```

| Agent | Use Case | Speed | Features |
|---|---|---|---|
| Production Batch | Production deployment | ⚡⚡⚡ | Parallel, retry, rate limiting, object limiting |
| LangGraph | State tracking | ⚡ | Workflow visualization, object limiting |
| Simple Batch | Testing/Debug | ⚡ | Easy to understand |
**Recommendation:** Use `run_batch_production.py` for production.
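The 20 req/min quota protection could be enforced with a sliding-window limiter like the sketch below. The class and method names are hypothetical, not taken from the repo:

```python
import time
from collections import deque

class RateLimiter:
    """Block until a request slot is free within the sliding window."""
    def __init__(self, max_requests=20, window_seconds=60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()

    def acquire(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request leaves the window, then retry
            time.sleep(self.window - (now - self.timestamps[0]))
            return self.acquire()
        self.timestamps.append(time.monotonic())

limiter = RateLimiter(max_requests=3, window_seconds=0.2)  # tiny window for demo
start = time.monotonic()
for _ in range(4):  # the 4th call must wait for the window to slide
    limiter.acquire()
elapsed = time.monotonic() - start
print(f"4 acquires took {elapsed:.2f}s")
```

With `max_requests=20, window_seconds=60.0` this matches the 20 req/min setting.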
New Features:
- `list_objects()` with `limit` parameter (default: 50)
- `recursive=False` for non-recursive listing (default: `True`)
- Command-line parameters for `test_minio.py`
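The `limit` behavior can be sketched by wrapping a listing generator with `itertools.islice`, so the listing stops early instead of being exhausted. This only mirrors the idea; the real agent lists objects through the MinIO SDK, and the generator below is a stand-in:

```python
from itertools import islice

def fake_list_objects(prefix=""):
    """Stand-in for a MinIO listing generator (hypothetical data)."""
    for i in range(1000):
        yield f"{prefix}page_{i:04d}.html"

def list_objects(prefix="", limit=50):
    """Return at most `limit` object names without exhausting the listing."""
    return list(islice(fake_list_objects(prefix), limit))

names = list_objects(prefix="impressum/", limit=5)
print(names)  # only 5 names, even though 1000 objects exist
```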
```json
{
  "owner_name": "Hans Müller",
  "position": "Geschäftsführer",
  "company_name": "Mustermann GmbH",
  "email": "h.mueller@mustermann.de",
  "phone": "+49 123 456789",
  "website": "www.mustermann.de",
  "sector": "Consulting"
}
```

- Quick Start Guide - 5-minute setup
- Agent Selection Guide - Choose the right agent
- Production Deployment - Production best practices
- Contributing - How to contribute
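A record like the sample JSON above can be parsed and checked for required fields before storage. The dataclass below is a hypothetical sketch of such a check, not the repo's actual schema:

```python
import json
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class BusinessRecord:
    owner_name: str
    company_name: str
    position: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    website: Optional[str] = None
    sector: Optional[str] = None

def parse_record(raw: str) -> BusinessRecord:
    data = json.loads(raw)
    known = {f.name for f in fields(BusinessRecord)}
    # Keep only known keys so unexpected LLM output does not break construction;
    # missing required fields raise TypeError here.
    return BusinessRecord(**{k: v for k, v in data.items() if k in known})

record = parse_record(
    '{"owner_name": "Hans Müller", "company_name": "Mustermann GmbH", "sector": "Consulting"}'
)
print(record.company_name)  # prints: Mustermann GmbH
```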
```env
# MinIO (Remote or Local)
MINIO_ENDPOINT=your-minio-server:9000
MINIO_ACCESS_KEY=your-access-key
MINIO_SECRET_KEY=your-secret-key
MINIO_BUCKET_NAME=your-bucket

# LLM Model
GOOGLE_API_KEY=your-gemini-api-key
LANGEXTRACT_MODEL=gemini-2.0-flash-exp

# Production Settings
EXTRACTION_MAX_WORKERS=5
EXTRACTION_RETRY_COUNT=3
RATE_LIMIT_REQUESTS_PER_MINUTE=20
```

```bash
# Build image
docker build -t langraph-extract-agent .

# Run with docker-compose
docker-compose up -d

# View logs
docker-compose logs -f extraction-app

# Stop
docker-compose down
```

```bash
# Test MinIO connection (with parameters)
python test_minio.py --help                         # Show help
python test_minio.py                                # Default: 5 files, non-recursive
python test_minio.py --recursive --limit 20         # Recursive with 20 files
python test_minio.py --prefix "folder/" --limit 10  # Custom prefix

# Test extraction
python test_extraction.py

# Test production features
python test_production_features.py
```

- Processing Speed: ~2-3 files/second (5 workers)
- Success Rate: >95% (with retry logic)
- Extraction Time: ~2-5 seconds per file
- Memory Usage: Optimized with object limiting
- Network Efficiency: Non-recursive listing by default
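The 5-worker parallel stage behind these numbers can be sketched with `concurrent.futures`. The file names and the `extract_one` function are placeholders for the per-file LangExtract call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_one(name):
    """Placeholder for the per-file extraction call."""
    if name.endswith("broken.html"):
        raise ValueError("unparseable page")
    return {"file": name, "company_name": "Example GmbH"}

files = [f"page_{i}.html" for i in range(9)] + ["broken.html"]
results, failures = [], []

with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(extract_one, f): f for f in files}
    for fut in as_completed(futures):
        try:
            results.append(fut.result())
        except Exception:
            failures.append(futures[fut])

success_rate = len(results) / len(files)
print(f"success rate: {success_rate:.0%}")  # prints: success rate: 90%
```

Collecting failures separately is what makes a retry pass (and the >95% success rate) possible without re-running files that already succeeded.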
Switch to local inference:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gpt-oss:20b
```

```env
# Update .env
LANGEXTRACT_MODEL=ollama/gpt-oss:20b
```

Contributions are welcome! See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE file for details.
- LangExtract - Google's extraction library
- LangGraph - Workflow orchestration
- MinIO - S3-compatible object storage
Built with ❤️ for German business data extraction