Vision Text Extractor

Extract text from images and documents using multiple AI providers. Choose between local models (SmolVLM, LLaVA) and a cloud-based model (OpenAI GPT-4o), depending on your privacy, cost, and accuracy needs.

✨ Features

  • 🤖 3 AI Providers: Local SmolVLM/LLaVA or cloud OpenAI GPT-4o
  • 🔒 Privacy-First: Local processing keeps your data private
  • 🌐 Flexible Input: Local files or web URLs
  • 💬 Custom Prompts: Extract specific information
  • Easy Setup: One-command installation with Pixi

🚀 Quick Start

# Clone and install
git clone https://github.com/udit-asopa/vision-text-extractor.git
cd vision-text-extractor
pixi install

# Quick demo
pixi run demo-ocr-huggingface

# Use with your images  
python main.py path/to/your/image.jpg
python main.py "https://example.com/image.png"
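
The commands above accept either a local path or a web URL. How `main.py` tells the two apart is not shown in this README. As a minimal sketch of one way to do it with only the standard library, a helper could look like the following. The name `classify_source` is hypothetical and is not taken from the repository.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "url" for http(s) inputs and "file" for everything else.

    Hypothetical helper for illustration; not part of the project's API.
    """
    parsed = urlparse(source)
    # Require both an http(s) scheme and a host, so that Windows paths
    # such as C:\images\a.png are not mistaken for URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"


if __name__ == "__main__":
    for example in ("path/to/your/image.jpg", "https://example.com/image.png"):
        print(example, "->", classify_source(example))
```

Checking the scheme rather than just searching for "://" keeps unusual local file names from being sent to a downloader by accident.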

📖 Documentation

For detailed guides and tutorials, visit our Wiki.

🛠️ Installation

Prerequisites

  • Pixi package manager
  • Python 3.10+ (managed by Pixi)

Setup

git clone https://github.com/udit-asopa/vision-text-extractor.git
cd vision-text-extractor
pixi install
pixi run setup

Choose Your AI Provider

🟢 Local & Free (Recommended)

pixi run setup-smolvlm    # Hugging Face SmolVLM (~2GB)
pixi run setup-ollama     # Ollama LLaVA (~4GB)  

🟡 Cloud & Paid (Highest Accuracy)

# Add your OpenAI key to .env file
echo "OPENAI_API_KEY=your_key_here" >> .env
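
Before using the OpenAI provider, it can help to confirm that the key is actually visible to Python. The sketch below is an assumption about how such a check could be written with the standard library only; it is not the implementation behind `pixi run check-env`, and the function names `read_env_file` and `has_openai_key` are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(path: str = ".env") -> bool:
    """True if OPENAI_API_KEY is set in the environment or in the .env file."""
    key = os.environ.get("OPENAI_API_KEY") or read_env_file(path).get("OPENAI_API_KEY", "")
    # Treat the placeholder from the setup instructions as "not configured".
    return bool(key) and key != "your_key_here"


if __name__ == "__main__":
    print("OpenAI key configured:", has_openai_key())
```

The check deliberately rejects the `your_key_here` placeholder so that a copied-and-pasted setup command is not reported as a working configuration.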

💡 Basic Usage

# Extract text from any image
python main.py path/to/your/image.jpg

# Process web images
python main.py "https://example.com/document.png"

# Custom extraction prompt
python main.py receipt.jpg --prompt "Extract total amount and date"

# Try different providers
python main.py image.png --provider ollama --model llava:7b
python main.py image.png --provider openai --model gpt-4o
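
When the same options are applied to several images, the command line can be assembled programmatically. The following is a rough sketch that only uses the flags shown above (`--prompt`, `--provider`, `--model`); the helper name `build_command` is hypothetical, and the script prints the commands instead of running them.

```python
import shlex
from typing import List, Optional


def build_command(
    source: str,
    prompt: Optional[str] = None,
    provider: Optional[str] = None,
    model: Optional[str] = None,
) -> List[str]:
    """Build an argument list for main.py using the flags from this README."""
    cmd = ["python", "main.py", source]
    if prompt:
        cmd += ["--prompt", prompt]
    if provider:
        cmd += ["--provider", provider]
    if model:
        cmd += ["--model", model]
    return cmd


if __name__ == "__main__":
    # Print one command per image; pass each list to subprocess.run to execute it.
    for image in ("receipt.jpg", "image.png"):
        cmd = build_command(image, prompt="Extract total amount and date")
        print(shlex.join(cmd))
```

Building an argument list, rather than a single string, avoids quoting problems when a prompt or file name contains spaces.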

🎯 Common Use Cases

  • 📄 Business Documents: Invoices, contracts, forms, receipts
  • 🍽️ Food & Restaurants: Recipes, menus, nutrition labels
  • 💰 Finance: Bank statements, tax documents, expense reports
  • 📚 Education: Homework, research papers, lecture notes
  • 🏥 Healthcare: Prescriptions, lab results, medical forms

See our Document Processing Tutorial for detailed examples.

🔧 Quick Commands

# Demo with sample images
pixi run demo-ocr-huggingface  # SmolVLM demo
pixi run demo-ocr-ollama       # LLaVA demo  
pixi run demo-ocr-openai       # OpenAI demo

# Test your setup
pixi run test-setup            # Validate installation
pixi run check-env             # Check API keys

# Process your files
pixi run ocr_llm "my-image.jpg"
pixi run ocr_ollama "document.pdf"

🧪 Handwriting OCR Test

Run the handwriting sample test to verify the SmolVLM transcription output.

  • Using Pixi (recommended, ensures model setup):
    pixi run test-handwriting
  • Directly with Python:
    python tests/test_handwriting_ocr.py

The test runs the SmolVLM pipeline against images/handwriting_sample.webp and checks the extracted text against the expected transcription. Use the Pixi command if you haven't run pixi run setup-smolvlm yet.
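
Model output rarely matches a reference transcription character for character, so a comparison of this kind usually tolerates small differences. The sketch below shows one possible approach using `difflib` from the standard library. It is an illustration only: the actual logic in `tests/test_handwriting_ocr.py` may differ, and the function names and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def similarity(extracted: str, expected: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(extracted), normalize(expected)).ratio()


def matches(extracted: str, expected: str, threshold: float = 0.8) -> bool:
    """True if the extracted text is close enough to the expected transcription.

    The default threshold is an arbitrary value chosen for illustration.
    """
    return similarity(extracted, expected) >= threshold


if __name__ == "__main__":
    print(matches("Hello   World", "hello world"))
```

A similarity threshold keeps the test from failing on a single misread character while still catching output that is mostly wrong.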

📂 Project Structure

vision-text-extractor/
├── main.py              # Main CLI application
├── agent/tools.py       # OCR extraction tools
├── tests/               # Test scripts
├── images/              # Sample images
├── wiki_content/        # Documentation source
├── LICENSE              # MIT License
└── pixi.toml            # Dependencies & tasks

🗺️ Roadmap & Future Updates

We're actively working on exciting new features! Here's what's planned:

🚀 Next Release (v0.2.0)

  • 📊 Batch Processing: Process multiple files in one command
  • 🎯 Output Formats: JSON, CSV, XML structured output options
  • 🔄 Result Caching: Skip reprocessing of identical images
  • 📈 Progress Bars: Visual feedback for long operations

🌟 Upcoming Features

  • 🧠 More AI Providers:
    • Google Gemini Vision
    • Anthropic Claude Vision
    • Local Qwen2-VL support
  • 🎨 Image Preprocessing:
    • Auto-rotate, denoise, enhance quality
    • OCR confidence scoring
  • 🔧 Advanced Tools:
    • Table structure extraction
    • Form field detection
    • Handwriting analysis mode

🏢 Enterprise Features

  • 🔐 Enhanced Security: SOC2 compliance, audit logs
  • Performance: GPU optimization, model quantization
  • 🌐 API Server: REST API for integration
  • 📊 Analytics: Usage metrics and accuracy reporting

🎯 Long-term Vision

  • 🤖 AI Agents: Multi-step document analysis workflows
  • 🌍 Multi-language: Better support for non-English text
  • 📱 Mobile App: Companion mobile application
  • 🔌 Integrations: Direct cloud storage, CRM, ERP connections

Want to contribute? Check our Issues or suggest new features!

🤝 Contributing

We welcome contributions! Please see our Wiki for development guides and check out existing Issues.

Ways to contribute:

  • 🐛 Bug Reports: Found an issue? Let us know!
  • 💡 Feature Requests: Suggest improvements
  • 📝 Documentation: Help improve our wiki
  • 🧪 Testing: Try new features and providers
  • 💻 Code: Submit pull requests

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary:

  • Commercial use: Permitted in commercial projects
  • Modification: You may change and adapt the code
  • Distribution: You may share it with others
  • Private use: Permitted for personal projects
  • Warranty: None; the software is provided "as is"

⚠️ Privacy Notice

  • Local providers (SmolVLM, LLaVA): Your data never leaves your machine
  • OpenAI provider: Data is sent to OpenAI's servers
  • API keys: Never commit .env files to version control

Need help? Check our Wiki or create an Issue 🚀
