Extract text from images and documents using multiple AI providers. Choose between local models (SmolVLM, LLaVA) and a cloud model (OpenAI GPT-4o), depending on your privacy and accuracy needs.
- 🤖 3 AI Providers: Local SmolVLM/LLaVA or cloud OpenAI GPT-4o
- 🔒 Privacy-First: Local processing keeps your data private
- 🌐 Flexible Input: Local files or web URLs
- 💬 Custom Prompts: Extract specific information
- ⚡ Easy Setup: One-command installation with Pixi
# Clone and install
git clone https://github.com/udit-asopa/vision-text-extractor.git
cd vision-text-extractor
pixi install
# Quick demo
pixi run demo-ocr-huggingface
# Use with your images
python main.py path/to/your/image.jpg
python main.py "https://example.com/image.png"

For detailed guides and tutorials, visit our Wiki:
- 📋 Installation Guide - Complete setup for all providers
- 🚀 Quick Start Tutorial - Get started in 5 minutes
- 📊 Provider Comparison - Choose the right AI model
- 📄 Document Processing - Real-world examples
- ⚙️ Pixi Tasks Reference - All available commands
- Pixi package manager
- Python 3.10+ (managed by Pixi)
git clone https://github.com/udit-asopa/vision-text-extractor.git
cd vision-text-extractor
pixi install
pixi run setup

🟢 Local & Free (Recommended)
pixi run setup-smolvlm # Hugging Face SmolVLM (~2GB)
pixi run setup-ollama # Ollama LLaVA (~4GB)

🟡 Cloud & Paid (Highest Accuracy)
# Add your OpenAI key to .env file
echo "OPENAI_API_KEY=your_key_here" >> .env

# Extract text from any image
python main.py path/to/your/image.jpg
# Process web images
python main.py "https://example.com/document.png"
# Custom extraction prompt
python main.py receipt.jpg --prompt "Extract total amount and date"
# Try different providers
python main.py image.png --provider ollama --model llava:7b
python main.py image.png --provider openai --model gpt-4o

- 📄 Business Documents: Invoices, contracts, forms, receipts
- 🍽️ Food & Restaurants: Recipes, menus, nutrition labels
- 💰 Finance: Bank statements, tax documents, expense reports
- 📚 Education: Homework, research papers, lecture notes
- 🏥 Healthcare: Prescriptions, lab results, medical forms
See our Document Processing Tutorial for detailed examples.
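
The document types above differ mainly in the prompt you pass. As a minimal sketch, assuming only the --prompt, --provider, and --model flags shown earlier, the helper below builds a main.py command line for a given document type. The PROMPTS table and build_command function are hypothetical names used for illustration; they are not part of this project, and the prompt wording (other than the receipt example) is an assumption.

```python
import shlex

# Hypothetical prompt presets; adjust the wording to your own documents.
PROMPTS = {
    "receipt": "Extract total amount and date",
    "invoice": "Extract invoice number, vendor, and total amount",
    "prescription": "Extract medication names and dosages",
}


def build_command(image, doc_type=None, provider=None, model=None):
    """Return an argv list for main.py using only the documented flags."""
    cmd = ["python", "main.py", image]
    if doc_type is not None:
        cmd += ["--prompt", PROMPTS[doc_type]]
    if provider is not None:
        cmd += ["--provider", provider]
    if model is not None:
        cmd += ["--model", model]
    return cmd


if __name__ == "__main__":
    cmd = build_command("receipt.jpg", doc_type="receipt",
                        provider="ollama", model="llava:7b")
    # Print a copy-pasteable command; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps prompts with spaces intact when the list is handed to subprocess.run.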
# Demo with sample images
pixi run demo-ocr-huggingface # SmolVLM demo
pixi run demo-ocr-ollama # LLaVA demo
pixi run demo-ocr-openai # OpenAI demo
# Test your setup
pixi run test-setup # Validate installation
pixi run check-env # Check API keys
# Process your files
pixi run ocr_llm "my-image.jpg"
pixi run ocr_ollama "document.pdf"

Run the handwriting sample test to verify the SmolVLM transcription output.
- Using Pixi (recommended, ensures model setup):
pixi run test-handwriting
- Directly with Python:
python tests/test_handwriting_ocr.py

The test runs the SmolVLM pipeline against images/handwriting_sample.webp and checks the extracted text against the expected transcription. Use the Pixi command if you haven't run pixi run setup-smolvlm yet.
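
Model output rarely matches a reference transcription character for character, so a check like this usually needs some tolerance. The sketch below shows one way such a comparison could work, using Python's standard difflib. It is an assumption for illustration, not the logic in tests/test_handwriting_ocr.py; the function names and the 0.9 threshold are hypothetical.

```python
import difflib


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def similarity(extracted, expected):
    """Return a 0..1 similarity ratio between two transcriptions."""
    matcher = difflib.SequenceMatcher(None, normalize(extracted), normalize(expected))
    return matcher.ratio()


def matches(extracted, expected, threshold=0.9):
    """True if the extracted text is close enough to the expected text.

    The 0.9 default is an arbitrary illustrative value, not a project setting.
    """
    return similarity(extracted, expected) >= threshold
```

A stricter test would lower the tolerance or compare word by word; a looser one might only check that key phrases appear in the output.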
vision-text-extractor/
├── main.py # Main CLI application
├── agent/tools.py # OCR extraction tools
├── tests/ # Test scripts
├── images/ # Sample images
├── wiki_content/ # Documentation source
├── LICENSE # MIT License
└── pixi.toml # Dependencies & tasks
We're actively working on exciting new features! Here's what's planned:
- 📊 Batch Processing: Process multiple files in one command
- 🎯 Output Formats: JSON, CSV, XML structured output options
- 🔄 Result Caching: Skip reprocessing of identical images
- 📈 Progress Bars: Visual feedback for long operations
- 🧠 More AI Providers:
  - Google Gemini Vision
  - Anthropic Claude Vision
  - Local Qwen2-VL support
- 🎨 Image Preprocessing:
  - Auto-rotate, denoise, enhance quality
  - OCR confidence scoring
- 🔧 Advanced Tools:
  - Table structure extraction
  - Form field detection
  - Handwriting analysis mode
- 🔐 Enhanced Security: SOC2 compliance, audit logs
- ⚡ Performance: GPU optimization, model quantization
- 🌐 API Server: REST API for integration
- 📊 Analytics: Usage metrics and accuracy reporting
- 🤖 AI Agents: Multi-step document analysis workflows
- 🌍 Multi-language: Better support for non-English text
- 📱 Mobile App: Companion mobile application
- 🔌 Integrations: Direct cloud storage, CRM, ERP connections
Want to contribute? Check our Issues or suggest new features!
We welcome contributions! Please see our Wiki for development guides and check out existing Issues.
Ways to contribute:
- 🐛 Bug Reports: Found an issue? Let us know!
- 💡 Feature Requests: Suggest improvements
- 📝 Documentation: Help improve our wiki
- 🧪 Testing: Try new features and providers
- 💻 Code: Submit pull requests
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License Summary:
- ✅ Commercial use - Use in commercial projects
- ✅ Modification - Change and adapt the code
- ✅ Distribution - Share with others
- ✅ Private use - Use for personal projects
- ❓ Warranty - No warranty provided
- Local providers (SmolVLM, LLaVA): Your data never leaves your machine
- OpenAI provider: Data is sent to OpenAI's servers
- API keys: Never commit .env files to version control
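
As a rough sketch of the API key advice above, the helpers below parse .env-style text, mask a secret before it is printed, and check whether a .gitignore lists .env. All three functions are hypothetical and not part of this project. The ignore check is deliberately simplified: it only recognizes an exact .env or /.env line, not wildcard patterns.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mask(secret):
    """Hide all but the last four characters of a secret."""
    if len(secret) <= 4:
        return "*" * len(secret)
    return "*" * (len(secret) - 4) + secret[-4:]


def is_env_ignored(gitignore_text):
    """True if .gitignore contains an exact .env entry (simplified check)."""
    patterns = {line.strip() for line in gitignore_text.splitlines()}
    return ".env" in patterns or "/.env" in patterns
```

For example, reading the project's .env and .gitignore files and passing their contents to parse_env and is_env_ignored lets you confirm that OPENAI_API_KEY is set and that the file will not be committed, without ever printing the full key.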