A comprehensive, production-ready toolkit for downloading, extracting, parsing, and validating West Bengal electoral roll data. It works from the 2002 base electoral roll used for the Special Intensive Revision (SIR) 2025 ahead of the Bidhansabha (Legislative Assembly) Election 2026.
This project provides automated tools for processing West Bengal electoral data:
- 📥 Download electoral roll PDFs from CEO West Bengal
- 📄 Extract voter information from PDFs (handling CID font encoding)
- 🔍 Parse and structure voter data (Name, Age, EPIC, Address, Gender, etc.)
- ✅ Validate data against official ECI Gateway API
- 💾 Store data in structured formats (SQLite, JSON, YAML, CSV, Excel)
- 🌐 Web interface for browsing, searching, and validation
The official CEO West Bengal website presents several challenges for data access:
- CAPTCHA Barriers 🔒
  - Official website requires CAPTCHA for each PDF download
  - This tool bypasses CAPTCHA by directly accessing PDF URLs
  - Enables automated batch downloading of hundreds of PDFs
- Unsearchable PDFs 🚫
  - PDFs on the official website contain garbled/encoded text
  - Standard Ctrl+F search doesn't work on these PDFs
  - Text appears as `*RXWDP%DQHUMHH` instead of `JOHN DOE`
- CID Font Decoding 🔤
  - This tool implements comprehensive CID (Character Identifier) font mapping
  - Converts garbled characters to readable text automatically
  - Makes data searchable, sortable, and analyzable
- Data Validation ✅
  - Cross-verifies extracted voter data against official ECI Gateway API
  - Ensures data accuracy and completeness
  - Validates EPIC numbers, names, and demographic information
  - Uses fuzzy matching (95% threshold) for name comparison
  - See `src/validator.py` for implementation details
- Structured Data Export 💾
  - Converts unstructured PDF data into structured formats
  - Exports to SQLite, JSON, CSV, Excel for easy analysis
  - Maintains data integrity throughout the pipeline
In short: This tool makes publicly available electoral data actually usable for research, analysis, and transparency initiatives.
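To make the CID decoding idea concrete, here is a minimal sketch. It assumes the garbling is a constant code-point shift, with an optional lookup table for glyphs that do not follow the shift. The function name `decode_shifted`, its parameters, and the default `offset` of 29 are illustrative assumptions, not taken from this project; the real mapping is in `src/extractor.py` and may work differently.

```python
from typing import Dict, Optional


def decode_shifted(text: str, offset: int = 29,
                   overrides: Optional[Dict[str, str]] = None) -> str:
    """Map each garbled character back to a readable one.

    Characters listed in `overrides` are replaced from that table;
    every other character is shifted up by `offset` code points.
    Both the offset and the table are assumptions for illustration.
    """
    table = overrides or {}
    return "".join(table.get(ch, chr(ord(ch) + offset)) for ch in text)


if __name__ == "__main__":
    # "JOHN DOE" with every character shifted down by 29 code points.
    garbled = "-2+1\x03'2("
    print(decode_shifted(garbled))  # prints: JOHN DOE
```

A lookup table is the safer design when a font subset does not follow a uniform shift; the `overrides` argument leaves room for that without changing the call site.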
- Base Electoral Roll: 2002 (All extracted data is from 2002 base electoral roll)
- Revision Type: Special Intensive Revision (SIR) 2025
- Target Election: Bidhansabha (Legislative Assembly) Election 2026
- Coverage: 21 Districts, 294 Assembly Constituencies, 61,531+ Polling Parts
- Official Sources: CEO West Bengal & Election Commission of India (ECI)
- ✅ 229,487 voters extracted from AC_139 (Belgachia East)
- ✅ 61,531 polling parts cataloged across West Bengal
- ✅ 294 assembly constituencies mapped
- ✅ 21 districts covered
- ✅ 99.99% extraction accuracy with robust error handling
- ✅ 100% API success rate for metadata collection
- ✅ ~5,100 voters/minute extraction speed
For detailed analytics, see ANALYTICS.md
| Document | Description |
|---|---|
| SETUP_GUIDE.md | Complete setup instructions, prerequisites, troubleshooting |
| VALIDATION_REPORT.md | PDF download validation report (100% complete - 7,936 PDFs) |
| PDF_STRUCTURE.md | PDF organization structure and naming conventions |
| ANALYTICS.md | Detailed statistics, performance metrics, data quality analysis |
| API_VERIFICATION_GUIDE.md | API verification system usage and reference |
| IMPLEMENTATION_SUMMARY.md | Implementation details and feature overview |
| REPOSITORY_SUMMARY.md | Technical overview, architecture, and codebase structure |
| API.md | ECI Gateway API reference and usage guide |
| PDF_FORMAT.md | PDF structure, CID font decoding, extraction techniques |
| LICENSE | MIT License with data privacy notice |
| Metric | Value | Coverage |
|---|---|---|
| PDFs Downloaded | 353 | 100% |
| Voters Extracted | 229,487 | - |
| Age Data | 229,464 | 99.99% |
| EPIC Numbers | 190,195 | 82.88% |
| Gender Data | 229,487 | 100% |
| Address Data | 228,900 | 99.74% |
| Extraction Accuracy | - | 99.99% |
| Category | Count | Status |
|---|---|---|
| Districts | 21 | ✅ All cataloged |
| Assembly Constituencies | 294 | ✅ All mapped |
| Polling Parts | 61,531+ | ✅ Complete metadata |
| Voters (AC_139 only) | 229,487 | ✅ Extracted & validated |
- ⚡ Extraction Speed: ~5,100 voters/minute
- 💯 API Success Rate: 100% (316 API calls)
- 🎯 Data Integrity: 0% corruption, full validation
- 🔧 Processing Time: ~7.6 seconds per PDF
Based on AC_139 statistics:
- Full West Bengal: ~67 million voters (estimated)
- Total Storage: ~200 GB for complete dataset
- Processing Time: ~220 hours for all 294 assemblies
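The projection above can be reproduced with simple arithmetic. This sketch assumes every assembly constituency is about the size of AC_139, which is a strong simplification; the function name `project_statewide` is hypothetical, and the inputs are the figures quoted in this README.

```python
AC_139_VOTERS = 229_487      # voters extracted from AC_139
TOTAL_ACS = 294              # assembly constituencies in West Bengal
VOTERS_PER_MINUTE = 5_100    # approximate extraction speed


def project_statewide(voters_per_ac: int, total_acs: int,
                      voters_per_minute: int):
    """Return (estimated total voters, estimated processing hours)."""
    total_voters = voters_per_ac * total_acs
    hours = total_voters / voters_per_minute / 60
    return total_voters, hours


if __name__ == "__main__":
    voters, hours = project_statewide(AC_139_VOTERS, TOTAL_ACS, VOTERS_PER_MINUTE)
    # prints: 67,469,178 voters, about 220 hours
    print(f"{voters:,} voters, about {hours:.0f} hours")
```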
For complete analytics including data quality analysis, demographic insights, and technical metrics, see ANALYTICS.md
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐
│  CEO WB Website │────▶│ PDF Download │────▶│ Text Extract│
└─────────────────┘     └──────────────┘     └─────────────┘
                                                    │
                                                    ▼
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐
│   ECI Gateway   │◀────│  Validation  │◀────│   Parsing   │
│       API       │     │    Layer     │     │             │
└─────────────────┘     └──────────────┘     └─────────────┘
                                                    │
                                                    ▼
                                             ┌─────────────┐
                                             │  JSON/YAML  │
                                             │   Storage   │
                                             └─────────────┘
- Smart PDF Downloading: Batch download with resume capability
- CID Font Decoding: Handles special character encoding in PDFs
- Robust Parsing: Extracts 19+ fields per voter record
- API Validation: Cross-verify with official ECI data
- Web Dashboard: Interactive interface for data exploration
- Modular Design: Easy to extend and customize
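The API validation step is described in this README as using fuzzy name matching with a 95% threshold. The sketch below shows one way such a comparison could look, using the standard library's `difflib.SequenceMatcher` as a stand-in; the README does not say which matching library `src/validator.py` uses, and the helper names `normalize` and `names_match` are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Upper-case and collapse whitespace so formatting differences don't count."""
    return " ".join(name.upper().split())


def names_match(extracted: str, official: str, threshold: float = 0.95) -> bool:
    """True when the similarity ratio of the two names is at least `threshold`."""
    ratio = SequenceMatcher(None, normalize(extracted), normalize(official)).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(names_match("john  doe", "JOHN DOE"))  # identical after normalization
    print(names_match("JON DOE", "JOHN DOE"))    # one letter missing in a short name
```

With a threshold this strict, a single missing letter in a short name is enough to fail the match, so flagged records are best treated as candidates for manual review rather than as confirmed errors.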
The `data/` directory contains voter and personal information and is NOT committed to Git.
All downloaded PDFs, API metadata, and databases are stored in organized subdirectories:
- `data/downloaded_pdfs/ALL/` - Electoral roll PDFs from CEO West Bengal
- `data/api_metadata/api/` - JSON files with district/AC/part metadata
- `data/reference_docs/other-pdfs/` - Official forms and reference documents
- `data/electoral_roll.db` - SQLite database with extracted voter data
See data/README.md for complete documentation.
These directories are protected by .gitignore to prevent accidental commits of sensitive data.
# Clone the repository
git clone https://github.com/partha-dhar/wb-electoral-data.git
cd wb-electoral-data
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Copy example config
cp config/config.example.yaml config/config.yaml
# Edit configuration (optional)
nano config/config.yaml
New Universal Download Script (Recommended):
# List all available districts
./scripts/download_electoral_rolls.sh --list-districts
# List ACs in a district
./scripts/download_electoral_rolls.sh --list-acs 13
# Download entire district
./scripts/download_electoral_rolls.sh --district 13
# Download specific AC
./scripts/download_electoral_rolls.sh --district 13 --ac 146
# Example: Download Kolkata South
./scripts/download_electoral_rolls.sh --district 13
# Downloads 1,366 PDFs for 7 ACs in organized structure
Legacy Python Scripts (Still available):
# Download for specific Assembly Constituency
python scripts/download_pdfs.py --ac 139
# Download for entire district
python scripts/download_pdfs.py --district 10
# Download all West Bengal
python scripts/download_pdfs.py --all
Validate PDFs against metadata (checks existence, readability, completeness):
# Validate all downloaded districts
./scripts/validate_downloads.sh
# Check validation results
cat docs/VALIDATION_REPORT.md
Current Status:
- ✅ District 9 (North 24 Parganas): 6,570 PDFs - 100% validated
- ✅ District 13 (Kolkata South): 1,366 PDFs - 100% validated
- ✅ Total: 7,936 PDFs across 35 ACs
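The three checks named above (existence, readability, completeness) can be sketched in a few lines. This is not the logic of `validate_downloads.sh`; it is a minimal illustration that treats a file as readable when it begins with the PDF magic bytes `%PDF-`. The function name `check_pdfs` and the `part_{:04d}.pdf` file-name pattern are hypothetical; see PDF_STRUCTURE.md for the real naming conventions.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def check_pdfs(pdf_dir: Path, expected_parts: Iterable[int],
               pattern: str = "part_{:04d}.pdf") -> Dict[str, List[int]]:
    """Classify each expected part number as ok, missing, or unreadable.

    The magic-byte test is a cheap sanity check, not a full PDF parse.
    The file-name pattern is an assumption made for this example.
    """
    report: Dict[str, List[int]] = {"ok": [], "missing": [], "unreadable": []}
    for part in expected_parts:
        path = pdf_dir / pattern.format(part)
        if not path.is_file():
            report["missing"].append(part)
            continue
        with path.open("rb") as fh:
            header = fh.read(5)
        report["ok" if header == b"%PDF-" else "unreadable"].append(part)
    return report


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "part_0001.pdf").write_bytes(b"%PDF-1.4\n")
        (folder / "part_0002.pdf").write_bytes(b"<html>error</html>")
        # prints: {'ok': [1], 'missing': [3], 'unreadable': [2]}
        print(check_pdfs(folder, [1, 2, 3]))
```

The `missing` and `unreadable` lists double as a retry queue, which is one simple way to get resume behaviour for interrupted batch downloads.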
# Extract from specific AC
python scripts/extract_voters.py --ac 139
# Batch extract all downloaded PDFs
python scripts/extract_voters.py --batch
# Extract specific district
python scripts/extract_voters.py --district 13
# Validate extracted data against ECI API
python scripts/validate_data.py --ac 139
# Generate validation report
python scripts/validate_data.py --ac 139 --report
# Verify extracted voters using ECI's individual voter API
# Note: API has incomplete data, used only for verification
python scripts/verify_with_api.py --ac 139
# Custom delay between API calls (default: 0.5s)
python scripts/verify_with_api.py --ac 139 --delay 1.0
# Set a custom batch size
python scripts/verify_with_api.py --ac 139 --batch-size 50
Why API Verification?
- The ECI API endpoint (`get-eroll-data-2003`) returns individual voter data
- No Bearer token required; it works with basic headers only
- Used for verification purposes as API data is incomplete (missing names)
- Primary method remains PDF extraction for complete data
- Verification results stored in SQLite database
- Rate limited: 50 requests/second (we use 2/second to be safe)
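A delay of 0.5 seconds between calls is what keeps the verification at roughly 2 requests per second. The sketch below shows that throttling pattern in isolation; `throttled_map` and `fake_lookup` are hypothetical names, the lookup is a stand-in rather than a real API call, and `verify_with_api.py` may be structured differently.

```python
import time
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def throttled_map(items: Iterable[T], call: Callable[[T], R],
                  delay: float = 0.5,
                  sleep: Callable[[float], None] = time.sleep) -> Dict[T, R]:
    """Apply `call` to each item, pausing `delay` seconds between calls."""
    results: Dict[T, R] = {}
    for index, item in enumerate(items):
        if index:  # no pause before the first request
            sleep(delay)
        results[item] = call(item)
    return results


def fake_lookup(epic: str) -> dict:
    """Stand-in for a real API lookup; always reports the voter as found."""
    return {"epic_no": epic, "found": True}


if __name__ == "__main__":
    print(throttled_map(["WB/12/345/678901"], fake_lookup, delay=0.5))
```

Passing `sleep` as a parameter keeps the function testable without actually waiting.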
# Start web server
python web/app.py
# Access at http://localhost:5000
wb-electoral-data/
├── src/
│ ├── __init__.py
│ ├── downloader.py # PDF downloading logic
│ ├── extractor.py # Text extraction with CID decoding
│ ├── parser.py # Voter data parsing
│ ├── validator.py # ECI API validation
│ ├── storage.py # JSON/YAML storage & SQLite DB
│ └── utils.py # Helper functions
├── scripts/
│ ├── download_pdfs.py # CLI for downloading
│ ├── extract_voters.py # CLI for extraction
│ ├── validate_data.py # CLI for validation
│ ├── verify_with_api.py # API verification (NEW!)
│ └── fetch_metadata.py # Fetch district/AC metadata
├── web/
│ ├── app.py # Flask web application
│ ├── templates/ # HTML templates
│ └── static/ # CSS, JS, images
├── config/
│ ├── config.yaml # Main configuration
│ └── config.example.yaml
├── docs/
│ ├── SETUP_GUIDE.md # Setup instructions
│ ├── ANALYTICS.md # Statistics and analytics
│ ├── API_VERIFICATION_GUIDE.md # API verification guide
│ ├── API.md # API documentation
│ ├── PDF_FORMAT.md # PDF structure details
│ └── REPOSITORY_SUMMARY.md # Architecture overview
├── data/ # Data directory (NOT in Git)
│ ├── downloaded_pdfs/ # Electoral roll PDFs
│ │ └── ALL/ # Organized by District/AC/Part
│ ├── api_metadata/ # API metadata JSON files
│ │ └── api/ # District/AC metadata
│ ├── reference_docs/ # Official forms and docs
│ │ └── other-pdfs/ # Reference PDFs
│ ├── electoral_roll.db # SQLite database
│ └── README.md # Data directory documentation
├── tests/
│ ├── test_downloader.py
│ ├── test_extractor.py
│ └── test_parser.py
├── .gitignore
├── requirements.txt
├── setup.py
├── LICENSE
└── README.md
Edit config/config.yaml:
# Data directories (excluded from git)
data_dir: ./data
pdf_dir: ./data/pdfs
output_dir: ./data/output
# API settings
eci_api:
  base_url: https://gateway-voters.eci.gov.in/api/v1/citizen/sir
  state_code: S25
  timeout: 30
# Download settings
download:
  concurrent: 5
  retry_attempts: 3
  delay: 1.0
{
  "metadata": {
    "ac_number": 139,
    "ac_name": "BELGACHIA EAST",
    "part_number": 1,
    "extraction_date": "2025-11-10",
    "total_voters": 650
  },
  "voters": [
    {
      "serial_no": 1,
      "name": "JOHN DOE",
      "relative_name": "ROBERT DOE",
      "relation_type": "Father",
      "house_no": "123/A",
      "age": 45,
      "gender": "M",
      "epic_no": "WB/12/345/678901"
    }
  ]
}
The web interface provides:
- Browse: Navigate districts → assemblies → parts
- Search: Find voters by name, EPIC, or address
- Validate: Check against official ECI API
- Export: Download data in JSON/CSV/Excel
- Statistics: Visualize voter demographics
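The Export option can also be approximated offline. The sketch below flattens a document in the JSON output format shown in this README into CSV text using only the standard library. The function name `voters_to_csv` and the column order are assumptions made for this example; the web interface's own export code may differ.

```python
import csv
import io
import json
from typing import List

FIELDS: List[str] = ["serial_no", "name", "relative_name", "relation_type",
                     "house_no", "age", "gender", "epic_no"]


def voters_to_csv(document: dict) -> str:
    """Flatten the "voters" list into CSV text, adding AC and part columns."""
    meta = document["metadata"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=["ac_number", "part_number"] + FIELDS,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for voter in document["voters"]:
        row = {"ac_number": meta["ac_number"],
               "part_number": meta["part_number"]}
        row.update(voter)  # missing fields are written as empty cells
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = json.loads("""{
      "metadata": {"ac_number": 139, "part_number": 1},
      "voters": [{"serial_no": 1, "name": "JOHN DOE", "age": 45,
                  "gender": "M", "epic_no": "WB/12/345/678901"}]
    }""")
    print(voters_to_csv(sample))
```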
Usage Guidelines:
- ✅ Educational and research purposes
- ✅ Electoral analysis and demographic studies
- ✅ Transparency and accountability initiatives
- ❌ Commercial exploitation without authorization
- ❌ Voter privacy violations
- ❌ Unauthorized data selling or sharing
Compliance:
- Follow Election Commission of India (ECI) guidelines
- Respect data protection regulations
- Use data responsibly and ethically
- Maintain data accuracy and integrity
Contributions are welcome! We appreciate help with:
- 🐛 Bug fixes and error handling improvements
- 📚 Documentation enhancements
- ✨ New features and functionality
- 🧪 Test coverage and quality assurance
- 🌐 Internationalization and localization
Before contributing, please:
- Read SETUP_GUIDE.md for development setup
- Review GitHub Copilot Instructions for project guidelines and best practices
- Check existing issues and pull requests
- Follow the coding standards (PEP 8, Black formatting)
- Add tests for new features
- Update documentation accordingly
Contribution Process:
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request with detailed description
| Category | Technologies |
|---|---|
| Language | Python 3.8+ |
| PDF Processing | pdfplumber, PyPDF2 |
| Web Framework | Flask, Jinja2 |
| Database | SQLite3, pandas |
| HTTP Client | requests, urllib3 |
| Data Formats | JSON, YAML, CSV, Excel |
| Testing | pytest, unittest |
| Deployment | Docker-ready, WSGI compatible |
This toolkit is valuable for:
- Electoral Research: Analyze voter demographics and distribution patterns
- Academic Studies: Research on electoral behavior and political geography
- Civic Tech Projects: Build transparency and accountability tools
- Data Journalism: Investigate electoral trends and anomalies
- Government Analytics: Support policy planning and resource allocation
- NGO Initiatives: Monitor electoral processes and voter registration drives
# Quick start with setup script
./setup.sh
source venv/bin/activate
python web/app.py
- Web Server: Nginx + Gunicorn
- Database: PostgreSQL (for larger datasets)
- Caching: Redis
- Monitoring: Prometheus + Grafana
- Containerization: Docker + Docker Compose
See SETUP_GUIDE.md for detailed deployment instructions.
This project is licensed under the MIT License - see LICENSE file for details.
Additional Terms:
- Data sourced from public government websites (CEO West Bengal, ECI)
- Use responsibly and comply with local regulations
- Attribution required for derivative works
- No warranty provided - use at your own risk
- Election Commission of India (ECI) - For providing electoral data and API access
- CEO West Bengal - For maintaining the electoral roll website and PDF archives
- Python Community - For excellent libraries (pdfplumber, Flask, pandas)
- Contributors - Everyone who helps improve this toolkit
- Open Source Community - For inspiration and best practices
- 📖 Documentation: Check SETUP_GUIDE.md for setup issues
- 🐛 Bug Reports: Open an issue on GitHub with detailed description
- 💡 Feature Requests: Submit an issue with "enhancement" label
- 💬 Discussions: Use GitHub Discussions for questions and ideas
- ⭐ Star this repository to get updates
- 👁️ Watch releases for new versions
- 🍴 Fork to create your own customizations
- CEO West Bengal - Electoral Roll Website
- ECI Gateway - Official Voter Search API
- Election Commission of India - National Electoral Body
- Setup Guide - Installation and configuration
- Analytics Report - Data statistics and insights
- API Documentation - ECI Gateway API reference
- PDF Format Guide - PDF structure and decoding
- Repository Summary - Technical overview
- Electoral roll analysis tools for other Indian states
- Voter registration verification systems
- Political boundary mapping tools
- Demographic analysis frameworks
West Bengal Electoral Data Extraction Tool
Empowering transparency and research through open electoral data
Disclaimer: This tool processes publicly available electoral roll data for educational, research, and transparency purposes. Users must comply with Election Commission of India guidelines and local data protection regulations. The authors are not responsible for misuse of this tool or the data it processes. Always verify data accuracy from official sources.