High-performance asynchronous web scraper that extracts business intelligence from domains: metadata, emails, social media profiles, and the detected technology stack.
- 🚀 High Performance: 50 concurrent workers with asyncio
- 🔄 Queue-based Architecture: Redis for distributed task management
- 🗄️ Database Integration: MySQL for persistent storage
- 🌐 Browser Impersonation: Bypass bot detection with curl_cffi
- 🛡️ Robust Error Handling: Auto-reconnection and retry logic
- 📊 Real-time Progress Tracking: Monitor processing status
- 🔍 Technology Detection: Identifies CMS, frameworks, and marketing tools
- 📧 Contact Extraction: Emails and social media profiles
- 🏪 E-commerce Detection: Flags online stores automatically
| Data Type | Description |
|---|---|
| Status Code | HTTP response code |
| Title | Page title (cleaned & validated) |
| Description | Meta description (up to 500 chars) |
| Emails | Up to 5 validated emails (spam-filtered) |
| Social Media | Facebook, Instagram, LinkedIn, Twitter/X |
| Tech Stack | WordPress, Shopify, React, Vue, Angular, etc. |
| E-commerce | Automatically identifies online stores |
| Marketing Tools | Google Analytics, Facebook Pixel, TikTok Pixel, Hotjar, etc. |
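For illustration, one processed domain might yield a record like this (values are hypothetical; the keys mirror the MySQL columns defined later in this README):

```python
# Hypothetical example of one scraped record (values are illustrative only).
result = {
    "dominios": "example-store.com",
    "status_code": 200,
    "title": "Example Store | Handmade Goods",
    "description": "Shop handmade goods with free shipping...",
    "emails": ["info@example-store.com"],                 # up to 5, spam-filtered
    "socials": {"instagram": "https://instagram.com/examplestore"},
    "tech_stack": ["Shopify", "Klaviyo"],                 # CMS + marketing tools
    "is_ecommerce": 1,                                    # flagged as an online store
    "has_ads": 1,                                         # marketing pixel detected
    "last_checked": "2024-01-01 12:00:00",
}
```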
```
┌─────────────┐      ┌─────────────┐      ┌─────────────────┐
│    MySQL    │─────▶│    Redis    │─────▶│   50 Workers    │
│  (Domains)  │      │   (Queue)   │      │   (Async I/O)   │
└─────────────┘      └─────────────┘      └────────┬────────┘
                                                   │
                                                   ▼
                                          ┌─────────────────┐
                                          │      MySQL      │
                                          │    (Results)    │
                                          └─────────────────┘
```
Flow:
- `load_redis.py` loads unprocessed domains from MySQL into the Redis queue
- 50 async workers pull domains from Redis
- Each worker fetches, parses, and analyzes the domain
- Results are saved back to MySQL with retry logic (see the worker sketch below)
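To make the flow concrete, here is a minimal sketch of the worker loop, not the production code: `parse_page` and `save_result` are hypothetical stand-ins for the real parsing and persistence logic.

```python
# Minimal worker-pool sketch; parse_page/save_result are hypothetical stubs.
import asyncio
import redis.asyncio as aioredis
from curl_cffi.requests import AsyncSession

CONCURRENCY_LIMIT = 50
REDIS_KEY = "cola_dominios"

def parse_page(domain, resp):
    # Hypothetical: the real parser extracts title, emails, socials, tech stack.
    return {"dominios": domain, "status_code": resp.status_code}

async def save_result(result):
    # Hypothetical: the real code writes back to MySQL with retry logic.
    print(result)

async def worker(queue, session):
    while True:
        domain = await queue.rpop(REDIS_KEY)        # pull the next domain
        if domain is None:                          # queue drained: stop
            return
        try:
            resp = await session.get(f"https://{domain}",
                                     impersonate="chrome", timeout=15)
            await save_result(parse_page(domain, resp))
        except Exception as exc:                    # real code retries with backoff
            print(f"{domain}: {exc}")

async def main():
    queue = aioredis.Redis(host="redis", decode_responses=True)
    async with AsyncSession() as session:
        await asyncio.gather(*(worker(queue, session)
                               for _ in range(CONCURRENCY_LIMIT)))

if __name__ == "__main__":
    asyncio.run(main())
```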
This project runs entirely in Docker containers for easy deployment and isolation.
- Docker Engine 20.10+
- Docker Compose 2.0+
- 2GB RAM minimum
- 10GB disk space
- Clone the repository
  ```bash
  git clone https://github.com/yourusername/web-intelligence-scraper.git
  cd web-intelligence-scraper
  ```
- Configure environment
  ```bash
  cp .env.example .env
  # Edit .env with your database credentials
  ```
- Start Docker services
  ```bash
  docker-compose up -d
  ```
- Verify services are running
  ```bash
  docker-compose ps
  ```
- Load domains into Redis queue
  ```bash
  docker exec -it scraper-app python scripts/load_redis.py
  ```
- Start processing
  ```bash
  docker exec -it scraper-app python scripts/main.py
  ```

Scraper settings:

```python
CONCURRENCY_LIMIT = 50    # Number of parallel workers
REQUEST_TIMEOUT = 15 # HTTP request timeout (seconds)
HARD_TIMEOUT = 45 # Maximum time per domain
MAX_DOWNLOAD_SIZE = 3 * 1024 * 1024  # Skip pages larger than 3 MB
```

Edit the configuration in both scripts:

```python
DB_CONFIG = {
'host': 'db',
'user': 'root',
'password': 'your_password',
'database': 'your_db_name'
}
TABLE_NAME = 'your_tableName'
REDIS_KEY = 'cola_dominios'
```

Performance:

- Speed: ~1,000-2,000 domains/hour (network dependent)
- Concurrency: 50 workers processing in parallel
- Memory: ~500MB-1GB RAM usage
- Efficiency: Async I/O prevents blocking
- Reliability: Auto-retry on failures
Tuning tips:

- Increase `CONCURRENCY_LIMIT` for faster processing (requires more RAM)
- Adjust `REQUEST_TIMEOUT` based on target sites
- Use SSD storage for better MySQL performance
- Deploy on cloud with good network connectivity
Required MySQL table structure:
```sql
CREATE TABLE your_table (
id INT PRIMARY KEY AUTO_INCREMENT,
dominios VARCHAR(255) NOT NULL,
status_code INT DEFAULT 0,
title VARCHAR(255),
description TEXT,
emails JSON,
socials JSON,
tech_stack JSON,
is_ecommerce TINYINT(1) DEFAULT 0,
has_ads TINYINT(1) DEFAULT 0,
last_checked DATETIME,
INDEX idx_status (status_code),
INDEX idx_ecommerce (is_ecommerce),
INDEX idx_ads (has_ads)
);
```
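With this table in place, the loader only has to move unprocessed domains into Redis. A conceptual sketch of what `scripts/load_redis.py` does, assuming the `DB_CONFIG`, `TABLE_NAME`, and `REDIS_KEY` values from the configuration section (the real script's details may differ):

```python
# Conceptual sketch of the queue loader; the real script may differ in details.
import mysql.connector
import redis

DB_CONFIG = {"host": "db", "user": "root",
             "password": "your_password", "database": "your_db_name"}
TABLE_NAME = "your_table"
REDIS_KEY = "cola_dominios"

def load_queue():
    db = mysql.connector.connect(**DB_CONFIG)
    cursor = db.cursor()
    # status_code = 0 marks domains that have not been processed yet
    cursor.execute(f"SELECT dominios FROM {TABLE_NAME} WHERE status_code = 0")
    domains = [row[0] for row in cursor.fetchall()]
    cursor.close()
    db.close()

    r = redis.Redis(host="redis", decode_responses=True)
    if domains:
        r.lpush(REDIS_KEY, *domains)   # push the whole batch onto the queue
    print(f"Queued {len(domains)} domains")

if __name__ == "__main__":
    load_queue()
```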
```bash
# 1. Load domains from MySQL to Redis
docker exec -it scraper-app python scripts/load_redis.py
# 2. Start scraping
docker exec -it scraper-app python scripts/main.py
# 3. Monitor progress in real-time
# Logs show progress every 100 domains
```

```bash
# Check Redis queue size
docker exec -it redis redis-cli LLEN cola_dominios
# Check processed count
docker exec -it mysql mysql -u root -p -e \
"SELECT COUNT(*) FROM db.table WHERE status_code > 0"
# View successful scrapes
docker exec -it mysql mysql -u root -p -e \
"SELECT dominios, title, is_ecommerce FROM db.table WHERE status_code = 200 LIMIT 10"
# Check for errors
docker exec -it mysql mysql -u root -p -e \
"SELECT status_code, COUNT(*) as count FROM db.table GROUP BY status_code"-- Reset failed domains for retry
UPDATE db.table SET status_code = 0 WHERE status_code != 200;
```

Then reload the queue and restart the scraper.
The scraper includes comprehensive error handling:
- ✅ Auto-reconnection: Redis and MySQL connections auto-recover
- ✅ Exponential backoff: Gradual retry delays (sketched after this list)
- ✅ Graceful degradation: Workers continue on partial failures
- ✅ Email filtering: Removes spam/placeholder emails
- ✅ Social media validation: Filters share buttons
- ✅ Domain format checking: Validates before processing
- ✅ Title/description cleaning: Removes errors and invalid content
- ✅ Timeout protection: Hard limits prevent hanging
- ✅ Memory cleanup: Explicit deletion of large objects
- ✅ Connection pooling: Efficient resource usage
- ✅ Graceful shutdown: Proper cleanup on stop
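The backoff behavior can be pictured with a small generic helper; this is an illustrative sketch, not the scraper's actual code:

```python
# Generic exponential-backoff helper; a sketch, not the scraper's actual code.
import asyncio
import random

async def with_retries(coro_factory, attempts=4, base_delay=1.0):
    """Run coro_factory(), retrying on failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise                                  # out of retries
            # 1s, 2s, 4s, ... plus jitter so workers don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)

# Usage: await with_retries(lambda: save_result(result))
```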
```bash
# Reset 100 domains for testing
docker exec -it mysql mysql -u root -p webs -e \
"UPDATE españa2 SET status_code = 0 LIMIT 100"
# Load and process
docker exec -it scraper-app python scripts/load_redis.py
docker exec -it scraper-app python scripts/main.py
```

```sql
-- Check results
SELECT
status_code,
COUNT(*) as count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM table
WHERE last_checked > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY status_code;
```

| Component | Technology |
|---|---|
| Language | Python 3.9+ |
| Async Framework | asyncio |
| HTTP Client | curl-cffi (browser impersonation) |
| HTML Parser | BeautifulSoup4 |
| Queue | Redis |
| Database | MySQL 8.0 |
| Containerization | Docker & Docker Compose |
```
beautifulsoup4>=4.12.0
curl-cffi>=0.6.0
mysql-connector-python>=8.2.0
redis>=5.0.0
lxml>=4.9.0
```

The scraper identifies (see the detection sketch after this list):
- WordPress, Shopify, PrestaShop, Wix, Squarespace, Magento, Joomla, Drupal
- React, Vue.js, Angular, Next.js, Nuxt.js
- Bootstrap, Tailwind CSS
- Google Analytics, Facebook Pixel, Google Ads, TikTok Pixel, Hotjar, Klaviyo
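Detection of this kind typically reduces to scanning the HTML for known fingerprints. A simplified sketch of the approach (the markers shown are common, publicly documented fingerprints; the real detector checks many more):

```python
# Simplified fingerprint-based detection; the real detector is more thorough.
TECH_MARKERS = {
    "WordPress": ["wp-content", "wp-includes"],
    "Shopify": ["cdn.shopify.com", "Shopify.theme"],
    "React": ["data-reactroot", "__NEXT_DATA__"],   # __NEXT_DATA__ implies Next.js
    "Google Analytics": ["googletagmanager.com", "gtag("],
    "Facebook Pixel": ["connect.facebook.net", "fbq("],
}

def detect_tech(html: str) -> list[str]:
    """Return the technologies whose markers appear in the page source."""
    return [tech for tech, markers in TECH_MARKERS.items()
            if any(marker in html for marker in markers)]

# Usage: detect_tech(resp.text) might return ["Shopify", "Facebook Pixel"]
```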
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 guidelines
- Add docstrings to functions
- Include type hints where appropriate
- Write descriptive commit messages
This project is licensed under the MIT License - see the LICENSE file for details.
This tool is for educational and research purposes. Always:
- ✅ Respect `robots.txt` directives (see the sketch after this list)
- ✅ Follow website terms of service
- ✅ Implement appropriate rate limiting
- ✅ Use responsibly and ethically
- ❌ Don't use for unauthorized data harvesting
- ❌ Don't overload target servers
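For the `robots.txt` point, Python's standard library already includes a parser; a minimal sketch of a pre-flight check:

```python
# Minimal robots.txt check using the standard library.
from urllib.robotparser import RobotFileParser

def is_allowed(domain: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt permits fetching the domain's root page."""
    parser = RobotFileParser(f"https://{domain}/robots.txt")
    try:
        parser.read()                      # fetches and parses robots.txt
    except OSError:
        return True                        # unreachable robots.txt: assume allowed
    return parser.can_fetch(user_agent, f"https://{domain}/")
```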
For issues and questions:
- 🐛 Open an Issue
- 💬 Join Discussions
- 📧 Email: contacto@rainyisdev.cc
- Export results to CSV/JSON
- Web dashboard for real-time monitoring
- API endpoint for on-demand queries
- Multi-language content detection
- Machine learning for site classification
- Screenshot capture capability
- WHOIS data integration
- Sitemap parsing
- Robots.txt compliance checker
Built with:
- curl-cffi for browser impersonation
- BeautifulSoup for HTML parsing
- Redis for queue management
- MySQL for data persistence
Made with ❤️ for data intelligence and web research
Star ⭐ this repo if you find it useful!