CeWLio is a powerful, Python-based Custom Word List Generator inspired by the original CeWL by Robin Wood. While CeWL is excellent for static HTML content, CeWLio brings modern web scraping capabilities to handle today's JavaScript-heavy websites. It crawls web pages, executes JavaScript, and extracts:
- Unique words (with advanced filtering)
- Email addresses
- Metadata (description, keywords, author)
Perfect for penetration testers, security researchers, and anyone needing high-quality, site-specific wordlists!
AI-Assisted Development: This project was created with the help of AI tools, but solves real-world problems in web scraping and word list generation.
- JavaScript-Aware Extraction: Uses a headless browser to render pages and extracts content after JavaScript has executed.
- Modern Web Support: Handles Single Page Applications (SPAs), infinite scroll, lazy loading, and dynamic content that traditional scrapers miss.
- Advanced Word Processing:
  - Minimum/maximum word length filtering
  - Lowercase conversion
  - Alphanumeric or alpha-only words
  - Umlaut conversion (ä→ae, ö→oe, ü→ue, ß→ss)
  - Word frequency counting
- Word Grouping: Generate multi-word phrases (e.g., 2-grams, 3-grams)
- Email & Metadata Extraction: Find emails from content and mailto links, extract meta tags
- Flexible Output: Save words, emails, and metadata to separate files or stdout
- Professional CLI: All features accessible via command-line interface with CeWL-compatible flags
- Silent Operation: Runs quietly by default, with optional debug output
- Comprehensive Testing: 100% test coverage
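The word-processing and grouping features listed above can be pictured with a small standalone sketch. This is not CeWLio's actual implementation: the helper names `normalize_words` and `word_groups`, the tokenizing regex, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration; not part of the cewlio package.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def normalize_words(text, min_length=3, max_length=None,
                    lowercase=False, with_numbers=False,
                    convert_umlauts=False):
    """Return the filtered words of `text` in order of appearance."""
    if convert_umlauts:
        text = text.translate(UMLAUT_MAP)
    if lowercase:
        text = text.lower()
    # Runs of letters/digits; punctuation and underscores act as separators.
    tokens = re.findall(r"[^\W_]+", text)
    words = []
    for token in tokens:
        if not with_numbers and any(ch.isdigit() for ch in token):
            continue  # alpha-only mode: skip tokens containing digits
        if len(token) < min_length:
            continue
        if max_length is not None and len(token) > max_length:
            continue
        words.append(token)
    return words


def word_groups(words, size):
    """Return consecutive `size`-word phrases (n-grams)."""
    return [" ".join(words[i:i + size])
            for i in range(len(words) - size + 1)]


sample = "Größe matters: the größe of the list, v2 included."
words = normalize_words(sample, lowercase=True, convert_umlauts=True)
counts = Counter(words)        # word frequency counting
pairs = word_groups(words, 2)  # 2-grams built from the filtered words
```

Note that this sketch builds groups from the already-filtered word list, so a phrase can span a word that was dropped by the length filter. Whether CeWLio behaves the same way is not stated here.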
Install from PyPI:
pip install cewlio

Or install from source:
git clone https://github.com/0xCardinal/cewlio
cd cewlio
pip install -e .

Requirements:
- Python 3.12+
- Playwright (for browser automation)
- BeautifulSoup4 (for HTML parsing)
- Requests (for HTTP handling)
Note: After installing Playwright, you only need to install the chromium-headless-shell browser:
playwright install chromium-headless-shell

# Extract words from a website (silent by default)
cewlio https://example.com
# Save words to a file
cewlio https://example.com --output wordlist.txt
# Include emails in stdout output
cewlio https://example.com -e
# Include metadata in stdout output
cewlio https://example.com -a
# Save emails and metadata to files
cewlio https://example.com --email_file emails.txt --meta_file meta.txt

Generate word groups with counts:
cewlio https://example.com --groups 3 -c --output phrases.txt

Custom word filtering:
cewlio https://example.com -m 4 --max-length 12 --lowercase --convert-umlauts

Handle JavaScript-heavy sites:
cewlio https://example.com -w 5 --visible

Extract only emails and metadata (no words):
cewlio https://example.com -e -a

Extract only emails (no words):
cewlio https://example.com -e

Extract only metadata (no words):
cewlio https://example.com -a

Save emails to file (no words to stdout):
cewlio https://example.com --email_file emails.txt

| Option | Description | Default |
|---|---|---|
| url | URL to process | Required |
| --version | Show version and exit | - |
| --output | Output file for words | stdout |
| -e, --email | Include email addresses in stdout output | False |
| --email_file | Output file for email addresses | - |
| -a, --meta | Include metadata in stdout output | False |
| --meta_file | Output file for metadata | - |
| -m, --min_word_length | Minimum word length | 3 |
| --max-length | Maximum word length | No limit |
| --lowercase | Convert words to lowercase | False |
| --with-numbers | Include words with numbers | False |
| --convert-umlauts | Convert umlaut characters | False |
| -c, --count | Show word counts | False |
| --groups | Generate word groups of specified size | - |
| -w, --wait | Wait time for JavaScript execution (seconds) | 0 |
| --visible | Show browser window | False |
| --timeout | Browser timeout (milliseconds) | 30000 |
| --debug | Show debug/summary output | False |
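To illustrate the kind of data the email and metadata options collect, here is a minimal sketch using only the Python standard library. It is not how CeWLio itself is implemented (the project lists BeautifulSoup4 as its HTML parser); the class name `EmailMetaParser`, the simplified email regex, and the sample HTML are assumptions for illustration.

```python
import re
from html.parser import HTMLParser

# Deliberately simple pattern for illustration; real-world address
# matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
META_NAMES = {"description", "keywords", "author"}


class EmailMetaParser(HTMLParser):
    """Hypothetical collector for emails and selected meta tags."""

    def __init__(self):
        super().__init__()
        self.emails = set()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            if href.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                address = href[len("mailto:"):].split("?", 1)[0]
                if EMAIL_RE.fullmatch(address):
                    self.emails.add(address)
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            if name in META_NAMES and content:
                self.metadata[name] = content

    def handle_data(self, data):
        # Pick up addresses that appear in visible text.
        self.emails.update(EMAIL_RE.findall(data))


html = """
<html><head>
<meta name="author" content="Jane Doe">
<meta name="keywords" content="security, wordlist">
</head><body>
<p>Contact us at test@example.com</p>
<a href="mailto:admin@example.com?subject=Hi">Mail</a>
</body></html>
"""

parser = EmailMetaParser()
parser.feed(html)
parser.close()
```

After `feed()`, `parser.emails` holds addresses found both in text and in mailto links, and `parser.metadata` maps each recognized meta name to its content.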
Use CeWLio from Python:

from cewlio import CeWLio

# Create instance with custom settings
cewlio = CeWLio(
    min_word_length=4,
    max_word_length=12,
    lowercase=True,
    convert_umlauts=True
)

# Process HTML content
html_content = "<p>Hello world! Contact us at test@example.com</p>"
cewlio.process_html(html_content)

# Access results
print("Words:", list(cewlio.words.keys()))
print("Emails:", list(cewlio.emails))
print("Metadata:", list(cewlio.metadata))

Process a URL asynchronously:

import asyncio
from cewlio import CeWLio, process_url_with_cewlio

async def main():
    cewlio = CeWLio()
    success = await process_url_with_cewlio(
        url="https://example.com",
        cewlio_instance=cewlio,
        wait_time=5,
        headless=True
    )
    if success:
        print(f"Found {len(cewlio.words)} words")
        print(f"Found {len(cewlio.emails)} emails")

asyncio.run(main())

The project includes a comprehensive test suite with 38 tests covering all functionality:
- Core functionality tests (15 tests)
- HTML extraction tests (3 tests)
- URL processing tests (2 tests)
- Integration tests (3 tests)
- CLI argument validation tests (5 tests)
- Edge case tests (10 tests)
Total: 38 tests with 100% success rate
For detailed testing information and development setup, see CONTRIBUTING.md.
"No module named 'playwright'":
pip install playwright
playwright install chromium-headless-shell

JavaScript-heavy sites not loading properly:
# Increase wait time for JavaScript execution
cewlio https://example.com -w 10

Browser timeout errors:
# Increase timeout and wait time
cewlio https://example.com --timeout 60000 -w 5

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:
- Getting started with development
- Code style and formatting guidelines
- Testing requirements and procedures
- Submitting pull requests
- Reporting issues
- Feature requests
Quick start:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
For detailed development setup and guidelines, see CONTRIBUTING.md.
- Inspired by CeWL by Robin Wood
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ for the security community