CeWLio 🕵️‍♂️✨

CeWLio is a powerful, Python-based Custom Word List Generator inspired by the original CeWL by Robin Wood. While CeWL is excellent for static HTML content, CeWLio brings modern web scraping capabilities to handle today's JavaScript-heavy websites. It crawls web pages, executes JavaScript, and extracts:

📚 Unique words (with advanced filtering)
📧 Email addresses
🏷️ Metadata (description, keywords, author)

Perfect for penetration testers, security researchers, and anyone needing high-quality, site-specific wordlists!

🤖 AI-Assisted Development: This project was created with the help of AI tools, but solves real-world problems in web scraping and word list generation.

🚀 Features

JavaScript-Aware Extraction: Uses a headless browser to render pages and extract content after JavaScript execution.
Modern Web Support: Handles Single Page Applications (SPAs), infinite scroll, lazy loading, and dynamic content that traditional scrapers miss.
Advanced Word Processing:
- Minimum/maximum word length filtering
- Lowercase conversion
- Alphanumeric or alpha-only words
- Umlaut conversion (ä→ae, ö→oe, ü→ue, ß→ss)
- Word frequency counting
Word Grouping: Generate multi-word phrases (e.g., 2-grams, 3-grams)
Email & Metadata Extraction: Find emails from content and mailto links, extract meta tags
Flexible Output: Save words, emails, and metadata to separate files or stdout
Professional CLI: All features accessible via command-line interface with CeWL-compatible flags
Silent Operation: Runs quietly by default, with optional debug output
Comprehensive Testing: 38 tests, all passing
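
To make the word-processing and word-grouping features above more concrete, here is a minimal standard-library sketch of length filtering, lowercasing, umlaut conversion, frequency counting, and n-gram grouping. It is an illustration only, not CeWLio's actual implementation: the function names `extract_words` and `word_groups` and the regular expressions are assumptions made for this example. Only the four umlaut mappings listed above are handled.

```python
import re
from collections import Counter

# Mapping taken from the feature list (ä→ae, ö→oe, ü→ue, ß→ss).
# Uppercase umlauts are not handled in this sketch unless lowercase=True.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def extract_words(text, min_length=3, max_length=None,
                  lowercase=False, with_numbers=False, umlauts=False):
    """Hypothetical helper: split text into words and apply the filters."""
    if lowercase:
        text = text.lower()
    if umlauts:
        for src, dst in UMLAUTS.items():
            text = text.replace(src, dst)
    # [^\W\d_]+ matches runs of letters only; [^\W_]+ also allows digits.
    # In letters-only mode a token such as "abc123" contributes just "abc".
    pattern = r"[^\W_]+" if with_numbers else r"[^\W\d_]+"
    return [
        word for word in re.findall(pattern, text)
        if len(word) >= min_length
        and (max_length is None or len(word) <= max_length)
    ]


def word_groups(words, size):
    """Hypothetical helper: join each run of `size` consecutive words."""
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


words = extract_words("Größe matters: the Größe of the list",
                      lowercase=True, umlauts=True)
counts = Counter(words)          # word frequency counting
pairs = word_groups(words, 2)    # 2-grams
```

Lowercasing is applied before umlaut conversion here so that capitalized umlauts are also converted when both options are on; whether CeWLio orders the steps the same way is not stated in this README.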

🛠️ Installation

From PyPI (Recommended)

pip install cewlio

From Source

git clone https://github.com/0xCardinal/cewlio
cd cewlio
pip install -e .

Dependencies

Python 3.12+
Playwright (for browser automation)
BeautifulSoup4 (for HTML parsing)
Requests (for HTTP handling)

Note: After installing Playwright, you only need to install the chromium-headless-shell browser:

playwright install chromium-headless-shell

⚡ Quick Start

Basic Usage

# Extract words from a website (silent by default)
cewlio https://example.com

# Save words to a file
cewlio https://example.com --output wordlist.txt

# Include emails in stdout output
cewlio https://example.com -e

# Include metadata in stdout output
cewlio https://example.com -a

# Save emails and metadata to files
cewlio https://example.com --email_file emails.txt --meta_file meta.txt

More Examples

Generate word groups with counts:

cewlio https://example.com --groups 3 -c --output phrases.txt

Custom word filtering:

cewlio https://example.com -m 4 --max-length 12 --lowercase --convert-umlauts

Handle JavaScript-heavy sites:

cewlio https://example.com -w 5 --visible

Extract only emails and metadata (no words):

cewlio https://example.com -e -a

Extract only emails (no words):

cewlio https://example.com -e

Extract only metadata (no words):

cewlio https://example.com -a

Save emails to file (no words to stdout):

cewlio https://example.com --email_file emails.txt

🎛️ Command-Line Options

| Option | Description | Default |
| --- | --- | --- |
| `url` | URL to process | Required |
| `--version` | Show version and exit | - |
| `--output` | Output file for words | stdout |
| `-e, --email` | Include email addresses in stdout output | False |
| `--email_file` | Output file for email addresses | - |
| `-a, --meta` | Include metadata in stdout output | False |
| `--meta_file` | Output file for metadata | - |
| `-m, --min_word_length` | Minimum word length | 3 |
| `--max-length` | Maximum word length | No limit |
| `--lowercase` | Convert words to lowercase | False |
| `--with-numbers` | Include words with numbers | False |
| `--convert-umlauts` | Convert umlaut characters | False |
| `-c, --count` | Show word counts | False |
| `--groups` | Generate word groups of specified size | - |
| `-w, --wait` | Wait time for JavaScript execution (seconds) | 0 |
| `--visible` | Show browser window | False |
| `--timeout` | Browser timeout (milliseconds) | 30000 |
| `--debug` | Show debug/summary output | False |
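
When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below builds a `cewlio` command from a subset of the options in the table above. The helper name `build_cewlio_command` is hypothetical and not part of the package; the flags themselves are the ones documented here.

```python
def build_cewlio_command(url, output=None, min_word_length=None,
                         max_length=None, lowercase=False, count=False,
                         groups=None, wait=None, timeout=None):
    """Hypothetical helper: return an argv list for the cewlio CLI."""
    cmd = ["cewlio", url]
    if output is not None:
        cmd += ["--output", str(output)]
    if min_word_length is not None:
        cmd += ["-m", str(min_word_length)]
    if max_length is not None:
        cmd += ["--max-length", str(max_length)]
    if lowercase:
        cmd.append("--lowercase")
    if count:
        cmd.append("-c")
    if groups is not None:
        cmd += ["--groups", str(groups)]
    if wait is not None:
        cmd += ["-w", str(wait)]          # seconds
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]  # milliseconds
    return cmd


cmd = build_cewlio_command("https://example.com", min_word_length=4,
                           lowercase=True, wait=5)
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues with URLs that contain `&` or `?`.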

📚 API Usage

Basic Python Usage

from cewlio import CeWLio

# Create instance with custom settings
cewlio = CeWLio(
    min_word_length=4,
    max_word_length=12,
    lowercase=True,
    convert_umlauts=True
)

# Process HTML content
html_content = "<p>Hello world! Contact us at test@example.com</p>"
cewlio.process_html(html_content)

# Access results
print("Words:", list(cewlio.words.keys()))
print("Emails:", list(cewlio.emails))
print("Metadata:", list(cewlio.metadata))
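
For readers curious about what kind of extraction happens on an HTML snippet like the one above, here is a rough standard-library sketch that collects emails from text and `mailto:` links and reads the description, keywords, and author meta tags. It is not CeWLio's implementation (which, per the dependency list, relies on BeautifulSoup4); the class `EmailMetaParser`, the function `extract_emails_and_meta`, and the email regular expression are assumptions for illustration.

```python
import re
from html.parser import HTMLParser

# Deliberately simple email pattern; real-world addresses can be more exotic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
META_NAMES = {"description", "keywords", "author"}


class EmailMetaParser(HTMLParser):
    """Hypothetical parser: gathers emails and selected meta tags."""

    def __init__(self):
        super().__init__()
        self.emails = set()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            if href.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                address = href[len("mailto:"):].split("?", 1)[0]
                if EMAIL_RE.fullmatch(address):
                    self.emails.add(address)
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in META_NAMES and attrs.get("content"):
                self.metadata[name] = attrs["content"]

    def handle_data(self, data):
        # Emails that appear in visible text.
        self.emails.update(EMAIL_RE.findall(data))


def extract_emails_and_meta(html):
    parser = EmailMetaParser()
    parser.feed(html)
    parser.close()
    return parser.emails, parser.metadata


sample = (
    '<head><meta name="author" content="Jane"></head>'
    '<body><p>Contact us at test@example.com.</p>'
    '<a href="mailto:sales@example.com?subject=Hi">Mail</a></body>'
)
emails, metadata = extract_emails_and_meta(sample)
```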

Process URLs

import asyncio
from cewlio import CeWLio, process_url_with_cewlio

async def main():
    cewlio = CeWLio()
    success = await process_url_with_cewlio(
        url="https://example.com",
        cewlio_instance=cewlio,
        wait_time=5,
        headless=True
    )
    
    if success:
        print(f"Found {len(cewlio.words)} words")
        print(f"Found {len(cewlio.emails)} emails")

asyncio.run(main())
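
Once a URL has been processed, the results usually need to be written out. The sketch below ranks words by frequency and saves them to a file. It assumes that `cewlio.words` behaves like a mapping from word to count, which the `.keys()` call in the basic example suggests but this README does not confirm. The helper names and the `word, count` line format are choices made for this example.

```python
from pathlib import Path


def format_wordlist(words, with_counts=False):
    """Hypothetical helper: most frequent first, ties broken alphabetically."""
    ranked = sorted(words.items(), key=lambda item: (-item[1], item[0]))
    if with_counts:
        return [f"{word}, {count}" for word, count in ranked]
    return [word for word, _ in ranked]


def save_wordlist(words, path, with_counts=False):
    """Hypothetical helper: write one entry per line as UTF-8."""
    lines = format_wordlist(words, with_counts)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# A plain dict stands in for cewlio.words here (assumed word -> count).
example_words = {"beta": 1, "alpha": 1, "gamma": 3}
ranked_lines = format_wordlist(example_words, with_counts=True)
```

With a real instance, the call would be `save_wordlist(cewlio.words, "wordlist.txt")`, provided the mapping assumption holds.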

🧪 Testing

The project includes a comprehensive test suite with 38 tests covering all functionality:

✅ Core functionality tests (15 tests)
✅ HTML extraction tests (3 tests)
✅ URL processing tests (2 tests)
✅ Integration tests (3 tests)
✅ CLI argument validation tests (5 tests)
✅ Edge case tests (10 tests)

Total: 38 tests with 100% success rate

For detailed testing information and development setup, see CONTRIBUTING.md.

🐛 Troubleshooting

Common Issues

"No module named 'playwright'"

pip install playwright
playwright install chromium-headless-shell
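
To see which Python dependencies are missing before reinstalling, a quick check along these lines may help. The function name `missing_modules` is hypothetical; `bs4` is the import name of BeautifulSoup4. This only checks that the Python packages are importable, not that the browser binary has been downloaded.

```python
import importlib.util


def missing_modules(names=("playwright", "bs4", "requests")):
    """Hypothetical helper: return the top-level modules that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All Python dependencies are importable.")
```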

JavaScript-heavy sites not loading properly

# Increase wait time for JavaScript execution
cewlio https://example.com -w 10

Browser timeout errors

# Increase timeout and wait time
cewlio https://example.com --timeout 60000 -w 5
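
When using the Python API, the same advice can be applied automatically by retrying with progressively longer waits. The sketch below is one possible approach, not something CeWLio provides: `retry_with_longer_waits` is a hypothetical helper, and it assumes the attempt callable returns a falsy value on failure, as the `success` check in the API example suggests for `process_url_with_cewlio`.

```python
import asyncio


async def retry_with_longer_waits(attempt, wait_times=(0, 5, 10)):
    """Hypothetical helper: call attempt(wait_time) until it succeeds.

    Returns the wait time that worked, or None if every attempt failed.
    """
    for wait_time in wait_times:
        if await attempt(wait_time):
            return wait_time
    return None


# Stand-in for a real attempt; succeeds only once the wait is long enough.
async def fake_attempt(wait_time):
    return wait_time >= 5


used_wait = asyncio.run(retry_with_longer_waits(fake_attempt))
```

With the real API, `attempt` could be a small coroutine that creates a fresh `CeWLio()` instance and awaits `process_url_with_cewlio(url=..., cewlio_instance=..., wait_time=wait_time, headless=True)`, so that a failed attempt does not leave partial results behind.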

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:

🚀 Getting started with development
📝 Code style and formatting guidelines
🧪 Testing requirements and procedures
🔄 Submitting pull requests
🐛 Reporting issues
💡 Feature requests

Quick start:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request

For detailed development setup and guidelines, see CONTRIBUTING.md.

🙏 Credits

Inspired by CeWL by Robin Wood

📞 Support

🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Made with ❤️ for the security community
