A powerful Python-based Custom Word List Generator that crawls web pages with JavaScript support to extract words, emails, and metadata for security testing and research.

0xCardinal/CeWLio

CeWLio 🕵️‍♂️✨


CeWLio is a powerful, Python-based Custom Word List Generator inspired by the original CeWL by Robin Wood. While CeWL is excellent for static HTML content, CeWLio brings modern web scraping capabilities to handle today's JavaScript-heavy websites. It crawls web pages, executes JavaScript, and extracts:

  • 📚 Unique words (with advanced filtering)
  • 📧 Email addresses
  • 🏷️ Metadata (description, keywords, author)

Perfect for penetration testers, security researchers, and anyone needing high-quality, site-specific wordlists!

🤖 AI-Assisted Development: This project was created with the help of AI tools, but solves real-world problems in web scraping and word list generation.


🚀 Features

  • JavaScript-Aware Extraction: Uses a headless browser to render pages and extract content after JavaScript execution.
  • Modern Web Support: Handles Single Page Applications (SPAs), infinite scroll, lazy loading, and dynamic content that traditional scrapers miss.
  • Advanced Word Processing:
    • Minimum/maximum word length filtering
    • Lowercase conversion
    • Alphanumeric or alpha-only words
    • Umlaut conversion (ä→ae, ö→oe, ü→ue, ß→ss)
    • Word frequency counting
  • Word Grouping: Generate multi-word phrases (e.g., 2-grams, 3-grams)
  • Email & Metadata Extraction: Find emails from content and mailto links, extract meta tags
  • Flexible Output: Save words, emails, and metadata to separate files or stdout
  • Professional CLI: All features accessible via command-line interface with CeWL-compatible flags
  • Silent Operation: Runs quietly by default, with optional debug output
  • Comprehensive Testing: 38 tests covering all functionality, all passing
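
The word-processing options listed above can be illustrated with a small standard-library sketch. This is not CeWLio's actual implementation: the function name `extract_words`, its parameters, and the uppercase umlaut mappings are assumptions made for illustration, and a letters-only regular expression stands in for whatever tokenizer the tool really uses.

```python
import re
from collections import Counter

# Lowercase mappings come from the feature list; the uppercase
# variants are an assumption added so capitalised words convert too.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def extract_words(text, min_length=3, max_length=None,
                  lowercase=False, with_numbers=False,
                  convert_umlauts=False):
    """Return a Counter of the words in `text` that pass the filters."""
    if convert_umlauts:
        for src, dst in UMLAUTS.items():
            text = text.replace(src, dst)
    if lowercase:
        text = text.lower()
    # [^\W_]+ matches runs of letters and digits; [^\W\d_]+ matches letters only.
    pattern = r"[^\W_]+" if with_numbers else r"[^\W\d_]+"
    counts = Counter()
    for word in re.findall(pattern, text):
        if len(word) < min_length:
            continue
        if max_length is not None and len(word) > max_length:
            continue
        counts[word] += 1
    return counts


counts = extract_words("Über die Brücke, über den Fluß",
                       lowercase=True, convert_umlauts=True)
print(counts.most_common(1))
```

Note that in letters-only mode this sketch splits a token such as `pass123` at the digits and keeps `pass`; the real tool may instead drop such tokens entirely.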

🛠️ Installation

From PyPI (Recommended)

pip install cewlio

From Source

git clone https://github.com/0xCardinal/cewlio
cd cewlio
pip install -e .

Dependencies

  • Python 3.12+
  • Playwright (for browser automation)
  • BeautifulSoup4 (for HTML parsing)
  • Requests (for HTTP handling)

Note: After installing Playwright, you only need to install the chromium-headless-shell browser:

playwright install chromium-headless-shell

⚡ Quick Start

Basic Usage

# Extract words from a website (silent by default)
cewlio https://example.com

# Save words to a file
cewlio https://example.com --output wordlist.txt

# Include emails in stdout output
cewlio https://example.com -e

# Include metadata in stdout output
cewlio https://example.com -a

# Save emails and metadata to files
cewlio https://example.com --email_file emails.txt --meta_file meta.txt

More Examples

Generate word groups with counts:

cewlio https://example.com --groups 3 -c --output phrases.txt
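
As a rough illustration of what grouping with counts produces, the sketch below slides a window over a word sequence and counts each phrase. The helper name `word_groups` is hypothetical, and whether CeWLio groups across sentence or element boundaries is not stated here, so this version simply treats the input as one flat sequence.

```python
from collections import Counter


def word_groups(words, size):
    """Count every run of `size` consecutive words, joined by spaces."""
    if size < 1:
        raise ValueError("size must be at least 1")
    groups = Counter()
    for i in range(len(words) - size + 1):
        groups[" ".join(words[i:i + size])] += 1
    return groups


words = "the quick brown fox jumps over the quick brown dog".split()
groups = word_groups(words, 3)
for phrase, count in groups.most_common(2):
    print(phrase, count)
```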

Custom word filtering:

cewlio https://example.com -m 4 --max-length 12 --lowercase --convert-umlauts

Handle JavaScript-heavy sites:

cewlio https://example.com -w 5 --visible
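
For readers curious what a wait time and a visible browser mean in practice, here is a minimal Playwright sketch of rendering a page and pausing before reading the DOM. It is an assumption about the general approach, not a copy of CeWLio's code; the function name `render_page` and its parameters are hypothetical.

```python
import asyncio


async def render_page(url, wait_seconds=0, headless=True, timeout_ms=30000):
    """Load `url` in Chromium, optionally wait, and return the rendered HTML."""
    # Imported inside the function so this module still loads
    # on machines where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=headless)
        try:
            page = await browser.new_page()
            await page.goto(url, timeout=timeout_ms)
            if wait_seconds > 0:
                # Give client-side scripts time to populate the page.
                await page.wait_for_timeout(wait_seconds * 1000)
            return await page.content()
        finally:
            await browser.close()


# Example call (requires Playwright and a Chromium browser to be installed):
# html = asyncio.run(render_page("https://example.com", wait_seconds=5, headless=False))
```

A fixed wait is the simplest option; waiting for a specific selector or network idle state is usually more reliable when the target page is known in advance.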

Extract only emails and metadata (no words):

cewlio https://example.com -e -a

Extract only emails (no words):

cewlio https://example.com -e
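
The feature list mentions finding emails in both page content and mailto links. A simplified standard-library sketch of that idea follows; the regular expression is deliberately loose, and the names `EMAIL_RE`, `EmailCollector`, and `extract_emails` are illustrative assumptions rather than CeWLio internals.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# A permissive pattern; it does not attempt full RFC 5322 validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class EmailCollector(HTMLParser):
    """Collect addresses from text nodes and from mailto: links."""

    def __init__(self):
        super().__init__()
        self.emails = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... query part.
            address = unquote(href[len("mailto:"):].split("?", 1)[0])
            self.emails.update(EMAIL_RE.findall(address))

    def handle_data(self, data):
        self.emails.update(EMAIL_RE.findall(data))


def extract_emails(html):
    collector = EmailCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.emails)


html = ('<p>Write to test@example.com</p>'
        '<a href="mailto:sales@example.com?subject=Hi">Sales</a>')
print(extract_emails(html))
```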

Extract only metadata (no words):

cewlio https://example.com -a
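
The metadata in question is the description, keywords, and author meta tags. As a hedged sketch of how such tags can be read with the standard library (the class and function names are hypothetical, and CeWLio itself lists BeautifulSoup4 as its HTML parser):

```python
from html.parser import HTMLParser

# The three meta names mentioned in the feature overview.
META_NAMES = {"description", "keywords", "author"}


class MetaCollector(HTMLParser):
    """Record the content of selected <meta name=...> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").strip()
        if name in META_NAMES and content:
            self.metadata[name] = content


def extract_metadata(html):
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.metadata


page = ('<head><meta name="author" content="Jane Doe">'
        '<meta name="Keywords" content="security, wordlist"/>'
        '<meta charset="utf-8"></head>')
print(extract_metadata(page))
```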

Save emails to file (no words to stdout):

cewlio https://example.com --email_file emails.txt

🎛️ Command-Line Options

| Option | Description | Default |
|---|---|---|
| url | URL to process | Required |
| --version | Show version and exit | - |
| --output | Output file for words | stdout |
| -e, --email | Include email addresses in stdout output | False |
| --email_file | Output file for email addresses | - |
| -a, --meta | Include metadata in stdout output | False |
| --meta_file | Output file for metadata | - |
| -m, --min_word_length | Minimum word length | 3 |
| --max-length | Maximum word length | No limit |
| --lowercase | Convert words to lowercase | False |
| --with-numbers | Include words with numbers | False |
| --convert-umlauts | Convert umlaut characters | False |
| -c, --count | Show word counts | False |
| --groups | Generate word groups of specified size | - |
| -w, --wait | Wait time for JavaScript execution (seconds) | 0 |
| --visible | Show browser window | False |
| --timeout | Browser timeout (milliseconds) | 30000 |
| --debug | Show debug/summary output | False |
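
To make the defaults above concrete, the following argparse sketch mirrors a subset of the options. It is an illustration only and not the project's actual parser: the program name `cewlio-sketch`, the choice of `int` for the numeric options, and the omission of the remaining flags are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring part of the documented option table."""
    parser = argparse.ArgumentParser(prog="cewlio-sketch")
    parser.add_argument("url", help="URL to process")
    parser.add_argument("--output", help="Output file for words (default: stdout)")
    parser.add_argument("-e", "--email", action="store_true")
    parser.add_argument("-a", "--meta", action="store_true")
    parser.add_argument("-m", "--min_word_length", type=int, default=3)
    parser.add_argument("--max-length", type=int, default=None)
    parser.add_argument("--lowercase", action="store_true")
    parser.add_argument("-c", "--count", action="store_true")
    parser.add_argument("--groups", type=int, default=None)
    parser.add_argument("-w", "--wait", type=int, default=0)
    parser.add_argument("--timeout", type=int, default=30000)
    return parser


args = build_parser().parse_args(["https://example.com", "-m", "4", "--lowercase"])
print(args.min_word_length, args.lowercase, args.wait, args.timeout)
```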

📚 API Usage

Basic Python Usage

from cewlio import CeWLio

# Create instance with custom settings
cewlio = CeWLio(
    min_word_length=4,
    max_word_length=12,
    lowercase=True,
    convert_umlauts=True
)

# Process HTML content
html_content = "<p>Hello world! Contact us at test@example.com</p>"
cewlio.process_html(html_content)

# Access results
print("Words:", list(cewlio.words.keys()))
print("Emails:", list(cewlio.emails))
print("Metadata:", list(cewlio.metadata))
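
Once words are collected, a common next step is ranking them before writing a wordlist. The sketch below assumes `cewlio.words` behaves like a mapping from word to occurrence count, which the `.keys()` call above suggests but does not confirm; a plain dict stands in for it here, and the `word, count` line format is only an illustrative choice.

```python
def format_wordlist(words, show_counts=False):
    """Return one line per word, most frequent first, ties in alphabetical order."""
    ranked = sorted(words.items(), key=lambda item: (-item[1], item[0]))
    if show_counts:
        return [f"{word}, {count}" for word, count in ranked]
    return [word for word, _ in ranked]


# Stand-in for cewlio.words (assumed shape: {word: count}).
sample_words = {"hello": 2, "world": 1, "contact": 2}
for line in format_wordlist(sample_words, show_counts=True):
    print(line)
```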

Process URLs

import asyncio
from cewlio import CeWLio, process_url_with_cewlio

async def main():
    cewlio = CeWLio()
    success = await process_url_with_cewlio(
        url="https://example.com",
        cewlio_instance=cewlio,
        wait_time=5,
        headless=True
    )
    
    if success:
        print(f"Found {len(cewlio.words)} words")
        print(f"Found {len(cewlio.emails)} emails")

asyncio.run(main())
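
When several URLs need processing, the same coroutine can be run with bounded concurrency. The helper below is a generic sketch: `process_many` and its `process_one` parameter are hypothetical names, and in real use `process_one` might wrap `process_url_with_cewlio` with its own `CeWLio` instance. Whether one instance can safely be shared across concurrent calls is not documented here, so that is left to the caller.

```python
import asyncio


async def process_many(urls, process_one, max_concurrency=3):
    """Run `process_one(url)` for each URL, at most `max_concurrency` at a time.

    Returns a dict mapping each URL to the result, or False if it raised.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await process_one(url)
            except Exception:
                # Treat any failure as an unsuccessful URL and keep going.
                return url, False

    pairs = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(pairs)


async def fake_process(url):
    """Stand-in for a real crawler call; fails for URLs containing 'bad'."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise RuntimeError("simulated failure")
    return True


results = asyncio.run(
    process_many(["https://a.example", "https://bad.example"], fake_process)
)
print(results)
```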

🧪 Testing

The project includes a comprehensive test suite with 38 tests covering all functionality:

  • ✅ Core functionality tests (15 tests)
  • ✅ HTML extraction tests (3 tests)
  • ✅ URL processing tests (2 tests)
  • ✅ Integration tests (3 tests)
  • ✅ CLI argument validation tests (5 tests)
  • ✅ Edge case tests (10 tests)

Total: 38 tests with 100% success rate

For detailed testing information and development setup, see CONTRIBUTING.md.


🐛 Troubleshooting

Common Issues

"No module named 'playwright'"

pip install playwright
playwright install chromium-headless-shell

JavaScript-heavy sites not loading properly

# Increase wait time for JavaScript execution
cewlio https://example.com -w 10

Browser timeout errors

# Increase timeout and wait time
cewlio https://example.com --timeout 60000 -w 5

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:

  • 🚀 Getting started with development
  • 📝 Code style and formatting guidelines
  • 🧪 Testing requirements and procedures
  • 🔄 Submitting pull requests
  • 🐛 Reporting issues
  • 💡 Feature requests

Quick start:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request



🙏 Credits

  • Inspired by CeWL by Robin Wood

📞 Support


Made with ❤️ for the security community
