Async web crawler that discovers URLs, downloads the HTML of each page, and downloads and processes documents. It processes PDF, DOCX, and other formats using Docling, exporting to Markdown/JSON/HTML for chatbot and RAG applications.
- Install dependencies:
  pip install -r requirements.txt
- Install Playwright browsers (required for JavaScript page rendering):
  playwright install
- Create a .env file in the project root (if it doesn't exist):
  CONFIG_PATH=app/config/files/config.yaml
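As a rough illustration of how the `CONFIG_PATH` setting might be picked up at startup, here is a minimal sketch using only the standard library. The function names `parse_env_text` and `resolve_config_path` are hypothetical, and the fallback to the path shown above is an assumption; the project may well use a library such as python-dotenv instead.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_config_path(env_file: str = ".env") -> str:
    """Return CONFIG_PATH from the process environment, else from the .env file."""
    if "CONFIG_PATH" in os.environ:
        return os.environ["CONFIG_PATH"]
    path = Path(env_file)
    file_values = parse_env_text(path.read_text()) if path.exists() else {}
    # Assumption: fall back to the path shown in this README if nothing is set.
    return file_values.get("CONFIG_PATH", "app/config/files/config.yaml")
```

A real environment variable takes precedence over the file in this sketch, which mirrors the usual dotenv convention.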
Run URL discovery only:
python scripts/run_orchestrator.py
Or specify a URL:
python scripts/run_orchestrator.py https://example.com
To also download documents (PDFs, Word docs, etc.) after discovery:
python scripts/run_orchestrator.py --download
Or with a custom URL:
python scripts/run_orchestrator.py https://example.com --download
JavaScript Download Manager Support: The downloader automatically handles sites using JavaScript-based download managers:
- WordPress Download Manager
- Easy Digital Downloads
- Better File Download
- Any site with download buttons (not direct links)
It uses Playwright to click download buttons and capture the files. This is enabled by default.
Set click_download_buttons: false in document_sweeping.yaml to disable.
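The click-and-capture behaviour described above can be sketched roughly as follows. This is not the project's actual downloader: `is_direct_document_link`, `capture_click_download`, and the extension list are hypothetical names and values chosen for illustration. The `page` argument is assumed to be a Playwright sync-API `Page`; `expect_download`, `suggested_filename`, and `save_as` are part of Playwright's Python API.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# Assumed set of extensions treated as directly downloadable documents.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


def is_direct_document_link(url: str) -> bool:
    """True when the URL path itself ends in a known document extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in DOCUMENT_EXTENSIONS


def capture_click_download(page, selector: str, dest_dir: str) -> Path:
    """Click a download button on a Playwright sync `page` and save the file."""
    # expect_download() waits for the download event triggered inside the block.
    with page.expect_download() as download_info:
        page.click(selector)
    download = download_info.value
    target = Path(dest_dir) / download.suggested_filename
    target.parent.mkdir(parents=True, exist_ok=True)
    download.save_as(target)
    return target
```

Under this sketch, a link that fails `is_direct_document_link` (for example a download-manager URL with only a query string) would be a candidate for the click-based path.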
Save raw HTML and processed (cleaned) content:
python scripts/run_orchestrator.py --save-html
This will:
- Save raw HTML files to html_output/raw_html/
- Save processed text to html_output/processed/
- Save metadata (title, description, author, date) to html_output/metadata/
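To make the three output locations concrete, the sketch below maps one URL to a file in each directory. Only the directory names come from this README; the function name `output_paths`, the hash-based file stem, and the `.txt` and `.json` extensions are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def output_paths(url: str, root: str = "html_output") -> dict[str, Path]:
    """Return one target path per output directory for a given URL."""
    # Hypothetical naming scheme: a short SHA-256 prefix of the URL as the file stem.
    stem = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    base = Path(root)
    return {
        "raw_html": base / "raw_html" / f"{stem}.html",
        "processed": base / "processed" / f"{stem}.txt",
        "metadata": base / "metadata" / f"{stem}.json",
    }
```

Sharing one stem across the three directories keeps the raw page, its cleaned text, and its metadata easy to match up.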
Extraction modes (set in html_saving.yaml):
- extraction_mode: "full_text" (default) - Extract ALL visible text from the page
- extraction_mode: "main_content" - Extract only article content (removes nav/ads/boilerplate)
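The difference between the two modes can be illustrated with a small standard-library sketch. It is a deliberate simplification: `TextExtractor`, `extract_text`, and the tag lists are hypothetical, and skipping a fixed set of boilerplate tags is only a stand-in for whatever content-extraction method the project actually uses.

```python
from html.parser import HTMLParser

# Elements whose text is never visible on the rendered page.
ALWAYS_SKIPPED = {"script", "style", "noscript", "template", "title"}
# Assumed boilerplate containers dropped in "main_content" mode.
BOILERPLATE = {"nav", "header", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    def __init__(self, extraction_mode: str = "full_text"):
        super().__init__()
        self.skipped = set(ALWAYS_SKIPPED)
        if extraction_mode == "main_content":
            self.skipped |= BOILERPLATE
        self.depth = 0  # nesting depth inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skipped:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skipped and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when we are not inside any skipped element.
        if text and self.depth == 0:
            self.chunks.append(text)


def extract_text(html: str, extraction_mode: str = "full_text") -> str:
    parser = TextExtractor(extraction_mode)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

On a page with a navigation bar and footer, `full_text` keeps their text while `main_content` returns only what lies outside those containers.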
For sites that load content via JavaScript (React, Vue, Angular, etc.), use Playwright to render the full DOM:
python scripts/run_orchestrator.py --save-html --use-playwright
This will:
- Use a headless Chrome browser to render pages
- Execute JavaScript and wait for dynamic content
- Scroll pages to trigger lazy loading
- Capture the fully-rendered HTML (not just the initial response)
Note: Playwright mode is slower but necessary for:
- Single Page Applications (SPAs)
- Sites with infinite scroll
- Content loaded via AJAX/fetch
- Lazy-loaded images and text
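The render, wait, and scroll steps described above could look roughly like the sketch below. The function names, the scroll limit, and the pause length are assumptions, not the project's settings. `render_with_scroll` expects a Playwright sync-API `Page`; `demo` shows one way to obtain such a page and is not called automatically.

```python
def render_with_scroll(page, url: str, max_scrolls: int = 10, pause_ms: int = 500) -> str:
    """Load `url`, scroll until the page height stops growing, return the HTML."""
    page.goto(url, wait_until="networkidle")
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to arrive
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = new_height
    return page.content()


def demo(url: str = "https://example.com") -> str:
    """Illustrative entry point; requires `playwright install` to have been run."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Launches the Chromium build bundled with Playwright, in headless mode.
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        html = render_with_scroll(page, url)
        browser.close()
    return html
```

Capping the number of scrolls keeps infinite-scroll pages from looping forever, at the cost of possibly missing content far down the page.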
Combine all options:
python scripts/run_orchestrator.py https://example.com --save-html --use-playwright --download
- Test URL: Edit app/config/files/test.yaml to change the default target URL
- URL Discovery: Edit app/config/files/url_discovery.yaml for crawling settings
- Document Sweeping: Edit app/config/files/document_sweeping.yaml for download settings
- HTML Saving: Edit app/config/files/html_saving.yaml for HTML content extraction settings
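For orientation, the command-line options used in the commands above could be parsed along these lines. This is a guess at the interface based only on the flags shown in this README; `build_parser` is a hypothetical name, and the real script may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="URL discovery and document download")
    # Optional positional URL; when omitted, the default from test.yaml would apply.
    parser.add_argument("url", nargs="?", default=None)
    parser.add_argument("--download", action="store_true",
                        help="download documents after discovery")
    parser.add_argument("--save-html", action="store_true",
                        help="save raw HTML, processed text, and metadata")
    parser.add_argument("--use-playwright", action="store_true",
                        help="render pages in a headless browser first")
    return parser
```

All three flags default to off in this sketch, so running with no arguments performs URL discovery only.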
- Discovered URLs are logged to the console
- Downloaded documents are saved to the downloads/ directory (configurable in document_sweeping.yaml)
- HTML content is saved to the html_output/ directory with subdirectories for raw HTML, processed text, and metadata