SmartCrawl

Async web crawler that discovers URLs, downloads their HTML, and downloads and processes linked documents. It processes PDF, DOCX, and other formats with Docling and exports them to Markdown, JSON, or HTML for chatbot and RAG applications.

Setup

Install dependencies:
```
pip install -r requirements.txt
```
Install Playwright browsers (required for JavaScript page rendering):
```
playwright install
```
Create a .env file in the project root (if it doesn't exist):
```
CONFIG_PATH=app/config/files/config.yaml
```
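As a rough illustration of how the CONFIG_PATH entry might be consumed, the sketch below parses KEY=VALUE lines from a .env file using only the standard library. The helper names `load_env_file` and `resolve_config_path` are hypothetical; SmartCrawl itself may well rely on a library such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines (hypothetical helper, not SmartCrawl's loader)."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_config_path(env_file: str = ".env") -> str:
    """Prefer a real environment variable, then the .env entry, then the default shown above."""
    file_values = load_env_file(env_file)
    return os.environ.get(
        "CONFIG_PATH",
        file_values.get("CONFIG_PATH", "app/config/files/config.yaml"),
    )
```

Letting an exported environment variable override the file is a common convention, assumed here rather than confirmed by the project.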

Running the Project

Basic Usage

Run URL discovery only:

python scripts/run_orchestrator.py

Or specify a URL:

python scripts/run_orchestrator.py https://example.com

With Document Download

To also download documents (PDFs, Word docs, etc.) after discovery:

python scripts/run_orchestrator.py --download

Or with a custom URL:

python scripts/run_orchestrator.py https://example.com --download

JavaScript Download Manager Support: The downloader automatically handles sites using JavaScript-based download managers:

WordPress Download Manager
Easy Digital Downloads
Better File Download
Any site with download buttons (not direct links)

The downloader uses Playwright to click download buttons and capture the resulting files. This behavior is enabled by default; set click_download_buttons: false in document_sweeping.yaml to disable it.
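The button-click approach can be sketched with Playwright's download event. This is an illustration under stated assumptions: the function name `click_and_capture`, the caller-supplied CSS selector, and the use of Playwright's synchronous API are not taken from SmartCrawl, which is described as async.

```python
from pathlib import Path


def click_and_capture(page_url: str, button_selector: str, out_dir: str = "downloads") -> Path:
    """Click a download button and save the resulting file (hypothetical helper)."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(page_url)
            # expect_download waits for the download triggered inside the block.
            with page.expect_download() as download_info:
                page.click(button_selector)
            download = download_info.value
            destination = target_dir / download.suggested_filename
            download.save_as(str(destination))
        finally:
            browser.close()
    return destination
```

A call might look like `click_and_capture("https://example.com/files", "a.download-button")`, where the selector is a placeholder that depends on the target site.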

With HTML Content Saving

Save raw HTML and processed (cleaned) content:

python scripts/run_orchestrator.py --save-html

This will:

Save raw HTML files to html_output/raw_html/
Save processed text to html_output/processed/
Save metadata (title, description, author, date) to html_output/metadata/
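The three-directory layout above could be produced by something like the following sketch. The function name `save_page`, the hashed file names, and the file extensions are assumptions made for illustration; only the directory names come from this README.

```python
import hashlib
import json
from pathlib import Path


def save_page(url: str, raw_html: str, text: str, metadata: dict,
              root: str = "html_output") -> dict[str, Path]:
    """Write one page into raw_html/, processed/ and metadata/ under root."""
    # A URL hash gives a filesystem-safe file stem; this naming scheme is an assumption.
    stem = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    base = Path(root)
    targets = {
        "raw_html": (base / "raw_html" / f"{stem}.html", raw_html),
        "processed": (base / "processed" / f"{stem}.txt", text),
        "metadata": (
            base / "metadata" / f"{stem}.json",
            json.dumps({"url": url, **metadata}, indent=2),
        ),
    }
    written: dict[str, Path] = {}
    for name, (path, content) in targets.items():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
        written[name] = path
    return written
```

Using the same stem in all three directories keeps the raw HTML, cleaned text, and metadata for a page easy to match up.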

Extraction modes (set in html_saving.yaml):

extraction_mode: "full_text" (default) - Extract ALL visible text from page
extraction_mode: "main_content" - Extract only article content (removes nav/ads/boilerplate)
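To make the difference between the two modes concrete, here is a minimal standard-library sketch. It is not SmartCrawl's extractor: the tag lists and the `extract_text` function are assumptions, and a real main-content extractor would typically use stronger heuristics than dropping a few structural tags.

```python
from html.parser import HTMLParser

ALWAYS_HIDDEN = {"script", "style", "noscript"}    # never visible text
BOILERPLATE = {"nav", "header", "footer", "aside"}  # assumed boilerplate containers


class _TextCollector(HTMLParser):
    """Collect text nodes that are not nested inside any skipped tag."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = skip_tags
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str, extraction_mode: str = "full_text") -> str:
    """Return visible text; "main_content" additionally drops boilerplate tags."""
    if extraction_mode == "full_text":
        skip = ALWAYS_HIDDEN
    elif extraction_mode == "main_content":
        skip = ALWAYS_HIDDEN | BOILERPLATE
    else:
        raise ValueError(f"unknown extraction_mode: {extraction_mode!r}")
    parser = _TextCollector(skip)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

With this sketch, text inside a nav or footer element appears in "full_text" output but is omitted in "main_content" output.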

With JavaScript Rendering (Dynamic Sites)

For sites that load content via JavaScript (React, Vue, Angular, etc.), use Playwright to render the full DOM:

python scripts/run_orchestrator.py --save-html --use-playwright

This will:

Use a headless Chrome browser to render pages
Execute JavaScript and wait for dynamic content
Scroll pages to trigger lazy loading
Capture the fully-rendered HTML (not just the initial response)

Note: Playwright mode is slower but necessary for:

Single Page Applications (SPAs)
Sites with infinite scroll
Content loaded via AJAX/fetch
Lazy-loaded images and text
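The rendering steps listed above (headless browser, wait for dynamic content, scroll, capture the DOM) can be sketched as follows. The function name `render_page`, the scroll distance, and the pause length are illustrative assumptions, and the synchronous Playwright API is used for brevity even though SmartCrawl is described as async.

```python
def render_page(url: str, scroll_steps: int = 5, pause_ms: int = 500) -> str:
    """Return the fully rendered HTML of a page (hypothetical helper)."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Chromium is assumed here; the project may launch a different browser channel.
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity has settled.
            page.goto(url, wait_until="networkidle")
            for _ in range(scroll_steps):
                # Scroll down to trigger lazy loading, then give the page time to react.
                page.mouse.wheel(0, 2000)
                page.wait_for_timeout(pause_ms)
            # content() serializes the current DOM, not the initial HTTP response.
            return page.content()
        finally:
            browser.close()
```

Each extra scroll step adds at least `pause_ms` of latency per page, which is one reason this mode is slower than a plain HTTP fetch.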

Combine all options:

python scripts/run_orchestrator.py https://example.com --save-html --use-playwright --download
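For readers who want to see how the optional URL and the three flags fit together, here is a minimal argparse sketch. It mirrors the commands shown in this README but is not the actual contents of scripts/run_orchestrator.py; the `build_parser` name and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented command line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_orchestrator.py")
    # Optional positional URL; None means "fall back to the configured default".
    parser.add_argument("url", nargs="?", default=None,
                        help="target URL (defaults to the configured test URL)")
    parser.add_argument("--download", action="store_true",
                        help="download documents after URL discovery")
    parser.add_argument("--save-html", action="store_true",
                        help="save raw HTML, processed text, and metadata")
    parser.add_argument("--use-playwright", action="store_true",
                        help="render pages in a headless browser first")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Because every flag is an independent `store_true` switch, they can be combined in any order, as in the command above.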

Configuration

Test URL: Edit app/config/files/test.yaml to change the default target URL
URL Discovery: Edit app/config/files/url_discovery.yaml for crawling settings
Document Sweeping: Edit app/config/files/document_sweeping.yaml for download settings
HTML Saving: Edit app/config/files/html_saving.yaml for HTML content extraction settings

Output

Discovered URLs are logged to the console
Downloaded documents are saved to the downloads/ directory (configurable in document_sweeping.yaml)
HTML content is saved to the html_output/ directory, with subdirectories for raw HTML, processed text, and metadata
