
ScrollScribe is a Python CLI toolkit that grabs docs or indexes website pages and converts them into clean local Markdown using browser automation and LLM filtering, perfect for building RAG datasets.


ScrollScribe

CLI toolkit for ML engineers, developers, data scientists, and researchers.

Extract docs to Markdown • Generate rich CSV/JSON metadata • Prepare data for vector databases

Powered by Crawl4AI • Built with Typer • Python 3.10+ • MIT License

Toolkit | Features | Installation | Quick Start | Processing Modes | Commands | FAQ


With ScrollScribe, you can build your own docs library in minutes. Automatically discover all pages on a documentation site and convert them to clean Markdown files—perfect for agentic workflows, custom search systems, or offline documentation.


The Toolkit

discover - URL Extraction + Metadata

Extract URLs with rich metadata (keywords, depth, timestamps) exported as TXT, CSV, or JSON.

scrape - Page Processing

Process single pages or URL lists. Choose fast mode (50-200 pages/min) or LLM mode (publication-ready Markdown).

process - Unified Pipeline

Point and go: discover + scrape in one command. Fast for bulk extraction, LLM for high-quality output.


⚡ Processing Modes

Fast Mode (--fast)

  • Quickly converts large documentation sites—great for bulk extraction, drafts, or when you don’t need perfect formatting. No API key required.

AI Mode (default, or --no-fast)

  • Uses LLMs for the highest quality Markdown output—ideal for publishing, feeding into websites, or when you want perfectly structured docs. Requires an API key and takes longer per page.

What ScrollScribe Does

  1. Discovers URLs from documentation sites with rich metadata (keywords, depth, timestamps), exportable as TXT, CSV, or JSON
  2. Processes single pages or entire URL lists - choose fast mode (50-200 pages/min) or LLM mode for publication-quality output
  3. Converts HTML to clean Markdown with preserved formatting, code blocks, and working links
  4. Outputs structured data perfect for AI agents, vector databases, or offline documentation

Examples:

  • scribe discover docs.fastapi.com -o urls.json → Get 200+ URLs with metadata for analysis
  • scribe process docs.fastapi.com -o fastapi-docs/ → Get 200+ clean Markdown files
  • scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/ → Process just one page

Installation

git clone https://github.com/JamesN-dev/Scroll-Scribe
cd Scroll-Scribe
uv sync  # or pip install -r requirements.txt
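
To confirm the install, check that the scribe entry point is on your PATH (the command name is taken from the usage examples throughout this README):

# Verify the CLI is available
scribe --help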

Quick Start

Basic Usage

# Convert entire documentation site to Markdown
scribe process https://docs.fastapi.com/ -o fastapi-docs/

# That's it! All pages are now in the fastapi-docs/ folder

Set up API Key (Recommended)

For the highest-quality output, add your API key:

# Create .env file with your API key
echo "OPENROUTER_API_KEY=your-key-here" > .env

# Now uses best model by default (Codestral 2501)
scribe process https://docs.fastapi.com/ -o fastapi-docs/

Processing Modes

ScrollScribe offers two processing modes depending on your needs:

Feature         Fast Mode                        AI Mode
Speed           50-200 pages/minute              10-15 pages/minute
Cost            Free                             ~$0.005 per page (Codestral)
Quality         Good - removes navigation/ads    Excellent - AI-filtered content
API Key         Not required                     Required
Best For        Large sites, quick extraction    High-quality documentation
Default Model   N/A                              Codestral 2501

Fast Mode (No API Key Needed)

# Fast processing - no API key required
scribe process https://docs.fastapi.com/ -o fastapi-docs/ --fast

Good for:

  • Large documentation sites (1000+ pages)
  • Quick content extraction
  • When you don't want to pay for API calls

AI Mode (Default with API Key)

# Uses Codestral 2501 by default - best quality
scribe process https://docs.fastapi.com/ -o fastapi-docs/

Good for:

  • High-quality documentation extraction (default mode)
  • When clean formatting is important
  • Feeding into other AI tools

Commands

ScrollScribe has three main commands:

process - Complete Pipeline (Most Common)

Convert an entire documentation site in one command:

# Discover all pages and convert them to Markdown
scribe process https://docs.fastapi.com/ -o fastapi-docs/

discover - Find All Documentation Pages

Extract URLs from a site with optional metadata (useful for manual curation):

# Get simple list of URLs
scribe discover https://docs.fastapi.com/ -o urls.txt

# Get rich metadata with depth, keywords, and timestamps
scribe discover https://docs.fastapi.com/ -o urls.json

# Get CSV format for spreadsheet analysis
scribe discover https://docs.fastapi.com/ -o urls.csv

Output Formats:

  • .txt - Simple URL list (default)
  • .csv - Rich metadata in spreadsheet format with columns for depth, keywords, timestamps, and filenames
  • .json - Same rich metadata as structured objects for programming

JSON metadata example:

{
  "url": "https://docs.fastapi.com/tutorial/first-steps/",
  "path": "/tutorial/first-steps/",
  "depth": 2,
  "keywords": ["tutorial", "first", "steps"],
  "filename_part": "tutorial/first-steps",
  "discovered_at": "2025-06-24T19:43:27.627987"
}
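
The CSV export carries the same fields as columns; the header row below is inferred from the JSON keys above rather than copied from real output:

url,path,depth,keywords,filename_part,discovered_at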

Why use discover separately?

  • Manual curation: Edit output files to remove pages you don't want
  • Planning: See the page count and site structure before processing
  • Analysis: Use JSON metadata to understand site hierarchy and content types
  • Selective processing: Only download the pages you actually need (see the sketch below)
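
For example, assuming the discover JSON export is a flat array of objects like the one shown above (worth verifying against your own output), jq can turn it into a curated URL list:

# Keep only shallow pages (depth <= 2) and extract their URLs
# (assumes urls.json is a JSON array of the metadata objects above)
jq -r '.[] | select(.depth <= 2) | .url' urls.json > shallow-urls.txt

# Process just that subset
scribe scrape shallow-urls.txt -o fastapi-docs/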

scrape - Convert to Markdown

Process URLs or a single page:

# Process a curated list of URLs
scribe scrape urls.txt -o fastapi-docs/

# Process a single page
scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/

Smart input detection: scrape automatically detects whether you're giving it:

  • A .txt file with URLs (one per line)
  • A single webpage URL (http:// or https://)

API Keys & Models

Default Model: openrouter/mistralai/codestral-2501 ⭐ (Best quality)

Alternative Models

  • openrouter/google/gemini-2.0-flash-exp:free (Free tier)
  • openrouter/anthropic/claude-3-haiku (Fast premium)

Setting API Key

# Setup: Add API keys to .env file
echo "OPENROUTER_API_KEY=your-openrouter-key" >> .env
echo "ANTHROPIC_API_KEY=your-anthropic-key" >> .env
echo "MISTRAL_API_KEY=your-mistral-key" >> .env

# Use default API key (OPENROUTER_API_KEY)
scribe process https://docs.example.com/ -o output/

# Use a different API key variable
scribe process https://docs.example.com/ -o output/ --api-key-env ANTHROPIC_API_KEY

Changing Models

# Use a different model with its corresponding API key
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/anthropic/claude-3-haiku \
  --api-key-env ANTHROPIC_API_KEY

# Use free model (still needs OpenRouter key)
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/google/gemini-2.0-flash-exp:free

Get a free API key at OpenRouter.

Workflow Examples

Complete Workflow (Most Common)

# One command to rule them all
scribe process https://docs.fastapi.com/ -o fastapi-docs/

Curated Workflow (Manual Selection)

# Step 1: Discover all pages
scribe discover https://docs.fastapi.com/ -o urls.txt

# Step 2: Edit urls.txt - remove pages you don't want
# Step 3: Process only the pages you kept
scribe scrape urls.txt -o fastapi-docs/
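
If you'd rather filter than hand-edit, ordinary text tools work on the URL list; the grep patterns here are only examples:

# Drop changelog and release-note pages before scraping
grep -vE '/(changelog|release-notes)/' urls.txt > curated-urls.txt
scribe scrape curated-urls.txt -o fastapi-docs/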

Single Page

# Process just one specific page
scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/

For Developers

  • Offline Documentation: Work with docs without internet
  • AI Tools: Feed clean docs into Claude, ChatGPT, or local AI
  • Documentation Search: Build custom search for your team (see the sketch below)
  • Backup: Archive documentation that might change or disappear
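
For the custom-search use case, a recursive grep over the output folder is a reasonable starting point (the folder name is taken from the earlier examples):

# Find every page that mentions dependency injection
grep -rn "dependency injection" fastapi-docs/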

For Teams

  • Internal Knowledge Base: Convert internal wikis to searchable Markdown
  • Compliance: Archive API documentation for regulatory requirements
  • Training Data: Clean documentation for training custom models

For Researchers

  • Literature Review: Convert technical documentation for analysis
  • Comparative Studies: Analyze documentation across different tools
  • Academic Research: Study how projects document their APIs

Advanced Usage

Separate Discovery and Processing

# Step 1: Discover all URLs (fast)
scribe discover https://docs.fastapi.com/ -o urls.txt

# Step 2: Process URLs to Markdown
scribe scrape urls.txt -o fastapi-docs/

Resume Processing

# Resume from URL #50 if processing was interrupted
scribe scrape urls.txt -o output/ --start-at 50

Custom Settings

# Use different model with custom timeout
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/anthropic/claude-3-haiku \
  --timeout 120000 \
  --verbose

# Use different API key variable
scribe process https://docs.example.com/ -o output/ \
  --api-key-env ANTHROPIC_API_KEY

# Combine custom model and API key variable
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/mistralai/codestral-2501 \
  --api-key-env OPENROUTER_API_KEY \
  --verbose

Output Structure

ScrollScribe saves one Markdown file per documentation page in the output folder you specify. You choose the folder name—organize by language, project, or however you like.

scribe process https://docs.python.org/3/ -o python-docs/
scribe process https://developer.mozilla.org/en-US/docs/Web/JavaScript -o javascript-docs/
python-docs/
├── index.md                # Homepage
├── getting-started.md      # Getting started guide
└── ...                     # Other pages

javascript-docs/
├── index.md
└── ...

Each file contains:

  • Clean Markdown formatting
  • Preserved code blocks and syntax highlighting
  • Working internal links (converted to relative paths)
  • Original page title as the filename

This flexible structure makes it easy to build your own docs library, organize by project or language, and prepare for future features like serving docs with an MCP server.

Large Sites (Use Fast Mode)

# Large documentation sites - use fast mode for speed
scribe process https://docs.microsoft.com/en-us/azure/ -o azure-docs/ --fast
scribe process https://developer.mozilla.org/en-US/docs/ -o mdn-docs/ --fast

Troubleshooting

"API key not found"

Create a .env file with your OpenRouter API key:

echo "OPENROUTER_API_KEY=your-key-here" > .env

"Rate limit error"

ScrollScribe automatically retries with backoff. For persistent issues:

  • Try the free models first
  • Use --fast mode to avoid API calls entirely

"Some pages failed"

Some sites block automated access. ScrollScribe will:

  • Show which URLs failed
  • Continue processing other pages
  • Let you retry failed URLs later
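
Since scrape accepts a .txt file of URLs, saving the failed URLs that ScrollScribe reports into a file gives you a simple retry loop (failed.txt is just an example name):

# Re-run only the URLs that failed the first time
scribe scrape failed.txt -o output/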

Site-specific issues

# Increase timeout for slow sites
scribe process https://slow-site.com/ -o output/ --timeout 120000

# Use verbose mode to see what's happening
scribe process https://site.com/ -o output/ --verbose

What's Different About ScrollScribe

Unlike simple web scrapers, ScrollScribe:

  • Understands documentation structure - follows internal links intelligently
  • Cleans content - removes navigation, ads, and irrelevant elements
  • Preserves formatting - maintains code blocks, headers, and structure
  • Handles modern sites - works with JavaScript-heavy documentation
  • Scales efficiently - processes hundreds of pages reliably

Contributing

Found a bug or want to add a feature?

  1. Open an issue describing the problem
  2. Fork the repository
  3. Make your changes
  4. Submit a pull request

Building & Publishing

This project uses Hatch for building and publishing. Contributors should have it installed.
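
The standard Hatch commands cover the usual build-and-publish cycle (project-specific configuration lives in pyproject.toml):

# Build the sdist and wheel into dist/
hatch build

# Upload to PyPI (requires credentials)
hatch publish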

License

MIT License - use ScrollScribe for any purpose, commercial or personal.


ScrollScribe - Turn any documentation site into clean Markdown files or structured metadata for AI processing.
