
ScrollScribe is a Python CLI toolkit that grabs docs or indexes website pages and converts them into clean local Markdown using browser automation and LLM filtering, perfect for building RAG datasets.


ScrollScribe

CLI toolkit for ML engineers, developers, data scientists, and researchers.

Extract docs to Markdown • Generate rich CSV/JSON metadata • Prepare data for vector databases

Powered by Crawl4AI • Built with Typer • Python 3.10+ • MIT License

Toolkit | Features | Installation | Quick Start | Processing Modes | Commands | FAQ


With ScrollScribe, you can build your own docs library in minutes. Automatically discover all pages on a documentation site and convert them to clean Markdown files—perfect for agentic workflows, custom search systems, or offline documentation.


The Toolkit

discover - URL Extraction + Metadata

Extract URLs with rich metadata (keywords, depth, timestamps) exported as TXT, CSV, or JSON.

scrape - Page Processing

Process single pages or URL lists. Choose fast mode (50-200 pages/min) or LLM mode (publication-ready Markdown).

process - Unified Pipeline

Point and go: discover + scrape in one command. Fast for bulk extraction, LLM for high-quality output.


⚡ Processing Modes

Fast Mode (--fast)

  • Quickly converts large documentation sites—great for bulk extraction, drafts, or when you don’t need perfect formatting. No API key required.

AI Mode (default, or --no-fast)

  • Uses LLMs for the highest quality Markdown output—ideal for publishing, feeding into websites, or when you want perfectly structured docs. Requires an API key and takes longer per page.

What ScrollScribe Does

  1. Discovers URLs from documentation sites with rich metadata (keywords, depth, timestamps), exportable as TXT, CSV, or JSON
  2. Processes single pages or entire URL lists - choose fast mode (50-200 pages/min) or LLM mode for publication-quality output
  3. Converts HTML to clean Markdown with preserved formatting, code blocks, and working links
  4. Outputs structured data perfect for AI agents, vector databases, or offline documentation

Examples:

  • scribe discover docs.fastapi.com -o urls.json → Get 200+ URLs with metadata for analysis
  • scribe process docs.fastapi.com -o fastapi-docs/ → Get 200+ clean Markdown files
  • scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/ → Process just one page

Installation

git clone https://github.com/JamesN-dev/Scroll-Scribe
cd Scroll-Scribe
uv sync  # or pip install -r requirements.txt
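
To confirm the install, check that the scribe entry point is on your PATH (the command name is taken from the usage examples throughout this README):

# Verify the CLI is available
scribe --help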

Quick Start

Basic Usage

# Convert entire documentation site to Markdown
scribe process https://docs.fastapi.com/ -o fastapi-docs/

# That's it! All pages are now in the fastapi-docs/ folder

Set up API Key (Recommended)

For the highest-quality output, add your API key:

# Create .env file with your API key
echo "OPENROUTER_API_KEY=your-key-here" > .env

# Now uses best model by default (Codestral 2501)
scribe process https://docs.fastapi.com/ -o fastapi-docs/

Processing Modes

ScrollScribe offers two processing modes depending on your needs:

Feature         Fast Mode                        AI Mode
Speed           50-200 pages/minute              10-15 pages/minute
Cost            Free                             ~$0.005 per page (Codestral)
Quality         Good - removes navigation/ads    Excellent - AI-filtered content
API Key         Not required                     Required
Best For        Large sites, quick extraction    High-quality documentation
Default Model   N/A                              Codestral 2501

Fast Mode (No API Key Needed)

# Fast processing - no API key required
scribe process https://docs.fastapi.com/ -o fastapi-docs/ --fast

Good for:

  • Large documentation sites (1000+ pages)
  • Quick content extraction
  • When you don't want to pay for API calls

AI Mode (Default with API Key)

# Uses Codestral 2501 by default - best quality
scribe process https://docs.fastapi.com/ -o fastapi-docs/

Good for:

  • High-quality documentation extraction (default mode)
  • When clean formatting is important
  • Feeding into other AI tools

Commands

ScrollScribe has three main commands:

process - Complete Pipeline (Most Common)

Convert an entire documentation site in one command:

# Discover all pages and convert them to Markdown
scribe process https://docs.fastapi.com/ -o fastapi-docs/

discover - Find All Documentation Pages

Extract URLs from a site with optional metadata (useful for manual curation):

# Get simple list of URLs
scribe discover https://docs.fastapi.com/ -o urls.txt

# Get rich metadata with depth, keywords, and timestamps
scribe discover https://docs.fastapi.com/ -o urls.json

# Get CSV format for spreadsheet analysis
scribe discover https://docs.fastapi.com/ -o urls.csv

Output Formats:

  • .txt - Simple URL list (default)
  • .csv - Rich metadata in spreadsheet format with columns for depth, keywords, timestamps, and filenames
  • .json - Same rich metadata as structured objects for programming

JSON metadata example:

{
  "url": "https://docs.fastapi.com/tutorial/first-steps/",
  "path": "/tutorial/first-steps/",
  "depth": 2,
  "keywords": ["tutorial", "first", "steps"],
  "filename_part": "tutorial/first-steps",
  "discovered_at": "2025-06-24T19:43:27.627987"
}
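
The CSV export carries the same fields as columns; the header row below is inferred from the JSON keys above rather than copied from real output:

url,path,depth,keywords,filename_part,discovered_at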

Why use discover separately?

  • Manual curation: Edit output files to remove pages you don't want
  • Planning: See the page count and site structure before processing
  • Analysis: Use JSON metadata to understand site hierarchy and content types
  • Selective processing: Only download the pages you actually need (see the sketch below)
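
For example, assuming the discover JSON export is a flat array of objects like the one shown above (worth verifying against your own output), jq can turn it into a curated URL list:

# Keep only shallow pages (depth <= 2) and extract their URLs
# (assumes urls.json is a JSON array of the metadata objects above)
jq -r '.[] | select(.depth <= 2) | .url' urls.json > shallow-urls.txt

# Process just that subset
scribe scrape shallow-urls.txt -o fastapi-docs/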

scrape - Convert to Markdown

Process URLs or a single page:

# Process a curated list of URLs
scribe scrape urls.txt -o fastapi-docs/

# Process a single page
scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/

Smart input detection: scrape automatically detects whether you're giving it:

  • A .txt file with URLs (one per line)
  • A single webpage URL (http:// or https://)

API Keys & Models

Default Model: openrouter/mistralai/codestral-2501 ⭐ (Best quality)

Alternative Models

  • openrouter/google/gemini-2.0-flash-exp:free (Free tier)
  • openrouter/anthropic/claude-3-haiku (Fast premium)

Setting API Key

# Setup: Add API keys to .env file
echo "OPENROUTER_API_KEY=your-openrouter-key" >> .env
echo "ANTHROPIC_API_KEY=your-anthropic-key" >> .env
echo "MISTRAL_API_KEY=your-mistral-key" >> .env

# Use default API key (OPENROUTER_API_KEY)
scribe process https://docs.example.com/ -o output/

# Use a different API key variable
scribe process https://docs.example.com/ -o output/ --api-key-env ANTHROPIC_API_KEY

Changing Models

# Use a different model with its corresponding API key
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/anthropic/claude-3-haiku \
  --api-key-env ANTHROPIC_API_KEY

# Use free model (still needs OpenRouter key)
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/google/gemini-2.0-flash-exp:free

Get a free API key at OpenRouter.

Workflow Examples

Complete Workflow (Most Common)

# One command to rule them all
scribe process https://docs.fastapi.com/ -o fastapi-docs/

Curated Workflow (Manual Selection)

# Step 1: Discover all pages
scribe discover https://docs.fastapi.com/ -o urls.txt

# Step 2: Edit urls.txt - remove pages you don't want
# Step 3: Process only the pages you kept
scribe scrape urls.txt -o fastapi-docs/
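
If you'd rather filter than hand-edit, ordinary text tools work on the URL list; the grep patterns here are only examples:

# Drop changelog and release-note pages before scraping
grep -vE '/(changelog|release-notes)/' urls.txt > curated-urls.txt
scribe scrape curated-urls.txt -o fastapi-docs/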

Single Page

# Process just one specific page
scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/

For Developers

  • Offline Documentation: Work with docs without internet
  • AI Tools: Feed clean docs into Claude, ChatGPT, or local AI
  • Documentation Search: Build custom search for your team (see the sketch below)
  • Backup: Archive documentation that might change or disappear
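
For the custom-search use case, a recursive grep over the output folder is a reasonable starting point (the folder name is taken from the earlier examples):

# Find every page that mentions dependency injection
grep -rn "dependency injection" fastapi-docs/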

For Teams

  • Internal Knowledge Base: Convert internal wikis to searchable Markdown
  • Compliance: Archive API documentation for regulatory requirements
  • Training Data: Clean documentation for training custom models

For Researchers

  • Literature Review: Convert technical documentation for analysis
  • Comparative Studies: Analyze documentation across different tools
  • Academic Research: Study how projects document their APIs

Advanced Usage

Separate Discovery and Processing

# Step 1: Discover all URLs (fast)
scribe discover https://docs.fastapi.com/ -o urls.txt

# Step 2: Process URLs to Markdown
scribe scrape urls.txt -o fastapi-docs/

Resume Processing

# Resume from URL #50 if processing was interrupted
scribe scrape urls.txt -o output/ --start-at 50

Custom Settings

# Use different model with custom timeout
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/anthropic/claude-3-haiku \
  --timeout 120000 \
  --verbose

# Use different API key variable
scribe process https://docs.example.com/ -o output/ \
  --api-key-env ANTHROPIC_API_KEY

# Combine custom model and API key variable
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/mistralai/codestral-2501 \
  --api-key-env OPENROUTER_API_KEY \
  --verbose

Output Structure

ScrollScribe saves one Markdown file per documentation page in the output folder you specify. You choose the folder name—organize by language, project, or however you like.

scribe process https://docs.python.org/3/ -o python-docs/
scribe process https://developer.mozilla.org/en-US/docs/Web/JavaScript -o javascript-docs/
python-docs/
├── index.md                # Homepage
├── getting-started.md      # Getting started guide
└── ...                     # Other pages

javascript-docs/
├── index.md
└── ...

Each file contains:

  • Clean Markdown formatting
  • Preserved code blocks and syntax highlighting
  • Working internal links (converted to relative paths)
  • Original page title as the filename

This flexible structure makes it easy to build your own docs library, organize by project or language, and prepare for future features like serving docs with an MCP server.

Large Sites (Use Fast Mode)

# Large documentation sites - use fast mode for speed
scribe process https://docs.microsoft.com/en-us/azure/ -o azure-docs/ --fast
scribe process https://developer.mozilla.org/en-US/docs/ -o mdn-docs/ --fast

Troubleshooting

"API key not found"

Create a .env file with your OpenRouter API key:

echo "OPENROUTER_API_KEY=your-key-here" > .env

"Rate limit error"

ScrollScribe automatically retries with backoff. For persistent issues:

  • Try the free models first
  • Use --fast mode to avoid API calls entirely

"Some pages failed"

Some sites block automated access. ScrollScribe will:

  • Show which URLs failed
  • Continue processing other pages
  • Let you retry failed URLs later
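
Since scrape accepts a .txt file of URLs, saving the failed URLs that ScrollScribe reports into a file gives you a simple retry loop (failed.txt is just an example name):

# Re-run only the URLs that failed the first time
scribe scrape failed.txt -o output/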

Site-specific issues

# Increase timeout for slow sites
scribe process https://slow-site.com/ -o output/ --timeout 120000

# Use verbose mode to see what's happening
scribe process https://site.com/ -o output/ --verbose

What's Different About ScrollScribe

Unlike simple web scrapers, ScrollScribe:

  • Understands documentation structure - follows internal links intelligently
  • Cleans content - removes navigation, ads, and irrelevant elements
  • Preserves formatting - maintains code blocks, headers, and structure
  • Handles modern sites - works with JavaScript-heavy documentation
  • Scales efficiently - processes hundreds of pages reliably

Contributing

Found a bug or want to add a feature?

  1. Open an issue describing the problem
  2. Fork the repository
  3. Make your changes
  4. Submit a pull request

Building & Publishing

This project uses Hatch for building and publishing. Contributors should have it installed.
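
The standard Hatch commands cover the usual build-and-publish cycle (project-specific configuration lives in pyproject.toml):

# Build the sdist and wheel into dist/
hatch build

# Upload to PyPI (requires credentials)
hatch publish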

License

MIT License - use ScrollScribe for any purpose, commercial or personal.


ScrollScribe - Turn any documentation site into clean Markdown files or structured metadata for AI processing.
