CLI toolkit for ML engineers, developers, data scientists, and researchers.
Extract docs to Markdown • Generate rich CSV/JSON metadata • Prepare data for vector databases
Toolkit | Features | Installation | Quick Start | Processing Modes | Commands | FAQ
With ScrollScribe, you can build your own docs library in minutes. Automatically discover all pages on a documentation site and convert them to clean Markdown files—perfect for agentic workflows, custom search systems, or offline documentation.
Extract URLs with rich metadata (keywords, depth, timestamps) exported as TXT, CSV, or JSON.
Process single pages or URL lists. Choose fast mode (500+ pages/min) or LLM mode (publication-ready Markdown).
Point and go: discover + scrape in one command. Fast for bulk extraction, LLM for high-quality output.
- Fast mode: Quickly converts large documentation sites—great for bulk extraction, drafts, or when you don’t need perfect formatting. No API key required.
- LLM mode: Uses LLMs for the highest quality Markdown output—ideal for publishing, feeding into websites, or when you want perfectly structured docs. Requires an API key and takes longer per page.
- Discovers URLs from documentation sites with rich metadata (keywords, depth, timestamps) - export as TXT, CSV, or JSON
- Processes single pages or entire URL lists - choose fast mode (500+ pages/min) or LLM mode for publication-quality output
- Converts HTML to clean Markdown with preserved formatting, code blocks, and working links
- Outputs structured data perfect for AI agents, vector databases, or offline documentation
Examples:
scribe discover docs.fastapi.com -o urls.json → Get 200+ URLs with metadata for analysis
scribe process docs.fastapi.com -o fastapi-docs/ → Get 200+ clean Markdown files
scribe scrape single-page.html -o output/ → Process just one page
git clone https://github.com/your-username/scrollscribe
cd scrollscribe
uv sync  # or pip install -r requirements.txt
# Convert entire documentation site to Markdown
scribe process https://docs.fastapi.com/ -o fastapi-docs/
# That's it! All pages are now in the fastapi-docs/ folder
For highest quality output, add your API key:
# Create .env file with your API key
echo "OPENROUTER_API_KEY=your-key-here" > .env
# Now uses best model by default (Codestral 2501)
scribe process https://docs.fastapi.com/ -o fastapi-docs/
ScrollScribe offers two processing modes depending on your needs:
| Feature | Fast Mode | LLM Mode |
|---|---|---|
| Speed | 50-200 pages/minute | 10-15 pages/minute |
| Cost | Free | ~$0.005 per page (Codestral) |
| Quality | Good - removes navigation/ads | Excellent - AI-filtered content |
| API Key | Not required | Required |
| Best For | Large sites, quick extraction | High-quality documentation |
| Default Model | N/A | Codestral 2501 |
# Fast processing - no API key required
scribe process https://docs.fastapi.com/ -o fastapi-docs/ --fast
Good for:
- Large documentation sites (1000+ pages)
- Quick content extraction
- When you don't want to pay for API calls
# Uses Codestral 2501 by default - best quality
scribe process https://docs.fastapi.com/ -o fastapi-docs/
Good for:
- High-quality documentation extraction (default mode)
- When clean formatting is important
- Feeding into other AI tools
ScrollScribe has three main commands:
Convert an entire documentation site in one command:
# Discover all pages and convert them to Markdown
scribe process https://docs.fastapi.com/ -o fastapi-docs/
Extract URLs from a site with optional metadata (useful for manual curation):
# Get simple list of URLs
scribe discover https://docs.fastapi.com/ -o urls.txt
# Get rich metadata with depth, keywords, and timestamps
scribe discover https://docs.fastapi.com/ -o urls.json
# Get CSV format for spreadsheet analysis
scribe discover https://docs.fastapi.com/ -o urls.csv
Output Formats:
- .txt - Simple URL list (default)
- .csv - Rich metadata in spreadsheet format, with columns for depth, keywords, timestamps, and filenames
- .json - The same rich metadata as structured objects for programmatic use
JSON metadata example:
{
"url": "https://docs.fastapi.com/tutorial/first-steps/",
"path": "/tutorial/first-steps/",
"depth": 2,
"keywords": ["tutorial", "first", "steps"],
"filename_part": "tutorial/first-steps",
"discovered_at": "2025-06-24T19:43:27.627987"
}
Why use discover separately?
- Manual curation: Edit output files to remove pages you don't want (see the filtering sketch after this list)
- Planning: See how many pages and site structure before processing
- Analysis: Use JSON metadata to understand site hierarchy and content types
- Selective processing: Only download the pages you actually need
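For example, a minimal curation sketch, assuming the JSON export is a top-level array of objects like the one shown above and that jq is available (curated-urls.txt is just an example filename):
# Keep only pages at depth 2 or shallower, write them to a plain URL list, then process that list
jq -r '.[] | select(.depth <= 2) | .url' urls.json > curated-urls.txt
scribe scrape curated-urls.txt -o fastapi-docs/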
Process URLs or a single page:
# Process a curated list of URLs
scribe scrape urls.txt -o fastapi-docs/
# Process a single page
scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/
Smart input detection: scrape automatically detects if you're giving it:
- A .txt file with URLs (one per line)
- A single webpage URL (http:// or https://)
Default Model: openrouter/mistralai/codestral-2501 ⭐ (Best quality)
- openrouter/google/gemini-2.0-flash-exp:free (Free tier)
- openrouter/anthropic/claude-3-haiku (Fast premium)
# Setup: Add API keys to .env file
echo "OPENROUTER_API_KEY=your-openrouter-key" >> .env
echo "ANTHROPIC_API_KEY=your-anthropic-key" >> .env
echo "MISTRAL_API_KEY=your-mistral-key" >> .env
# Use default API key (OPENROUTER_API_KEY)
scribe process https://docs.example.com/ -o output/
# Use a different API key variable
scribe process https://docs.example.com/ -o output/ --api-key-env ANTHROPIC_API_KEY
# Use a different model with its corresponding API key
scribe process https://docs.example.com/ -o output/ \
--model openrouter/anthropic/claude-3-haiku \
--api-key-env ANTHROPIC_API_KEY
# Use free model (still needs OpenRouter key)
scribe process https://docs.example.com/ -o output/ \
--model openrouter/google/gemini-2.0-flash-exp:free
Get a free API key at OpenRouter.
# One command to rule them all
scribe process https://docs.fastapi.com/ -o fastapi-docs/
# Step 1: Discover all pages
scribe discover https://docs.fastapi.com/ -o urls.txt
# Step 2: Edit urls.txt - remove pages you don't want
# Step 3: Process only the pages you kept
scribe scrape urls.txt -o fastapi-docs/
# Process just one specific page
scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/
- Offline Documentation: Work with docs without internet
- AI Tools: Feed clean docs into Claude, ChatGPT, or local AI
- Documentation Search: Build custom search for your team
- Backup: Archive documentation that might change or disappear
- Internal Knowledge Base: Convert internal wikis to searchable Markdown
- Compliance: Archive API documentation for regulatory requirements
- Training Data: Clean documentation for training custom models (see the sketch after this list)
- Literature Review: Convert technical documentation for analysis
- Comparative Studies: Analyze documentation across different tools
- Academic Research: Study how projects document their APIs
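As a rough sketch of the training-data case, assuming you have already run scribe process into a folder such as fastapi-docs/ (corpus.md is just an example output name):
# Concatenate every generated Markdown page into a single corpus file
find fastapi-docs/ -name '*.md' -exec cat {} + > corpus.md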
# Step 1: Discover all URLs (fast)
scribe discover https://docs.fastapi.com/ -o urls.txt
# Step 2: Process URLs to Markdown
scribe scrape urls.txt -o fastapi-docs/
# Resume from URL #50 if processing was interrupted
scribe scrape urls.txt -o output/ --start-at 50
# Use different model with custom timeout
scribe process https://docs.example.com/ -o output/ \
--model openrouter/anthropic/claude-3-haiku \
--timeout 120000 \
--verbose
# Use different API key variable
scribe process https://docs.example.com/ -o output/ \
--api-key-env ANTHROPIC_API_KEY
# Combine custom model and API key variable
scribe process https://docs.example.com/ -o output/ \
--model openrouter/mistralai/codestral-2501 \
--api-key-env OPENROUTER_API_KEY \
--verbose
ScrollScribe saves one Markdown file per documentation page in the output folder you specify. You choose the folder name, so you can organize by language, by project, or however you like.
scribe process https://docs.python.org/3/ -o python-docs/
scribe process https://developer.mozilla.org/en-US/docs/Web/JavaScript -o javascript-docs/
output/
├── python-docs/
│   ├── index.md             # Homepage
│   ├── getting-started.md   # Getting started guide
│   └── ...                  # Other pages
└── javascript-docs/
    ├── index.md
    └── ...
Each file contains:
- Clean Markdown formatting
- Preserved code blocks and syntax highlighting
- Working internal links (converted to relative paths)
- Original page title as the filename
This flexible structure makes it easy to build your own docs library, organize by project or language, and prepare for future features like serving docs with an MCP server.
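For instance, once the files are on disk, plain grep already gives you a quick full-text search over your docs library (the folder name and query below are only examples):
# List every page in python-docs/ that mentions decorators, case-insensitively
grep -ril "decorators" python-docs/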
# Large documentation sites - use fast mode for speed
scribe process https://docs.microsoft.com/en-us/azure/ -o azure-docs/ --fast
scribe process https://developer.mozilla.org/en-US/docs/ -o mdn-docs/ --fast
Create a .env file with your OpenRouter API key:
echo "OPENROUTER_API_KEY=your-key-here" > .envScrollScribe automatically retries with backoff. For persistent issues:
- Try the free models first
- Use --fast mode to avoid API calls entirely
Some sites block automated access. ScrollScribe will:
- Show which URLs failed
- Continue processing other pages
- Let you retry failed URLs later (see the sketch after this list)
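A minimal retry sketch, assuming you copy the failed URLs from the console output into a file yourself (failed-urls.txt is a hypothetical name, not a file ScrollScribe writes for you):
# Re-run only the URLs that failed on the first pass
scribe scrape failed-urls.txt -o output/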
# Increase timeout for slow sites
scribe process https://slow-site.com/ -o output/ --timeout 120000
# Use verbose mode to see what's happening
scribe process https://site.com/ -o output/ --verbose
Unlike simple web scrapers, ScrollScribe:
- Understands documentation structure - follows internal links intelligently
- Cleans content - removes navigation, ads, and irrelevant elements
- Preserves formatting - maintains code blocks, headers, and structure
- Handles modern sites - works with JavaScript-heavy documentation
- Scales efficiently - processes hundreds of pages reliably
Found a bug or want to add a feature?
- Open an issue describing the problem
- Fork the repository
- Make your changes
- Submit a pull request
This project uses Hatch for building and publishing. Contributors should have it installed.
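A minimal local-build sketch, assuming a standard Hatch workflow (the exact dev commands for this repo may differ):
# Install Hatch, then build source and wheel distributions into dist/
pip install hatch
hatch build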
MIT License - use ScrollScribe for any purpose, commercial or personal.
ScrollScribe - Turn any documentation site into clean Markdown files or structured metadata for AI processing.