Pet Project. Universal documentation scraper with RAG-powered chatbot. Extracts docs to Markdown, indexes with vector search, provides intelligent Q&A with source links. TypeScript + ChromaDB + OpenAI.

📚 Doc Scrapper AI

Universal documentation scraper with a Multi-Collection RAG system and an intelligent chatbot

Automatically extracts technical documentation from websites, indexes it into separate vector collections, and provides contextual AI responses with the ability to select specific documentation sources.

🎯 System Architecture

3 independent servers:

| Server   | Port | Purpose                                 |
|----------|------|-----------------------------------------|
| ChromaDB | 8000 | Vector DB with multi-collection support |
| RAG API  | 8001 | HTTP API for AI queries                 |
| Web App  | 3000 | Next.js web interface                   |

✨ Key Features (UPDATED)

🗂️ Multi-Collection System ⭐ NEW

  • Separate collections for each documentation project
  • Collection Selector UI with project grouping
  • Contextual AI responses from correct sources
  • Dynamic switching between collections without restart
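As a hypothetical sketch, a collection name could be derived from the documentation URL, matching the naming seen later in this README (e.g. `ai-sdk.dev` → `ai-sdk-dev-docs`); the project's real derivation may differ:

```typescript
// Hypothetical: derive a collection name from a docs URL.
// The real naming logic in doc-scrapper may differ.
function collectionNameFromUrl(url: string): string {
  const host = new URL(url).hostname.replace(/^www\./, "");
  // Replace every non-alphanumeric run with "-" and append "-docs"
  return host.replace(/[^a-z0-9]+/gi, "-").toLowerCase() + "-docs";
}

console.log(collectionNameFromUrl("https://ai-sdk.dev/docs/introduction")); // "ai-sdk-dev-docs"
```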

📄 Documentation Consolidation for LLM ⭐ NEW

  • Single file with all documentation for large LLM contexts
  • Google Gemini Flash/Pro support (1M+ tokens)
  • ChatGPT-4.1 compatibility (128K+ tokens)
  • Markdown rendering with syntax highlighting
  • Copy/Download functionality from web interface
  • Token statistics and file size metrics

🔄 Selective Service Restart ⭐ NEW

  • Include mode: Restart only selected services
  • Exclude mode: Restart all except selected
  • Smart cleanup: Clean only relevant ports
  • Service monitoring: Display status of all services
  • Flexible commands: npm run restart:web, npm run restart:backend, etc.

🔍 Intelligent Scraping

  • Automatic page discovery via sitemap.xml or navigation
  • Smart content extraction that strips navigation, ads, and other page chrome
  • File organization that mirrors the original site's structure
  • Real-time progress tracking through web interface

🤖 RAG System (Retrieval-Augmented Generation)

  • Vector indexing of documents via OpenAI embeddings
  • Multi-collection management with REST API
  • Semantic search for relevant content
  • Intelligent responses using GPT-4o-mini
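The retrieval step at the heart of the RAG system can be sketched as ranking stored chunks by cosine similarity to a query embedding. This is an illustrative sketch, not the project's code; the real system delegates this work to ChromaDB:

```typescript
// Illustrative retrieval sketch: rank embedded chunks by cosine similarity.
type EmbeddedChunk = { text: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return [...chunks]
    .sort((x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, k);
}

const chunks: EmbeddedChunk[] = [
  { text: "install the SDK", embedding: [1, 0] },
  { text: "unrelated note", embedding: [0, 1] },
];
console.log(topK([0.9, 0.1], chunks, 1)[0].text); // "install the SDK"
```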

🌐 Full-Featured Web App ⭐ READY

  • Trial activation form with real-time scraping
  • Collection Selector with documentation grouping
  • Chat Interface with contextual AI responses
  • Progress tracking for scraping operations
  • Consolidation feature for LLM integration

🚀 Quick Start (UPDATED)

🔥 GitHub Launch (Recommended)

# 1. Clone repository
git clone https://github.com/AlexSerbinov/doc-scrapper.git
cd doc-scrapper

# 2. Install dependencies
npm install
cd web-app && npm install && cd ..

# 3. Create .env file
cp config/env-template.txt .env
# Add your OPENAI_API_KEY to .env file

# 4. Build and run (NEW!)
npm run restart  # ⭐ Universal restart of all services

# 5. Open web app
open http://localhost:3000

⚙️ Server Management (EXPANDED)

# === FULL RESTART ===
npm run restart         # ⭐ Universal restart of all services
npm run dev:all         # Alternative launch

# === SELECTIVE RESTART ===
npm run restart:web     # ⭐ NEW: Web app only
npm run restart:rag     # ⭐ NEW: RAG API only
npm run restart:chroma  # ⭐ NEW: ChromaDB only

# === EXCLUDE MODE ===
npm run restart:backend    # ⭐ NEW: Everything except web app
npm run restart:except-web # ⭐ NEW: Everything except web app
npm run restart:except-rag # ⭐ NEW: Everything except RAG API

# === SHUTDOWN ===
npm run stop           # ⭐ Stop everything

# === INDIVIDUAL LAUNCH ===
npm run dev:backend    # ChromaDB + RAG API
npm run web:dev        # Web app only

# === MONITORING ===
npm run health         # RAG API health check
curl http://localhost:8001/collections  # Collection list

📚 Multi-Collection System in Action

Current collections:

📚 Active collections:
├── ai-sdk-dev-docs (6,358 documents)   # AI SDK documentation
├── astro-test (6,216 documents)        # Astro documentation  
└── doc-scrapper-docs (3,178 documents) # Doc Scrapper documentation

How to use:

  1. Open web app: http://localhost:3000
  2. Enter documentation URL in form (or use existing collections)
  3. Wait for scraping and indexing to complete
  4. Select collection in Collection Selector
  5. Ask questions and receive contextual responses

LLM Consolidation:

  1. Select collection in web interface
  2. Click "Consolidate for LLM" button
  3. View consolidated documentation with markdown rendering
  4. Copy or download for use with large LLMs
  5. Statistics: File count, size, token estimation
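The statistics step can be sketched with the common "~4 characters per token" rule of thumb for English text; the actual consolidation feature may use a proper tokenizer, so treat this as an approximation:

```typescript
// Sketch of consolidation statistics; the 4-chars-per-token heuristic
// is a rough approximation, not the project's exact method.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function consolidationStats(files: { path: string; content: string }[]) {
  const combined = files.map((f) => f.content).join("\n");
  return {
    fileCount: files.length,
    sizeBytes: combined.length, // rough: assumes 1 byte per character
    estimatedTokens: estimateTokens(combined),
  };
}

const stats = consolidationStats([
  { path: "a.md", content: "x".repeat(100) },
  { path: "b.md", content: "y".repeat(300) },
]);
console.log(stats); // { fileCount: 2, sizeBytes: 401, estimatedTokens: 101 }
```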

REST API for collections:

# List all collections
curl http://localhost:8001/collections

# Query specific collection
curl -X POST http://localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"message":"How to use AI SDK?", "collectionName":"ai-sdk-dev-docs"}'

# Consolidate collection
curl -X POST http://localhost:8001/consolidate \
  -H "Content-Type: application/json" \
  -d '{"collectionName":"ai-sdk-dev-docs", "projectName":"AI SDK"}'

# Switch collection
curl -X POST http://localhost:8001/switch-collection \
  -H "Content-Type: application/json" \
  -d '{"collectionName":"astro-test"}'
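The same calls can be made from TypeScript. The endpoint paths and request fields below come from the curl examples in this README; the response shape is an assumption and may differ from the actual API:

```typescript
// Minimal client sketch for the RAG API's /query endpoint.
interface QueryRequest {
  message: string;
  collectionName: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildQuery(baseUrl: string, q: QueryRequest): HttpRequest {
  return {
    url: `${baseUrl}/query`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(q),
  };
}

// Send the query; the response shape is not specified by this README,
// so it is returned as unknown.
async function ask(q: QueryRequest): Promise<unknown> {
  const { url, ...init } = buildQuery("http://localhost:8001", q);
  const res = await fetch(url, init);
  return res.json();
}

const req = buildQuery("http://localhost:8001", {
  message: "How to use AI SDK?",
  collectionName: "ai-sdk-dev-docs",
});
console.log(req.url, req.body);
```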

🌐 Web Interface Features

✅ Ready Features:

  • 🎯 Trial Activation Form - real-time scraping of new documentation
  • 📊 Progress Tracking - live updates during processing
  • 🗂️ Collection Selector - choose between documentation projects
  • 💬 Chat Interface - AI assistant with contextual responses
  • 📄 LLM Consolidation - combine documentation for large LLMs
  • 📋 Copy/Download - convenient copying and downloading
  • 📱 Responsive Design - works on mobile and desktop
  • 🔄 Real-time Switching - switch between collections

🚀 Workflow:

URL Input → Progress Bar → Scraping → Indexing → 
Collection Created → Demo Chat → AI Responses → LLM Consolidation

📖 Detailed Usage

CLI Scraping (alternative method)

# Basic scraping via CLI
npm run build  # Compile before use
npm run dev -- "https://docs.example.com"

# With settings
npm run dev -- "https://ai-sdk.dev/docs/introduction" \
  -o ./my-docs \
  -f markdown \
  -m 100 \
  -c 5 \
  -d 1500 \
  --verbose

# IMPORTANT: use a double dash (--) so npm forwards the flags to the script
npm run dev -- "URL" -c 10 -d 2000   # ✅ CORRECT
npm run dev "URL" -c 10              # ❌ INCORRECT: npm swallows the flags

RAG System Commands

# Index documents
npm run rag:index <path-to-docs> [options]

# CLI chat (alternative to web interface)
npm run rag:chat

# Collection statistics
npm run rag:stats
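Conceptually, `rag:index` must split each Markdown file into overlapping chunks before embedding them. The chunk size and overlap below are hypothetical, not the project's defaults:

```typescript
// Illustrative chunking step: fixed-size windows with overlap so that
// sentences spanning a boundary appear in at least one full chunk.
// chunkSize/overlap values here are hypothetical defaults.
function chunkText(text: string, chunkSize = 200, overlap = 50): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step forward, keeping `overlap` chars shared
  }
  return chunks;
}

const doc = "a".repeat(500);
const chunks = chunkText(doc);
console.log(chunks.length); // 3
```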

📁 Project Structure

doc-scrapper/
├── 📜 scripts/             # Operational scripts
│   ├── restart.sh          # Universal service restart
│   ├── stop.sh             # Service shutdown
│   └── clean-collections.* # Collection management
├── 🐳 docker/              # Docker containers
│   ├── docker-compose.yml  # Service orchestration
│   ├── docker-start.sh     # Container startup
│   └── Dockerfile.rag-api  # RAG API container
├── ⚙️ config/              # Configuration files
│   ├── env-template.txt    # Environment template
│   └── .dockerignore       # Docker exclusions
├── 📚 docs/                # Documentation
│   ├── DOCKER_README.md    # Docker deployment guide
│   └── AGENT.md            # AI agent notes
├── 🚀 deployment/          # Production deployment
│   ├── deploy.sh           # Deployment script
│   ├── webhook.js          # GitHub webhook handler
│   └── webhook.service     # Systemd service
├── 📊 logs/                # Application logs
├── 💾 src/                 # Source code
├── 🌐 web-app/             # Next.js frontend
├── 🧠 memory-bank/         # Project context & history
└── 📋 Core files           # README, package.json, etc.

🛠️ Technologies (UPDATED)

Backend Stack

  • TypeScript - Strict typing
  • Node.js v24+ - Runtime
  • ChromaDB - Multi-collection vector DB
  • OpenAI - Embeddings (text-embedding-3-small) + LLM (GPT-4o-mini)
  • AI SDK - Vercel AI SDK v3.0

Frontend Stack ⭐ NEW

  • Next.js 15 - Full-stack framework with App Router
  • TypeScript - Type safety
  • TailwindCSS - Modern styling with dark theme
  • React Server/Client Components - Optimal architecture

DevOps & Scripts ⭐ NEW

  • Universal restart scripts - restart.sh, stop.sh
  • NPM scripts integration - npm run restart, npm run stop
  • Port management - Automatic cleanup of ports 3000, 8000, 8001
  • Process monitoring - Health checks and automatic recovery

🔧 Troubleshooting

Ports occupied:

npm run stop           # Automatically free ports
# or manually:
lsof -ti:3000 | xargs kill -9
lsof -ti:8000 | xargs kill -9  
lsof -ti:8001 | xargs kill -9

ChromaDB issues:

# Check if running
curl http://localhost:8000/api/v1/heartbeat

# Restart ChromaDB
npm run chroma:restart

Web app build errors:

# Next.js build
cd web-app && npm run build

# If utility function errors:
# Check that all utility functions are moved to lib/ folder

📊 Test Results (UPDATED)

Multi-Collection System:

  • 3 active collections with different content
  • Collection-specific queries work correctly
  • UI/UX with expandable selector and real-time switching
  • Web API integration (/api/collections, /api/chat) functional

Performance tests:

  • Form submission → real scraping process
  • Progress tracking → real-time updates
  • RAG indexing → automatic after scraping
  • Chat responses → contextual from correct sources

Test on ai-sdk.dev:

  • 487 URLs found via sitemap.xml
  • 53 seconds scraping time
  • 51 files created (280KB)
  • Collection created automatically with 6,358 documents

🎯 Ready Use Cases

Scenario 1: New User

  1. git clone https://github.com/AlexSerbinov/doc-scrapper.git
  2. npm install && cd web-app && npm install && cd ..
  3. Add OPENAI_API_KEY to .env file
  4. npm run restart
  5. Open http://localhost:3000
  6. Enter documentation URL
  7. Wait for completion and use chat

Scenario 2: Developer

  1. npm run restart - start all services
  2. Work with code
  3. npm run stop - stop everything when finished

Scenario 3: API Testing

# Health check
curl http://localhost:8001/health

# List collections
curl http://localhost:8001/collections | jq .

# Query specific collection
curl -X POST http://localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"message":"test", "collectionName":"ai-sdk-dev-docs"}'

🚧 Roadmap

✅ Completed:

  • Multi-Collection system with UI
  • Web App with full functionality
  • Universal restart scripts
  • Real-time progress tracking
  • Collection-aware chat interface

🔄 Planned:

  • Authentication system with trial limitations
  • CLI collection parameter (--collection-name)
  • Export to single file functionality
  • Embeddable chat widget for websites
  • Production deployment guides

📄 Environment Setup

Required:

# .env file
OPENAI_API_KEY=sk-your-openai-api-key-here

Optional:

# RAG settings
RAG_LLM_MODEL=gpt-4o-mini
RAG_EMBEDDING_MODEL=text-embedding-3-small
RAG_VECTOR_STORE_CONNECTION_STRING=http://localhost:8000

# Web App settings
NEXT_PUBLIC_RAG_API_URL=http://localhost:8001
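These settings are typically read with the values above as fallbacks. The variable names match this README; the helper itself is illustrative, not the project's actual config module:

```typescript
// Illustrative config loader: OPENAI_API_KEY is required, the RAG_*
// settings fall back to the defaults documented above.
function ragConfig(env: Record<string, string | undefined>) {
  if (!env.OPENAI_API_KEY) {
    throw new Error("OPENAI_API_KEY is required (see config/env-template.txt)");
  }
  return {
    apiKey: env.OPENAI_API_KEY,
    llmModel: env.RAG_LLM_MODEL ?? "gpt-4o-mini",
    embeddingModel: env.RAG_EMBEDDING_MODEL ?? "text-embedding-3-small",
    vectorStoreUrl: env.RAG_VECTOR_STORE_CONNECTION_STRING ?? "http://localhost:8000",
  };
}

const cfg = ragConfig({ OPENAI_API_KEY: "sk-test" });
console.log(cfg.llmModel); // "gpt-4o-mini"
```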

🤝 Contributing

  1. Fork the repository: https://github.com/AlexSerbinov/doc-scrapper
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open Pull Request

📄 License

MIT License - see LICENSE file for details.


🎯 The system is ready for production use as a multi-collection documentation AI assistant! 🚀
