Universal documentation scraper with a Multi-Collection RAG system and an intelligent chat bot.
It automatically extracts technical documentation from websites, indexes it into separate vector collections, and provides contextual AI responses, letting you select which documentation sources to draw from.
| Server | Port | Purpose | Status |
|---|---|---|---|
| ChromaDB | 8000 | Vector DB with multi-collection support | ✅ |
| RAG API | 8001 | HTTP API for AI queries | ✅ |
| Web App | 3000 | Next.js interface with UI | ✅ |
- Separate collections for each documentation project
- Collection Selector UI with project grouping
- Contextual AI responses from correct sources
- Dynamic switching between collections without restart
- Single file with all documentation for large LLM contexts
- Google Gemini Flash/Pro support (1M+ tokens)
- ChatGPT-4.1 compatibility (128K+ tokens)
- Markdown rendering with syntax highlighting
- Copy/Download functionality from web interface
- Token statistics and file size metrics (a rough estimation approach is sketched after this list)
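The README does not say how the token statistics are computed; the sketch below uses a common rough heuristic of about 4 characters per token. The file name and the heuristic itself are illustrative assumptions, not the project's actual implementation.

```typescript
// Hypothetical token/size statistics for a consolidated file.
// Assumes ~4 characters per token, a rough rule of thumb for English
// text with OpenAI-style tokenizers (not an exact BPE count).
import { readFileSync, statSync } from "node:fs";

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic only
}

const path = "./consolidated-docs.md"; // hypothetical output file name
const content = readFileSync(path, "utf8");
console.log(`Size: ${(statSync(path).size / 1024).toFixed(1)} KB`);
console.log(`Estimated tokens: ${estimateTokens(content).toLocaleString()}`);
```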
- Include mode: Restart only selected services
- Exclude mode: Restart all except selected
- Smart cleanup: Clean only relevant ports
- Service monitoring: Display status of all services
- Flexible commands: `npm run restart:web`, `npm run restart:backend`, etc. (the include/exclude selection idea is sketched after this list)
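To illustrate the include/exclude idea (an assumption about the behavior, not a transcription of `restart.sh`), the sketch below filters a service-to-port map in both modes; the service names and ports follow the table above.

```typescript
// Illustrative selection logic for selective restarts. The mapping
// mirrors the ports table above; the filtering itself is an assumption.
const services: Record<string, number> = { chroma: 8000, rag: 8001, web: 3000 };

function selectPorts(selected: string[], mode: "include" | "exclude"): number[] {
  return Object.entries(services)
    .filter(([name]) =>
      mode === "include" ? selected.includes(name) : !selected.includes(name)
    )
    .map(([, port]) => port);
}

console.log(selectPorts(["web"], "include")); // [3000]       ~ restart:web
console.log(selectPorts(["web"], "exclude")); // [8000, 8001] ~ restart:backend
```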
- Automatic page discovery via sitemap.xml or navigation (the sitemap technique is sketched after this list)
- Smart content extraction that strips navigation, ads, and other unnecessary elements
- Structured file organization that mirrors the original site's structure
- Real-time progress tracking through the web interface
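As a sketch of the sitemap technique (illustrative only; the real scraper also falls back to navigation links and filters the results), page discovery can be as simple as fetching `sitemap.xml` and extracting its `<loc>` entries:

```typescript
// Minimal sitemap-based page discovery (Node 18+ for global fetch).
// Illustrative of the approach, not the scraper's actual code.
async function discoverPages(baseUrl: string): Promise<string[]> {
  const res = await fetch(new URL("/sitemap.xml", baseUrl));
  if (!res.ok) return []; // a real implementation would crawl navigation instead
  const xml = await res.text();
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
}

discoverPages("https://docs.example.com").then((urls) =>
  console.log(`Found ${urls.length} URLs`)
);
```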
- Vector indexing of documents via OpenAI embeddings
- Multi-collection management with a REST API
- Semantic search for relevant content
- Intelligent responses using GPT-4o-mini (the full query pipeline is sketched after this list)
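A minimal sketch of this pipeline, assuming the `chromadb` and `openai` npm clients and the models named in the configuration section; the prompt wording, `nResults` value, and example question are illustrative, not the project's actual code:

```typescript
// Sketch of the RAG flow: embed the question, retrieve nearest chunks
// from the selected ChromaDB collection, answer with GPT-4o-mini.
import { ChromaClient } from "chromadb";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const chroma = new ChromaClient({ path: "http://localhost:8000" });

async function ask(question: string, collectionName: string): Promise<string> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });

  const collection = await chroma.getCollection({ name: collectionName });
  const results = await collection.query({
    queryEmbeddings: [data[0].embedding],
    nResults: 5, // illustrative choice
  });
  const context = (results.documents[0] ?? []).join("\n---\n");

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: `Answer using only this documentation:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}

ask("How do I stream responses?", "ai-sdk-dev-docs").then(console.log);
```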
- Trial activation form with real-time scraping
- Collection Selector with documentation grouping
- Chat Interface with contextual AI responses
- Progress tracking for scraping operations
- Consolidation feature for LLM integration
```bash
# 1. Clone the repository
git clone https://github.com/AlexSerbinov/doc-scrapper.git
cd doc-scrapper

# 2. Install dependencies
npm install
cd web-app && npm install && cd ..

# 3. Create the .env file
cp config/env-template.txt .env
# Add your OPENAI_API_KEY to the .env file

# 4. Build and run
npm run restart   # ⭐ Universal restart of all services

# 5. Open the web app
open http://localhost:3000
```

```bash
# === FULL RESTART ===
npm run restart # ⭐ Universal restart of all services
npm run dev:all # Alternative launch
# === SELECTIVE RESTART ===
npm run restart:web # ⭐ NEW: Web app only
npm run restart:rag # ⭐ NEW: RAG API only
npm run restart:chroma # ⭐ NEW: ChromaDB only
# === EXCLUDE MODE ===
npm run restart:backend # ⭐ NEW: Everything except web app
npm run restart:except-web # ⭐ NEW: Everything except web app
npm run restart:except-rag # ⭐ NEW: Everything except RAG API
# === SHUTDOWN ===
npm run stop # ⭐ Stop everything
# === INDIVIDUAL LAUNCH ===
npm run dev:backend # ChromaDB + RAG API
npm run web:dev # Web app only
# === MONITORING ===
npm run health # RAG API health check
curl http://localhost:8001/collections   # Collection list
```

```
📚 Active collections:
├── ai-sdk-dev-docs (6,358 documents) # AI SDK documentation
├── astro-test (6,216 documents) # Astro documentation
└── doc-scrapper-docs (3,178 documents)   # Doc Scrapper documentation
```
- Open web app: http://localhost:3000
- Enter a documentation URL in the form (or use an existing collection)
- Wait for the scraping and indexing process to complete
- Select collection in Collection Selector
- Ask questions and receive contextual responses
- Select a collection in the web interface
- Click the "Consolidate for LLM" button
- View the consolidated documentation with markdown rendering
- Copy or download it for use with large-context LLMs
- Statistics: file count, size, token estimation
```bash
# List all collections
curl http://localhost:8001/collections

# Query a specific collection
curl -X POST http://localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"message":"How to use AI SDK?", "collectionName":"ai-sdk-dev-docs"}'

# Consolidate a collection
curl -X POST http://localhost:8001/consolidate \
  -H "Content-Type: application/json" \
  -d '{"collectionName":"ai-sdk-dev-docs", "projectName":"AI SDK"}'

# Switch the active collection
curl -X POST http://localhost:8001/switch-collection \
  -H "Content-Type: application/json" \
  -d '{"collectionName":"astro-test"}'
```
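The same `/query` call from TypeScript with `fetch` (Node 18+ or a browser); the response shape is not documented here, so the example simply prints the parsed JSON:

```typescript
// POST a question to the RAG API; the response body shape is an
// assumption - inspect the printed JSON to see the actual fields.
const res = await fetch("http://localhost:8001/query", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    message: "How to use AI SDK?",
    collectionName: "ai-sdk-dev-docs",
  }),
});
console.log(await res.json());
```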
-d '{"collectionName":"astro-test"}'- 🎯 Trial Activation Form - real-time scraping of new documentation
- 📊 Progress Tracking - live updates during processing
- 🗂️ Collection Selector - choose between documentation projects
- 💬 Chat Interface - AI assistant with contextual responses
- 📄 LLM Consolidation - combine documentation for large LLMs
- 📋 Copy/Download - convenient copying and downloading
- 📱 Responsive Design - adaptive design
- 🔄 Real-time Switching - switch between collections
```
URL Input → Progress Bar → Scraping → Indexing →
Collection Created → Demo Chat → AI Responses → LLM Consolidation
```
```bash
# Basic scraping via CLI
npm run build   # Compile before use
npm run dev -- "https://docs.example.com"

# With options
npm run dev -- "https://ai-sdk.dev/docs/introduction" \
  -o ./my-docs \
  -f markdown \
  -m 100 \
  -c 5 \
  -d 1500 \
  --verbose

# IMPORTANT: use the double dash -- so npm forwards the flags
# to the script instead of interpreting them itself
npm run dev -- "URL" -c 10 -d 2000   # ✅ CORRECT
npm run dev "URL" -c 10              # ❌ INCORRECT
```

```bash
# Index documents
npm run rag:index <path-to-docs> [options]
# CLI chat (alternative to web interface)
npm run rag:chat
# Collection statistics
npm run rag:stats
```

```
doc-scrapper/
├── 📜 scripts/ # Operational scripts
│ ├── restart.sh # Universal service restart
│ ├── stop.sh # Service shutdown
│ └── clean-collections.* # Collection management
├── 🐳 docker/ # Docker containers
│ ├── docker-compose.yml # Service orchestration
│ ├── docker-start.sh # Container startup
│ └── Dockerfile.rag-api # RAG API container
├── ⚙️ config/ # Configuration files
│ ├── env-template.txt # Environment template
│ └── .dockerignore # Docker exclusions
├── 📚 docs/ # Documentation
│ ├── DOCKER_README.md # Docker deployment guide
│ └── AGENT.md # AI agent notes
├── 🚀 deployment/ # Production deployment
│ ├── deploy.sh # Deployment script
│ ├── webhook.js # GitHub webhook handler
│ └── webhook.service # Systemd service
├── 📊 logs/ # Application logs
├── 💾 src/ # Source code
├── 🌐 web-app/ # Next.js frontend
├── 🧠 memory-bank/ # Project context & history
└── 📋 Core files               # README, package.json, etc.
```
- TypeScript - Strict typing
- Node.js v24+ - Runtime
- ChromaDB - Multi-collection vector DB
- OpenAI - Embeddings (text-embedding-3-small) + LLM (GPT-4o-mini)
- AI SDK - Vercel AI SDK v3.0
- Next.js 15 - Full-stack framework with App Router
- TypeScript - Type safety
- TailwindCSS - Modern styling with dark theme
- React Server/Client Components - Optimal architecture
- Universal restart scripts: `restart.sh`, `stop.sh`
- NPM scripts integration: `npm run restart`, `npm run stop`
- Port management: automatic cleanup of ports 3000, 8000, 8001
- Process monitoring: health checks and automatic recovery (an illustrative polling sketch follows this list)
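As an illustration of such monitoring (not the actual scripts), a small poller against the RAG API health endpoint might look like this; the interval and timeout values are arbitrary choices:

```typescript
// Poll the RAG API /health endpoint and report status.
// Interval and timeout values are illustrative assumptions.
async function pollHealth(url = "http://localhost:8001/health"): Promise<void> {
  while (true) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(2000) });
      console.log(`${new Date().toISOString()} ${res.ok ? "healthy" : `HTTP ${res.status}`}`);
    } catch {
      console.log(`${new Date().toISOString()} unreachable`);
    }
    await new Promise((r) => setTimeout(r, 5000));
  }
}

pollHealth();
```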
```bash
npm run stop   # Automatically free the ports
# or manually:
lsof -ti:3000 | xargs kill -9
lsof -ti:8000 | xargs kill -9
lsof -ti:8001 | xargs kill -9
```

```bash
# Check if ChromaDB is running
curl http://localhost:8000/api/v1/heartbeat
# Restart ChromaDB
npm run chroma:restart
```

```bash
# Next.js build
cd web-app && npm run build
# If you hit utility-function errors:
# check that all utility functions are moved to the lib/ folder
```

- ✅ 3 active collections with different content
- ✅ Collection-specific queries work correctly
- ✅ UI/UX with expandable selector and real-time switching
- ✅ Web API integration (/api/collections, /api/chat) functional
- ✅ Form submission → real scraping process
- ✅ Progress tracking → real-time updates
- ✅ RAG indexing → automatic after scraping
- ✅ Chat responses → contextual from correct sources
- ✅ 487 URLs found via sitemap.xml
- ✅ 53 seconds scraping time
- ✅ 51 files created (280KB)
- ✅ Collection created automatically with 6,358 documents
- `git clone https://github.com/AlexSerbinov/doc-scrapper.git`
- `npm install && cd web-app && npm install && cd ..`
- Add `OPENAI_API_KEY` to the `.env` file
- `npm run restart`
- Open http://localhost:3000
- Enter a documentation URL
- Wait for completion and use the chat
- `npm run restart` - start all services
- Work with the code
- `npm run stop` - stop everything when finished
```bash
# Health check
curl http://localhost:8001/health

# List collections
curl http://localhost:8001/collections | jq .

# Query a specific collection
curl -X POST http://localhost:8001/query \
  -H "Content-Type: application/json" \
-d '{"message":"test", "collectionName":"ai-sdk-dev-docs"}'- Multi-Collection system with UI
- Web App with full functionality
- Universal restart scripts
- Real-time progress tracking
- Collection-aware chat interface
- Authentication system with trial limitations
- CLI collection parameter (`--collection-name`)
- Export-to-single-file functionality
- Embeddable chat widget for websites
- Production deployment guides
```bash
# .env file
OPENAI_API_KEY=sk-your-openai-api-key-here

# RAG settings
RAG_LLM_MODEL=gpt-4o-mini
RAG_EMBEDDING_MODEL=text-embedding-3-small
RAG_VECTOR_STORE_CONNECTION_STRING=http://localhost:8000

# Web App settings
NEXT_PUBLIC_RAG_API_URL=http://localhost:8001
```
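A sketch of how these settings might be read on the Node side, with the documented defaults as fallbacks; this is illustrative, not the project's actual config loader:

```typescript
// Illustrative config loading with the defaults documented above.
// Not the project's actual loader; just process.env with fallbacks.
const config = {
  llmModel: process.env.RAG_LLM_MODEL ?? "gpt-4o-mini",
  embeddingModel: process.env.RAG_EMBEDDING_MODEL ?? "text-embedding-3-small",
  vectorStoreUrl:
    process.env.RAG_VECTOR_STORE_CONNECTION_STRING ?? "http://localhost:8000",
};

if (!process.env.OPENAI_API_KEY) {
  throw new Error("OPENAI_API_KEY is required - add it to your .env file");
}
console.log(config);
```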
- Fork the repository: https://github.com/AlexSerbinov/doc-scrapper
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
MIT License - see LICENSE file for details.
🔗 Useful Links:
- 📜 Scripts Documentation
- 🐳 Docker Guide
- ⚙️ Configuration Guide
- 📚 Documentation Hub
- 🚀 Deployment Guide
- 🐛 Report Bug
- 💡 Request Feature
- 💬 GitHub Discussions
🎯 The system is ready for production use as a multi-collection documentation AI assistant! 🚀