Pet Project. Universal documentation scraper with RAG-powered chatbot. Extracts docs to Markdown, indexes with vector search, provides intelligent Q&A with source links. TypeScript + ChromaDB + OpenAI.

📚 Doc Scrapper AI

Universal documentation scraper with a Multi-Collection RAG system and an intelligent chatbot

Automatically extracts technical documentation from websites, indexes it into separate vector collections, and provides contextual AI responses with the ability to select specific documentation sources.

🎯 System Architecture

3 independent servers:

| Server   | Port | Purpose                                 |
|----------|------|-----------------------------------------|
| ChromaDB | 8000 | Vector DB with multi-collection support |
| RAG API  | 8001 | HTTP API for AI queries                 |
| Web App  | 3000 | Next.js web interface                   |

✨ Key Features (UPDATED)

🗂️ Multi-Collection System ⭐ NEW

  • Separate collections for each documentation project
  • Collection Selector UI with project grouping
  • Contextual AI responses from correct sources
  • Dynamic switching between collections without restart
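As a hypothetical sketch, a collection name could be derived from the documentation URL, matching the naming seen later in this README (e.g. `ai-sdk.dev` → `ai-sdk-dev-docs`); the project's real derivation may differ:

```typescript
// Hypothetical: derive a collection name from a docs URL.
// The real naming logic in doc-scrapper may differ.
function collectionNameFromUrl(url: string): string {
  const host = new URL(url).hostname.replace(/^www\./, "");
  // Replace every non-alphanumeric run with "-" and append "-docs"
  return host.replace(/[^a-z0-9]+/gi, "-").toLowerCase() + "-docs";
}

console.log(collectionNameFromUrl("https://ai-sdk.dev/docs/introduction")); // "ai-sdk-dev-docs"
```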

📄 Documentation Consolidation for LLM ⭐ NEW

  • Single file with all documentation for large LLM contexts
  • Google Gemini Flash/Pro support (1M+ tokens)
  • ChatGPT-4.1 compatibility (128K+ tokens)
  • Markdown rendering with syntax highlighting
  • Copy/Download functionality from web interface
  • Token statistics and file size metrics

🔄 Selective Service Restart ⭐ NEW

  • Include mode: Restart only selected services
  • Exclude mode: Restart all except selected
  • Smart cleanup: Clean only relevant ports
  • Service monitoring: Display status of all services
  • Flexible commands: npm run restart:web, npm run restart:backend, etc.

🔍 Intelligent Scraping

  • Automatic page discovery via sitemap.xml or navigation
  • Smart content extraction that strips navigation, ads, and other page chrome
  • File organization that mirrors the original site's structure
  • Real-time progress tracking through web interface

🤖 RAG System (Retrieval-Augmented Generation)

  • Vector indexing of documents via OpenAI embeddings
  • Multi-collection management with REST API
  • Semantic search for relevant content
  • Intelligent responses using GPT-4o-mini
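The retrieval step at the heart of the RAG system can be sketched as ranking stored chunks by cosine similarity to a query embedding. This is an illustrative sketch, not the project's code; the real system delegates this work to ChromaDB:

```typescript
// Illustrative retrieval sketch: rank embedded chunks by cosine similarity.
type EmbeddedChunk = { text: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return [...chunks]
    .sort((x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, k);
}

const chunks: EmbeddedChunk[] = [
  { text: "install the SDK", embedding: [1, 0] },
  { text: "unrelated note", embedding: [0, 1] },
];
console.log(topK([0.9, 0.1], chunks, 1)[0].text); // "install the SDK"
```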

🌐 Full-Featured Web App ⭐ READY

  • Trial activation form with real-time scraping
  • Collection Selector with documentation grouping
  • Chat Interface with contextual AI responses
  • Progress tracking for scraping operations
  • Consolidation feature for LLM integration

🚀 Quick Start (UPDATED)

🔥 GitHub Launch (Recommended)

# 1. Clone repository
git clone https://github.com/AlexSerbinov/doc-scrapper.git
cd doc-scrapper

# 2. Install dependencies
npm install
cd web-app && npm install && cd ..

# 3. Create .env file
cp config/env-template.txt .env
# Add your OPENAI_API_KEY to .env file

# 4. Build and run (NEW!)
npm run restart  # ⭐ Universal restart of all services

# 5. Open web app
open http://localhost:3000

⚙️ Server Management (EXPANDED)

# === FULL RESTART ===
npm run restart         # ⭐ Universal restart of all services
npm run dev:all         # Alternative launch

# === SELECTIVE RESTART ===
npm run restart:web     # ⭐ NEW: Web app only
npm run restart:rag     # ⭐ NEW: RAG API only
npm run restart:chroma  # ⭐ NEW: ChromaDB only

# === EXCLUDE MODE ===
npm run restart:backend    # ⭐ NEW: Everything except web app
npm run restart:except-web # ⭐ NEW: Everything except web app
npm run restart:except-rag # ⭐ NEW: Everything except RAG API

# === SHUTDOWN ===
npm run stop           # ⭐ Stop everything

# === INDIVIDUAL LAUNCH ===
npm run dev:backend    # ChromaDB + RAG API
npm run web:dev        # Web app only

# === MONITORING ===
npm run health         # RAG API health check
curl http://localhost:8001/collections  # Collection list

📚 Multi-Collection System in Action

Current collections:

📚 Active collections:
├── ai-sdk-dev-docs (6,358 documents)   # AI SDK documentation
├── astro-test (6,216 documents)        # Astro documentation  
└── doc-scrapper-docs (3,178 documents) # Doc Scrapper documentation

How to use:

  1. Open web app: http://localhost:3000
  2. Enter documentation URL in form (or use existing collections)
  3. Wait for scraping and indexing to complete
  4. Select collection in Collection Selector
  5. Ask questions and receive contextual responses

LLM Consolidation:

  1. Select collection in web interface
  2. Click "Consolidate for LLM" button
  3. View consolidated documentation with markdown rendering
  4. Copy or download for use with large LLMs
  5. Statistics: File count, size, token estimation
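The statistics step can be sketched with the common "~4 characters per token" rule of thumb for English text; the actual consolidation feature may use a proper tokenizer, so treat this as an approximation:

```typescript
// Sketch of consolidation statistics; the 4-chars-per-token heuristic
// is a rough approximation, not the project's exact method.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function consolidationStats(files: { path: string; content: string }[]) {
  const combined = files.map((f) => f.content).join("\n");
  return {
    fileCount: files.length,
    sizeBytes: combined.length, // rough: assumes 1 byte per character
    estimatedTokens: estimateTokens(combined),
  };
}

const stats = consolidationStats([
  { path: "a.md", content: "x".repeat(100) },
  { path: "b.md", content: "y".repeat(300) },
]);
console.log(stats); // { fileCount: 2, sizeBytes: 401, estimatedTokens: 101 }
```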

REST API for collections:

# List all collections
curl http://localhost:8001/collections

# Query specific collection
curl -X POST http://localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"message":"How to use AI SDK?", "collectionName":"ai-sdk-dev-docs"}'

# Consolidate collection
curl -X POST http://localhost:8001/consolidate \
  -H "Content-Type: application/json" \
  -d '{"collectionName":"ai-sdk-dev-docs", "projectName":"AI SDK"}'

# Switch collection
curl -X POST http://localhost:8001/switch-collection \
  -H "Content-Type: application/json" \
  -d '{"collectionName":"astro-test"}'
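The same calls can be made from TypeScript. The endpoint paths and request fields below come from the curl examples in this README; the response shape is an assumption and may differ from the actual API:

```typescript
// Minimal client sketch for the RAG API's /query endpoint.
interface QueryRequest {
  message: string;
  collectionName: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildQuery(baseUrl: string, q: QueryRequest): HttpRequest {
  return {
    url: `${baseUrl}/query`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(q),
  };
}

// Send the query; the response shape is not specified by this README,
// so it is returned as unknown.
async function ask(q: QueryRequest): Promise<unknown> {
  const { url, ...init } = buildQuery("http://localhost:8001", q);
  const res = await fetch(url, init);
  return res.json();
}

const req = buildQuery("http://localhost:8001", {
  message: "How to use AI SDK?",
  collectionName: "ai-sdk-dev-docs",
});
console.log(req.url, req.body);
```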

🌐 Web Interface Features

✅ Ready Features:

  • 🎯 Trial Activation Form - real-time scraping of new documentation
  • 📊 Progress Tracking - live updates during processing
  • 🗂️ Collection Selector - choose between documentation projects
  • 💬 Chat Interface - AI assistant with contextual responses
  • 📄 LLM Consolidation - combine documentation for large LLMs
  • 📋 Copy/Download - convenient copying and downloading
  • 📱 Responsive Design - works on mobile and desktop
  • 🔄 Real-time Switching - switch between collections

🚀 Workflow:

URL Input → Progress Bar → Scraping → Indexing → 
Collection Created → Demo Chat → AI Responses → LLM Consolidation

📖 Detailed Usage

CLI Scraping (alternative method)

# Basic scraping via CLI
npm run build  # Compile before use
npm run dev -- "https://docs.example.com"

# With settings
npm run dev -- "https://ai-sdk.dev/docs/introduction" \
  -o ./my-docs \
  -f markdown \
  -m 100 \
  -c 5 \
  -d 1500 \
  --verbose

# IMPORTANT: use a double dash (--) so npm forwards the flags to the script
npm run dev -- "URL" -c 10 -d 2000   # ✅ CORRECT
npm run dev "URL" -c 10              # ❌ INCORRECT: npm swallows the flags

RAG System Commands

# Index documents
npm run rag:index <path-to-docs> [options]

# CLI chat (alternative to web interface)
npm run rag:chat

# Collection statistics
npm run rag:stats
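Conceptually, `rag:index` must split each Markdown file into overlapping chunks before embedding them. The chunk size and overlap below are hypothetical, not the project's defaults:

```typescript
// Illustrative chunking step: fixed-size windows with overlap so that
// sentences spanning a boundary appear in at least one full chunk.
// chunkSize/overlap values here are hypothetical defaults.
function chunkText(text: string, chunkSize = 200, overlap = 50): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step forward, keeping `overlap` chars shared
  }
  return chunks;
}

const doc = "a".repeat(500);
const chunks = chunkText(doc);
console.log(chunks.length); // 3
```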

📁 Project Structure

doc-scrapper/
├── 📜 scripts/             # Operational scripts
│   ├── restart.sh          # Universal service restart
│   ├── stop.sh             # Service shutdown
│   └── clean-collections.* # Collection management
├── 🐳 docker/              # Docker containers
│   ├── docker-compose.yml  # Service orchestration
│   ├── docker-start.sh     # Container startup
│   └── Dockerfile.rag-api  # RAG API container
├── ⚙️ config/              # Configuration files
│   ├── env-template.txt    # Environment template
│   └── .dockerignore       # Docker exclusions
├── 📚 docs/                # Documentation
│   ├── DOCKER_README.md    # Docker deployment guide
│   └── AGENT.md            # AI agent notes
├── 🚀 deployment/          # Production deployment
│   ├── deploy.sh           # Deployment script
│   ├── webhook.js          # GitHub webhook handler
│   └── webhook.service     # Systemd service
├── 📊 logs/                # Application logs
├── 💾 src/                 # Source code
├── 🌐 web-app/             # Next.js frontend
├── 🧠 memory-bank/         # Project context & history
└── 📋 Core files           # README, package.json, etc.

🛠️ Technologies (UPDATED)

Backend Stack

  • TypeScript - Strict typing
  • Node.js v24+ - Runtime
  • ChromaDB - Multi-collection vector DB
  • OpenAI - Embeddings (text-embedding-3-small) + LLM (GPT-4o-mini)
  • AI SDK - Vercel AI SDK v3.0

Frontend Stack ⭐ NEW

  • Next.js 15 - Full-stack framework with App Router
  • TypeScript - Type safety
  • TailwindCSS - Modern styling with dark theme
  • React Server/Client Components - Optimal architecture

DevOps & Scripts ⭐ NEW

  • Universal restart scripts - restart.sh, stop.sh
  • NPM scripts integration - npm run restart, npm run stop
  • Port management - Automatic cleanup of ports 3000, 8000, 8001
  • Process monitoring - Health checks and automatic recovery

🔧 Troubleshooting

Ports occupied:

npm run stop           # Automatically free ports
# or manually:
lsof -ti:3000 | xargs kill -9
lsof -ti:8000 | xargs kill -9  
lsof -ti:8001 | xargs kill -9

ChromaDB issues:

# Check if running
curl http://localhost:8000/api/v1/heartbeat

# Restart ChromaDB
npm run chroma:restart

Web app build errors:

# Next.js build
cd web-app && npm run build

# If utility function errors:
# Check that all utility functions are moved to lib/ folder

📊 Test Results (UPDATED)

Multi-Collection System:

  • 3 active collections with different content
  • Collection-specific queries work correctly
  • UI/UX with expandable selector and real-time switching
  • Web API integration (/api/collections, /api/chat) functional

Performance tests:

  • Form submission → real scraping process
  • Progress tracking → real-time updates
  • RAG indexing → automatic after scraping
  • Chat responses → contextual from correct sources

Test on ai-sdk.dev:

  • 487 URLs found via sitemap.xml
  • 53 seconds scraping time
  • 51 files created (280KB)
  • Collection created automatically with 6,358 documents

🎯 Ready Use Cases

Scenario 1: New User

  1. git clone https://github.com/AlexSerbinov/doc-scrapper.git
  2. npm install && cd web-app && npm install && cd ..
  3. Add OPENAI_API_KEY to .env file
  4. npm run restart
  5. Open http://localhost:3000
  6. Enter documentation URL
  7. Wait for completion and use chat

Scenario 2: Developer

  1. npm run restart - start all services
  2. Work with code
  3. npm run stop - stop everything when finished

Scenario 3: API Testing

# Health check
curl http://localhost:8001/health

# List collections
curl http://localhost:8001/collections | jq .

# Query specific collection
curl -X POST http://localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"message":"test", "collectionName":"ai-sdk-dev-docs"}'

🚧 Roadmap

✅ Completed:

  • Multi-Collection system with UI
  • Web App with full functionality
  • Universal restart scripts
  • Real-time progress tracking
  • Collection-aware chat interface

🔄 Planned:

  • Authentication system with trial limitations
  • CLI collection parameter (--collection-name)
  • Export to single file functionality
  • Embeddable chat widget for websites
  • Production deployment guides

📄 Environment Setup

Required:

# .env file
OPENAI_API_KEY=sk-your-openai-api-key-here

Optional:

# RAG settings
RAG_LLM_MODEL=gpt-4o-mini
RAG_EMBEDDING_MODEL=text-embedding-3-small
RAG_VECTOR_STORE_CONNECTION_STRING=http://localhost:8000

# Web App settings
NEXT_PUBLIC_RAG_API_URL=http://localhost:8001
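These settings are typically read with the values above as fallbacks. The variable names match this README; the helper itself is illustrative, not the project's actual config module:

```typescript
// Illustrative config loader: OPENAI_API_KEY is required, the RAG_*
// settings fall back to the defaults documented above.
function ragConfig(env: Record<string, string | undefined>) {
  if (!env.OPENAI_API_KEY) {
    throw new Error("OPENAI_API_KEY is required (see config/env-template.txt)");
  }
  return {
    apiKey: env.OPENAI_API_KEY,
    llmModel: env.RAG_LLM_MODEL ?? "gpt-4o-mini",
    embeddingModel: env.RAG_EMBEDDING_MODEL ?? "text-embedding-3-small",
    vectorStoreUrl: env.RAG_VECTOR_STORE_CONNECTION_STRING ?? "http://localhost:8000",
  };
}

const cfg = ragConfig({ OPENAI_API_KEY: "sk-test" });
console.log(cfg.llmModel); // "gpt-4o-mini"
```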

🤝 Contributing

  1. Fork the repository: https://github.com/AlexSerbinov/doc-scrapper
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open Pull Request

📄 License

MIT License - see LICENSE file for details.


🎯 The system is ready for production use as a multi-collection documentation AI assistant! 🚀
