20 changes: 20 additions & 0 deletions .claude/settings.local.json
@@ -0,0 +1,20 @@
{
"permissions": {
"allow": [
"WebFetch(domain:modelcontextprotocol.io)",
"WebFetch(domain:blog.cloudflare.com)",
"mcp__grep__searchGitHub",
"Bash(npm run build:*)",
"Bash(npm install:*)",
"Bash(npm start)",
"Bash(TRANSPORT_MODE=http npm start)",
"Bash(TRANSPORT_MODE=http HTTP_PORT=3001 npm start)",
Comment on lines +7 to +11

medium

This pull request migrates the project to use pnpm. For consistency, these Bash commands should be updated from npm to pnpm.

      "Bash(pnpm run build:*)",
      "Bash(pnpm install:*)",
      "Bash(pnpm start)",
      "Bash(TRANSPORT_MODE=http pnpm start)",
      "Bash(TRANSPORT_MODE=http HTTP_PORT=3001 pnpm start)"

"Bash(python test:*)",
"Bash(touch:*)",
"Bash(node:*)",
"Bash(gh pr view:*)",
"WebFetch(domain:github.com)"
],
"deny": []
}
}
64 changes: 64 additions & 0 deletions .dockerignore
@@ -0,0 +1,64 @@
# Node.js
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Build outputs
build/
dist/

# Development files
.env
.env.local
.env.development
.env.test
.env.production

# IDE and editor files
.vscode/
.idea/
*.swp
*.swo
*~

# Operating system files
.DS_Store
Thumbs.db

# Git
.git/
.gitignore

# Documentation and development
README.md
CLAUDE.md
docs/
*.md
test_tools.py
MCP_WebScan_HTTP_Transport_Test_Report.md

# Testing
coverage/
.nyc_output/
test/
tests/
*.test.js
*.spec.js

# Linting
.eslintrc*
.prettierrc*

# CI/CD
.github/
.gitlab-ci.yml
.travis.yml

# Logs
logs/
*.log

# Temporary files
tmp/
temp/
31 changes: 31 additions & 0 deletions .env.example
@@ -0,0 +1,31 @@
# MCP WebScan Server Configuration
# Copy this file to .env and adjust the values for your environment

# Transport Configuration
TRANSPORT_MODE=stdio # stdio (default) | http
HTTP_PORT=3000 # Port for HTTP transport
HTTP_HOST=localhost # Host for HTTP transport (use 0.0.0.0 for Docker)
LOG_LEVEL=info # debug | info | warn | error

# GitHub OAuth Configuration (HTTP Transport Only)
# Create a GitHub OAuth App at: https://github.com/settings/developers
GITHUB_CLIENT_ID=your_github_client_id
GITHUB_CLIENT_SECRET=your_github_client_secret
GITHUB_REDIRECT_URI=http://localhost:3000/auth/github/callback
GITHUB_SCOPES=read:user,user:email # OAuth scopes to request

# Authentication Settings (HTTP Transport Only)
REQUIRE_AUTH=false # Set to true to require authentication
REQUIRED_SCOPES=read:user # Comma-separated list of required scopes

# Session Configuration
SESSION_ENABLED=true # Enable stateful sessions (recommended)
COOKIE_SECRET=your_random_secret # Secret for session management

# CORS Configuration (HTTP Transport Only)
CORS_ORIGINS=* # Comma-separated list of allowed origins
DNS_REBINDING_PROTECTION=false # Enable DNS rebinding protection
ALLOWED_HOSTS=localhost,127.0.0.1 # Comma-separated list of allowed hosts

# Production Environment Variables
NODE_ENV=production # Set to production for deployment
268 changes: 268 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,268 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

A Model Context Protocol (MCP) server for web content scanning and analysis that provides 6 main tools for fetching, analyzing, and extracting information from web pages. Built with TypeScript and designed for compatibility with MCP clients like Claude Desktop. Supports both stdio and HTTP transports with optional GitHub OAuth authentication.

## Document References

- **./docs/llms-full.txt**: contains valuable information for working with the Model Context Protocol
- **./docs/README.md**: contains information on the TypeScript SDK for the Model Context Protocol


## Key Components

- **Core Tools**: fetch-page, extract-links, crawl-site, check-links, find-patterns, generate-site-map
- **Architecture**: Service layer (in `src/services/`) with corresponding tool definitions (in `src/tools/`)
- **Transport**: Dual transport support - stdio (default) and HTTP with Streamable HTTP (MCP 2025-03-26)
- **Authentication**: Optional GitHub OAuth integration for HTTP transport
- **Error Handling**: Custom error classes in `src/utils/errors.ts` with structured JSON logging

## Available Commands

```bash
# Install dependencies
npm install

# Build project (TypeScript compilation)
npm run build

# Development mode (with ts-node)
npm run dev

# Start production server
npm start
Comment on lines +27 to +36


medium

The project has been migrated to pnpm, but the command examples in this document still use npm. To ensure the documentation is accurate and to avoid confusion for developers, please update these commands to use pnpm.

Suggested change
- npm install
- # Build project (TypeScript compilation)
- npm run build
- # Development mode (with ts-node)
- npm run dev
- # Start production server
- npm start
+ # Install dependencies
+ pnpm install
+ # Build project (TypeScript compilation)
+ pnpm run build
+ # Development mode (with ts-node)
+ pnpm run dev
+ # Start production server
+ pnpm start


# Environment variables - Transport Configuration
TRANSPORT_MODE=stdio # stdio (default) | http
HTTP_PORT=3000 # Port for HTTP transport
HTTP_HOST=localhost # Host for HTTP transport
LOG_LEVEL=debug # Set logging level (debug|info|warn|error)

# Optional GitHub OAuth Configuration (for HTTP transport)
GITHUB_CLIENT_ID=your_client_id
GITHUB_CLIENT_SECRET=your_client_secret
GITHUB_REDIRECT_URI=http://localhost:3000/auth/github/callback
GITHUB_SCOPES=read:user,user:email
REQUIRE_AUTH=false # Set to true to require authentication
REQUIRED_SCOPES=read:user # Comma-separated list of required scopes
```

## Architecture Details

### File Structure
```
src/
├── auth/ # Authentication system
│ ├── GitHubOAuth.ts # GitHub OAuth provider implementation
│ └── AuthMiddleware.ts # Express authentication middleware
├── config/ # Configuration management
│ ├── ConfigurationManager.ts # Original configuration manager
│ └── TransportConfig.ts # Transport-specific configuration
├── services/ # Business logic services (6 services)
│ ├── CheckLinksService.ts
│ ├── CrawlSiteService.ts
│ ├── ExtractLinksService.ts
│ ├── FetchPageService.ts
│ ├── FindPatternsService.ts
│ └── GenerateSitemapService.ts
├── tools/ # MCP tool definitions and registration
│ ├── *.ts # Tool implementations for each service
│ ├── *Params.ts # Zod schemas for tool parameters
│ └── index.ts # Central tool registration
├── transports/ # Transport layer implementations
│ ├── HttpTransport.ts # HTTP/Streamable transport with auth
│ └── StdioTransport.ts # Original stdio transport wrapper
├── types/ # TypeScript type definitions
├── utils/ # Shared utilities
│ ├── errors.ts # Custom error classes
│ ├── logger.ts # Structured JSON logging to stderr
│ ├── webUtils.ts # HTTP utilities and htmlToMarkdown
│ ├── markdownConverter.ts # HTML to Markdown conversion
│ └── index.ts
├── initialize.ts # Server initialization
└── index.ts # Main entry point with transport selection
```

### Key Patterns

- **Service Layer**: Each tool has a corresponding service class in `src/services/`
- **Tool Registration**: `registerTools()` in `src/tools/index.ts` handles centralized registration
- **Transport Abstraction**: Dual transport support with automatic selection based on TRANSPORT_MODE
- **Authentication**: Optional OAuth integration for secure remote access
- **Configuration**: Environment-based configuration via TransportConfig and ConfigurationManager
- **Error Handling**: Uses custom error classes (ValidationError, NotFoundError, ServiceError)
- **Logging**: Structured JSON logging to stderr with configurable levels
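
As a rough illustration of environment-based transport selection, the sketch below reads `TRANSPORT_MODE`, `HTTP_PORT`, and `HTTP_HOST` with the defaults documented above. The names `loadTransportSettings`, `TransportSettings`, and `describeTransport` are hypothetical and are not taken from `TransportConfig.ts`; the environment is passed in as a plain object so the function is easy to test.

```typescript
// Hypothetical sketch: these names are illustrative, not the repository's TransportConfig API.
type TransportMode = "stdio" | "http";

interface TransportSettings {
  mode: TransportMode;
  httpPort: number;
  httpHost: string;
}

type Env = Record<string, string | undefined>;

// Accept only the two documented modes; anything else is a configuration error.
function parseMode(raw: string): TransportMode {
  if (raw === "stdio" || raw === "http") return raw;
  throw new Error(`Unsupported TRANSPORT_MODE: ${raw}`);
}

function loadTransportSettings(env: Env): TransportSettings {
  const mode = parseMode(env.TRANSPORT_MODE ?? "stdio"); // stdio is the documented default
  const httpPort = Number(env.HTTP_PORT ?? "3000");
  if (!Number.isInteger(httpPort) || httpPort < 1 || httpPort > 65535) {
    throw new Error(`Invalid HTTP_PORT: ${env.HTTP_PORT}`);
  }
  return { mode, httpPort, httpHost: env.HTTP_HOST ?? "localhost" };
}

// Summarize where a client would connect for the selected transport.
function describeTransport(settings: TransportSettings): string {
  return settings.mode === "http"
    ? `http://${settings.httpHost}:${settings.httpPort}/mcp`
    : "stdio";
}
```

In a Node entry point this would presumably be called as `loadTransportSettings(process.env)`. Failing fast on an unknown mode avoids silently falling back to stdio when the variable is mistyped.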

## Development Workflow

### Stdio Transport (Default)
For Claude Desktop integration, add to `mcpServers` config:
```json
{
"webscan": {
"command": "node",
"args": ["path/to/mcp-server-webscan/build/index.js"],
"env": {"LOG_LEVEL": "info"}
}
}
```

### HTTP Transport (Remote)
For remote/web-based MCP clients:
```bash
# Start HTTP server
TRANSPORT_MODE=http HTTP_PORT=3000 npm start

# With GitHub OAuth (optional)
TRANSPORT_MODE=http \
GITHUB_CLIENT_ID=your_client_id \
GITHUB_CLIENT_SECRET=your_client_secret \
GITHUB_REDIRECT_URI=http://localhost:3000/auth/github/callback \
REQUIRE_AUTH=true \
npm start
```

Client configuration for HTTP transport:
```json
{
"webscan": {
"url": "http://localhost:3000/mcp",
"headers": {
"Authorization": "Bearer github_access_token"
}
}
}
```

### Authentication Endpoints (HTTP Transport)
- `GET /auth/github` - Initiate GitHub OAuth flow
- `GET /auth/github/callback` - OAuth callback handler
- `GET /auth/user` - Get authenticated user info
- `POST /auth/revoke` - Revoke access token
- `GET /health` - Health check with auth status

All tools follow the same pattern: tool parameter validation (Zod) → service method → result formatting → MCP response. Services use axios/fetch for HTTP, cheerio for HTML parsing, and turndown for HTML-to-Markdown conversion.
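
The pattern described above can be sketched as follows. This is a dependency-free illustration, not the repository's code: the project validates parameters with Zod, whereas this sketch substitutes a hand-written check, and `ExtractLinksParams`, `LinkService`, `parseParams`, and `handleExtractLinks` are hypothetical names. The result shape (a `content` array of text items plus an optional `isError` flag) follows the MCP tool result format.

```typescript
// Hypothetical sketch of the flow: validate -> call service -> format -> MCP response.
// A hand-written check stands in for the Zod schema so the example has no dependencies.

interface ExtractLinksParams {
  url: string;
  limit: number;
}

// Shape of an MCP tool result carrying text content.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

// The service is injected so the handler can be exercised without network access.
type LinkService = (params: ExtractLinksParams) => Promise<string[]>;

function parseParams(input: unknown): ExtractLinksParams {
  if (typeof input !== "object" || input === null) {
    throw new Error("params must be an object");
  }
  const { url, limit = 100 } = input as { url?: unknown; limit?: unknown };
  if (typeof url !== "string" || !/^https?:\/\//.test(url)) {
    throw new Error("url must be an http(s) URL");
  }
  if (typeof limit !== "number" || !Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be a positive integer");
  }
  return { url, limit };
}

async function handleExtractLinks(input: unknown, service: LinkService): Promise<ToolResult> {
  try {
    const params = parseParams(input);                     // 1. validate parameters
    const links = await service(params);                   // 2. call the service
    const text = links.slice(0, params.limit).join("\n");  // 3. format the result
    return { content: [{ type: "text", text }] };          // 4. MCP response
  } catch (err) {
    // Report failures as an error result instead of letting them escape the handler.
    const message = err instanceof Error ? err.message : String(err);
    return { content: [{ type: "text", text: message }], isError: true };
  }
}
```

Keeping validation and formatting in the tool layer leaves the service free of MCP-specific types, which matches the split between `src/tools/` and `src/services/` described earlier.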

## Technical Implementation Details

### Streamable HTTP Transport (MCP 2025-03-26)

The HTTP transport implementation follows the Streamable HTTP transport defined in the MCP 2025-03-26 specification:

- **Single Endpoint**: All communication flows through `/mcp` endpoint
- **Bi-directional**: Supports both client-to-server and server-to-client communication
- **Session Management**: Stateful connections with UUID-based session IDs
- **SSE Support**: Server-sent events for real-time notifications
- **Message Size Limits**: Configurable limits (default 4MB) with proper error handling

Key implementation files:
- `src/transports/HttpTransport.ts` - Main HTTP transport implementation
- `src/config/TransportConfig.ts` - Environment-based configuration
- `src/index.ts` - Transport selection and initialization

### GitHub OAuth Integration

OAuth 2.0 authorization code flow against GitHub, with CSRF and token-lifetime safeguards:

**OAuth Flow**:
1. Client initiates flow via `GET /auth/github`
2. Server generates secure state parameter and redirects to GitHub
3. GitHub redirects back to `GET /auth/github/callback` with authorization code
4. Server exchanges code for access token
5. Token stored with expiration and user context
6. Client uses token in `Authorization: Bearer` header

**Security Features**:
- State parameter validation to prevent CSRF attacks
- Token expiration and automatic cleanup
- Scope validation for fine-grained permissions
- Secure session management with configurable timeouts
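
To make the state-parameter check concrete, here is a minimal sketch of a single-use, expiring state store. It is an assumption about how such validation could work, not the code in `src/auth/GitHubOAuth.ts`; the class name `OAuthStateStore`, its methods, and the ten-minute default lifetime are all hypothetical.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical sketch of CSRF state handling for an OAuth redirect flow.
class OAuthStateStore {
  // Maps each issued state value to its expiry time (ms since epoch).
  private readonly pending = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 10 * 60 * 1000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Called before redirecting to the provider: returns an unguessable state value.
  issue(): string {
    const state = randomBytes(32).toString("hex");
    this.pending.set(state, this.now() + this.ttlMs);
    return state;
  }

  // Called in the callback handler: true only for a known, unexpired, unused state.
  consume(state: string): boolean {
    const expiresAt = this.pending.get(state);
    if (expiresAt === undefined) return false;
    this.pending.delete(state); // single use, so a replayed callback is rejected
    return this.now() <= expiresAt;
  }

  // Periodic cleanup of states that were never redeemed; returns how many were removed.
  sweep(): number {
    let removed = 0;
    for (const [state, expiresAt] of this.pending) {
      if (this.now() > expiresAt) {
        this.pending.delete(state);
        removed++;
      }
    }
    return removed;
  }
}
```

The clock is injectable so that expiry can be tested without waiting. An in-memory map like this only works for a single server process, which is consistent with the stateful deployment described in this document.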

Key implementation files:
- `src/auth/GitHubOAuth.ts` - OAuth provider implementation
- `src/auth/AuthMiddleware.ts` - Express middleware for authentication
- Environment variables for configuration (see documentation above)

### Stateful vs Stateless Considerations

**Current Implementation: Stateful**
- Session-based connections with persistent state
- Better for complex multi-step operations (crawling, analysis)
- Enables server-to-client notifications
- Supports connection resume and state recovery
- Automatic resource cleanup on session termination

**When to Use Stateless**:
- Simple request-response operations
- Horizontally scaled deployments
- Stateless microservice architectures
- RESTful API patterns

To switch to stateless mode, set `sessionIdGenerator: undefined` in StreamableHTTPServerTransport options.
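
One way to drive that switch from configuration is sketched below. The helper `buildSessionOptions` and the `SessionOptions` interface are hypothetical; only the `sessionIdGenerator` option name and the `SESSION_ENABLED` variable come from this pull request, and whether the repository wires them together this way is an assumption.

```typescript
import { randomUUID } from "node:crypto";

// Mirrors only the `sessionIdGenerator` option named above; other transport options are omitted.
interface SessionOptions {
  sessionIdGenerator: (() => string) | undefined;
}

// Hypothetical helper: stateful unless SESSION_ENABLED is explicitly "false".
function buildSessionOptions(env: Record<string, string | undefined>): SessionOptions {
  const enabled = (env.SESSION_ENABLED ?? "true").toLowerCase() !== "false";
  return {
    // A generator yields UUID session IDs (stateful); undefined disables sessions (stateless).
    sessionIdGenerator: enabled ? () => randomUUID() : undefined,
  };
}
```

The returned object could then be spread into the options passed to `StreamableHTTPServerTransport`, for example `{ ...buildSessionOptions(process.env) }` alongside whatever other options the transport is given.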

### Transport Decision Matrix

| Feature | Stdio | HTTP (Stateful) | HTTP (Stateless) |
|---------|-------|-----------------|------------------|
| Local Integration | ✅ | ❌ | ❌ |
| Remote Access | ❌ | ✅ | ✅ |
| Authentication | ❌ | ✅ | ✅ |
| Session State | N/A | ✅ | ❌ |
| Notifications | ✅ | ✅ | ❌ |
| Horizontal Scaling | N/A | ❌ | ✅ |
| Claude Desktop | ✅ | ❌ | ❌ |
| Web Clients | ❌ | ✅ | ✅ |

### Error Handling Strategy

**Structured Error Response**:
```json
{
"jsonrpc": "2.0",
"error": {
"code": -32603,
"message": "Descriptive error message",
"data": {
"type": "ValidationError",
"details": "Additional context"
}
},
"id": "request_id"
}
```

**Custom Error Classes**:
- `ValidationError` - Invalid parameters or malformed requests
- `NotFoundError` - Resource not found (404 equivalent)
- `ServiceError` - Internal service errors with context
- `AuthError` - Authentication/authorization failures
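
The sketch below shows one way error classes like these could be mapped onto the structured response shown above. The base class `AppError` and the function `toJsonRpcError` are hypothetical, and only two of the four classes are declared to keep the example short. It uses code `-32603` throughout to match the sample response; masking the message of unrecognized errors is a design choice of this sketch, not documented behavior.

```typescript
// Hypothetical sketch: maps application errors to the JSON-RPC error shape shown above.
class AppError extends Error {
  readonly details?: string;

  constructor(message: string, details?: string) {
    super(message);
    this.name = new.target.name; // e.g. "ValidationError" for a subclass instance
    this.details = details;
  }
}

class ValidationError extends AppError {}
class NotFoundError extends AppError {}

interface JsonRpcErrorResponse {
  jsonrpc: "2.0";
  error: {
    code: number;
    message: string;
    data: { type: string; details?: string };
  };
  id: string | number | null;
}

const INTERNAL_ERROR = -32603; // same code as the sample response above

function toJsonRpcError(err: unknown, id: string | number | null): JsonRpcErrorResponse {
  if (err instanceof AppError) {
    return {
      jsonrpc: "2.0",
      error: {
        code: INTERNAL_ERROR,
        message: err.message,
        data: { type: err.name, details: err.details },
      },
      id,
    };
  }
  // Unrecognized errors: do not leak internal messages to the client.
  return {
    jsonrpc: "2.0",
    error: { code: INTERNAL_ERROR, message: "Internal error", data: { type: "ServiceError" } },
    id,
  };
}
```

For example, `toJsonRpcError(new NotFoundError("page not found"), "42")` yields a response whose `data.type` is `"NotFoundError"`. JSON-RPC 2.0 also reserves `-32602` for invalid parameters, which a fuller mapping might use for `ValidationError`.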

### Logging and Monitoring

**Structured JSON Logging**:
- All logs output to stderr in JSON format
- Configurable log levels (debug, info, warn, error)
- Request correlation IDs for tracing
- Performance metrics and timing information
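
A minimal sketch of level-filtered JSON logging in this style follows. The `createLogger` function and the field names are hypothetical, not the API of `src/utils/logger.ts`. The default sink is `console.error`, which writes to stderr in Node; that matters for the stdio transport, where stdout carries protocol messages.

```typescript
// Hypothetical sketch of a structured logger; not the API of src/utils/logger.ts.
type LogLevel = "debug" | "info" | "warn" | "error";

const LEVEL_ORDER: Record<LogLevel, number> = { debug: 10, info: 20, warn: 30, error: 40 };

type LogFn = (level: LogLevel, message: string, fields?: Record<string, unknown>) => void;

function createLogger(
  minLevel: LogLevel,
  // The sink is injectable for testing; by default lines go to stderr via console.error.
  write: (line: string) => void = (line) => console.error(line),
): LogFn {
  return (level, message, fields = {}) => {
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return; // below the configured level
    // One JSON object per line; extra fields (e.g. a request ID) are merged in.
    write(JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...fields }));
  };
}
```

A call such as `log("warn", "slow fetch", { requestId: "req-1", durationMs: 1200 })` would emit a single JSON line carrying the correlation ID and timing alongside the level and message.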

**Health Check Endpoint**:
```json
{
"status": "healthy",
"timestamp": "2025-07-19T19:10:08.473Z",
"transport": "streamable-http",
"auth": {
"isConfigured": true,
"requireAuth": false,
"activeTokens": 3,
"requiredScopes": ["read:user"],
"excludePaths": ["/health", "/auth"]
},
"activeSessions": 2
}
```