πŸ•ΈοΈ Scrapereq

A powerful and flexible web scraping API built with Express.js and Puppeteer


Website β€’ API Documentation β€’ Installation β€’ Features β€’ Contributing

πŸ“‹ Overview

Scrapereq is a RESTful API service that allows you to perform web scraping operations by defining a series of steps executed by a headless browser. It provides a clean and secure way to extract data from websites with advanced features like proxy support, customizable scraping speeds, robust validation, and error handling.

✨ Features

Core Capabilities

  • πŸ”„ Step-Based Scraping: Define your scraping workflow as a series of steps (navigate, click, wait, setViewport, etc.)
  • ⚑ Speed Control: Multiple speed modes (TURBO, FAST, NORMAL, SLOW, SLOWEST, CRAWL, STEALTH)
  • πŸ” Selector Support: Extract data using CSS, XPath, or full page HTML selectors
  • βœ… Enhanced Validation: Comprehensive request validation with clear error messages

Security & Reliability

  • πŸ” Built-in Security: Basic authentication, helmet protection, and CORS configuration
  • 🌐 Enhanced Proxy Support: Advanced proxy configuration with authentication and multiple proxy rotation
  • πŸ›‘οΈ Error Handling: Consistent JSON error responses with contextual details and optional stack traces for debugging
  • πŸ’ͺ Browser Resilience: Automatic disconnection detection and resource management

Advanced Features

  • πŸ“Έ Screenshot Capabilities: Capture success and error screenshots with configurable options
  • πŸ“Š API Monitoring: Detailed health check endpoint with system information
  • πŸ“ Swagger Documentation: Interactive API documentation with detailed request/response examples
  • πŸ”§ System Controls: Application shutdown and OS restart endpoints
  • πŸ’Ύ Persistent Storage: Configurable screenshot directory for persistent storage across deployments
  • 🧹 Automatic Cleanup: Automated cleanup of old screenshot files
  • πŸ“ˆ Performance Metrics: Track and analyze scraping performance with detailed metrics
  • πŸ” Retry Mechanism: Intelligent retry functionality for handling transient errors
  • πŸ› οΈ CLI Utilities: User-friendly command-line interface for development and deployment

πŸ› οΈ Tech Stack

  • πŸ“¦ Node.js: JavaScript runtime
  • πŸš€ Express.js v5.1.0: Web application framework
  • πŸ€– Puppeteer v24.8.0: Headless Chrome browser automation
  • 🧩 Puppeteer-Extra v3.3.6: Plugin system for Puppeteer
  • ⏺️ @puppeteer/replay v3.1.1: Record and replay browser interactions
  • βœ… Joi v17.13.3: Request validation
  • πŸ“ Morgan: HTTP request logging
  • πŸ›‘οΈ Helmet v8.1.0: Security middleware
  • πŸ“š Swagger-JSDoc v6.2.8: API documentation generation
  • 🌐 Swagger-UI-Express v5.0.1: Interactive API documentation
  • 🌐 CORS: Cross-Origin Resource Sharing support
  • βš™οΈ dotenv v16.5.0: Environment configuration
  • πŸ§ͺ Jest & Supertest: Testing framework

πŸš€ Installation

  1. Clone the repository:

    git clone https://github.com/erdinccurebal/scrapereq.git
    cd scrapereq
  2. Install dependencies:

    npm install
  3. Create a configuration file:

    Create a .env file in the root directory based on the following template:

    # Server Configuration
    PORT=3000
    HOST=localhost
    NODE_ENV=development
    WEB_ADDRESS=http://localhost:3000
    
    # Authentication
    AUTH_USERNAME=admin
    AUTH_PASSWORD=secretpassword
    
    # Puppeteer Configuration
    CHROME_PATH=/path/to/chrome # Optional custom Chrome path
    
    # File Storage
    TMP_DIR=/path/to/persistent/directory # Optional: defaults to ./tmp
    
    # Browser Concurrency
    MAX_CONCURRENT_BROWSERS=2 # Number of concurrent browser instances
    
    # Rate Limiting
    RATE_LIMIT_WINDOW_MS=900000 # 15 minutes in milliseconds
    RATE_LIMIT_MAX_REQUESTS=100 # Maximum requests per window
    
    # Proxy Configuration (Optional)
    SCRAPE_PROXY_BYPASS_CODE=your_secure_password # Password to bypass proxy requirement
  4. Start the application:

    npm start
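
How the values in the .env template above get picked up can be sketched as a small configuration loader. The variable names match the template; the defaults and object shape here are illustrative, not the project's actual config.js:

```javascript
// Illustrative config loader for the .env template above.
// Variable names come from the template; defaults are assumptions.
const config = {
  port: Number(process.env.PORT || 3000),
  host: process.env.HOST || 'localhost',
  nodeEnv: process.env.NODE_ENV || 'development',
  maxConcurrentBrowsers: Number(process.env.MAX_CONCURRENT_BROWSERS || 2),
};

// Derive the public address when WEB_ADDRESS is not set explicitly.
config.webAddress =
  process.env.WEB_ADDRESS || `http://${config.host}:${config.port}`;

console.log(config.webAddress);
```

In practice dotenv (listed in the tech stack) would populate `process.env` from the .env file before this code runs.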

🐳 Docker Deployment

You can run the application with Docker:

# Build the Docker image
npm run docker:build

# Run the container
npm run docker:run

Or use the provided docker-compose.yml:

docker-compose up -d

πŸ”Œ API Endpoints

The API provides the following main endpoints:

πŸ” Health Check

GET /api/app/health

Returns detailed system information and checks if all components are working correctly.
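
Since the API is protected by basic authentication (see Features), each request needs an Authorization header. A minimal sketch of building one in Node, using the example credentials from the .env template (the endpoint URL assumes the default port):

```javascript
// Build an HTTP Basic Authorization header for the API.
// Username/password here are the example values from the .env template.
function basicAuthHeader(username, password) {
  const token = Buffer.from(`${username}:${password}`).toString('base64');
  return `Basic ${token}`;
}

// Example: attach the header to a health-check request. fetch is built
// into Node.js 18+; the network call is left commented out here.
const headers = { Authorization: basicAuthHeader('admin', 'secretpassword') };
// fetch('http://localhost:3000/api/app/health', { headers })
//   .then((res) => res.json())
//   .then(console.log);
```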

πŸ•ΈοΈ Scraper

POST /api/scrape/start

Main endpoint for web scraping operations. Configure your scraping workflow with a detailed JSON structure.

Example Request:

{
  "proxy": {
    "bypassCode": "your_secure_password",
    "auth": {
      "enabled": true,
      "username": "proxyuser",
      "password": "proxypass"
    },
    "servers": [
      {
        "server": "proxy1.example.com",
        "port": 8080
      },
      {
        "server": "proxy2.example.com",
        "port": 8081
      }
    ]
  },
  "record": {
    "title": "Google Search Example",
    "speedMode": "NORMAL",
    "timeoutMode": "NORMAL",
    "steps": [
      {
        "type": "navigate",
        "url": "https://www.google.com"
      },
      {
        "type": "wait",
        "value": "1000"
      },
      {
        "type": "setViewport",
        "width": 1366,
        "height": 768
      },
      {
        "type": "click",
        "selectors": [["#L2AGLb"]]
      },
      {
        "type": "change",
        "selectors": [["input[name='q']"]],
        "value": "web scraping api"
      },
      {
        "type": "click",
        "selectors": [["input[name='btnK']"]]
      },
      {
        "type": "waitForElement",
        "selectors": [["#search"]]
      }
    ]
  },
  "capture": {
    "selectors": [
      {
        "key": "search_results",
        "type": "CSS",
        "value": "#search"
      },
      {
        "key": "page_title",
        "type": "CSS",
        "value": "title"
      }
    ]
  },
  "headers": {
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
  },
  "output": {
    "screenshots": {
      "onError": true,
      "onSuccess": true
    },
    "responseType": "JSON"
  }
}
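
A stripped-down version of the request above can be assembled and sent from Node like this. The endpoint path and field names are taken from this README; the URL, selector, and title are placeholder data:

```javascript
// Minimal scrape request using the fields shown in the example above.
// Endpoint path and field names come from the README; the rest is
// placeholder data for illustration.
function buildScrapeRequest(url, selector) {
  return {
    record: {
      title: 'Minimal example',
      speedMode: 'NORMAL',
      steps: [{ type: 'navigate', url }],
    },
    capture: {
      selectors: [{ key: 'result', type: 'CSS', value: selector }],
    },
    output: { responseType: 'JSON' },
  };
}

const body = buildScrapeRequest('https://example.com', 'title');
// fetch('http://localhost:3000/api/scrape/start', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(body),
// }).then((res) => res.json()).then(console.log);
```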

πŸ§ͺ Test Endpoint

POST /api/scrape/test

Runs a predefined scraping test using a fixed configuration. This endpoint is useful for:

  • Testing if the scraping service is working correctly
  • Checking proxy connectivity
  • Validating browser functionality

The test endpoint uses a predefined configuration from constants.js with a sample scrape request that checks your IP address using a proxied connection.

πŸ”§ System Management

POST /api/app/shutdown

Safely shuts down the application.

POST /api/os/restart

Initiates an operating system restart (requires appropriate permissions).

πŸ“š Documentation

For complete API documentation, visit the Swagger UI endpoint after starting the application:

http://localhost:3000/api/docs

πŸ” Selector Types

Data can be extracted using different selector methods:

  • CSS: Standard CSS selectors
  • XPATH: XPath expressions
  • FULL: Retrieves the full page HTML content
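
One capture entry per type might look like the sketch below. The key/type/value shape follows the capture examples in this README; the specific selectors are illustrative, and the assumption that FULL ignores its value is mine:

```javascript
// Illustrative capture selectors, one per supported type.
const SELECTOR_TYPES = ['CSS', 'XPATH', 'FULL'];

const selectors = [
  { key: 'heading', type: 'CSS', value: 'h1' },
  { key: 'first_link', type: 'XPATH', value: '//a[1]/@href' },
  { key: 'page', type: 'FULL', value: '' }, // value presumably unused for FULL
];

// Simple client-side sanity check before sending a request.
function isValidSelector(s) {
  return SELECTOR_TYPES.includes(s.type) && typeof s.key === 'string';
}

console.log(selectors.every(isValidSelector)); // true
```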

πŸ”„ Response Types

The scraper supports multiple response formats:

  • JSON: Returns structured JSON with data and metadata
  • RAW: Returns raw content from the first selector
  • NONE: No response content (useful for headless operations)
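
A sketch of how a client could interpret the three response types; the exact server-side shapes may differ in detail:

```javascript
// Illustrative client-side handling of the three response types.
function shapeResponse(responseType, results) {
  switch (responseType) {
    case 'JSON': // structured data plus metadata
      return { success: true, data: results };
    case 'RAW': // raw content of the first selector only
      return Object.values(results)[0];
    case 'NONE': // no body, e.g. fire-and-forget jobs
      return null;
    default:
      throw new Error(`Unknown responseType: ${responseType}`);
  }
}

console.log(shapeResponse('RAW', { title: 'Example Domain' })); // 'Example Domain'
```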

⚠️ Error Handling

The API implements a consistent error handling pattern:

  • Standardized Format: All errors return a consistent JSON structure
  • Contextual Information: Includes error code, message, and related data
  • Debug Support: Stack traces included in development mode
  • Visual Evidence: Error screenshots for visual debugging
  • Step Identification: Clear indication of which step in the process failed
  • Proxy Errors: Detailed information about proxy-related issues

Example error response:

{
  "success": false,
  "data": {
    "message": "Failed to execute click operation on element",
    "code": "ERROR_ELEMENT_NOT_FOUND",
    "stepIndex": 3,
    "screenshotUrl": "/tmp/error-screenshot-123456.png"
  }
}
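
On the client side, that error shape can be surfaced like this. The field names (success, data.code, data.message, data.stepIndex) come from the example above; the formatting helper is illustrative:

```javascript
// Sketch of handling the error shape shown above on the client side.
function describeFailure(response) {
  if (response.success) return null;
  const { code, message, stepIndex } = response.data;
  return `Step ${stepIndex} failed with ${code}: ${message}`;
}

const sample = {
  success: false,
  data: {
    message: 'Failed to execute click operation on element',
    code: 'ERROR_ELEMENT_NOT_FOUND',
    stepIndex: 3,
  },
};
console.log(describeFailure(sample));
// → "Step 3 failed with ERROR_ELEMENT_NOT_FOUND: Failed to execute click operation on element"
```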

πŸ› οΈ CLI Startup Options

The project includes several command-line utility scripts:

# Start the application
npm start

# Run tests
npm test

# Lint code
npm run lint

# Fix linting issues
npm run lint:fix

# Format code with Prettier
npm run format

# Docker operations
npm run docker:build   # Build Docker image
npm run docker:run     # Run Docker container

πŸ“ Project Structure

.
β”œβ”€β”€ index.js                # Entry point
β”œβ”€β”€ docker-compose.yml      # Docker Compose configuration
β”œβ”€β”€ Dockerfile              # Docker configuration
β”œβ”€β”€ src/                    # Application source code
β”‚   β”œβ”€β”€ app.js              # Express app configuration
β”‚   β”œβ”€β”€ config.js           # Configuration module
β”‚   β”œβ”€β”€ constants.js        # Constants and enums
β”‚   β”œβ”€β”€ controllers/        # Request handlers
β”‚   β”‚   β”œβ”€β”€ error-handler.js # Global error handling middleware
β”‚   β”‚   └── api/            # API controllers
β”‚   β”œβ”€β”€ helpers/            # Helper functions
β”‚   β”‚   β”œβ”€β”€ browser-semaphore.js   # Browser instance management
β”‚   β”‚   β”œβ”€β”€ cleanup-screenshots.js # Screenshot cleanup utility
β”‚   β”‚   β”œβ”€β”€ do-scraping.js         # Main scraping logic
β”‚   β”‚   β”œβ”€β”€ proxies-random-get-one.js # Proxy rotation utility
β”‚   β”‚   β”œβ”€β”€ scrape-validate-req-body.js # Request validation
β”‚   β”‚   └── validators.js          # Schema validation definitions
β”‚   β”œβ”€β”€ routes/             # API route definitions
β”‚   └── utils/              # Utility middleware
β”œβ”€β”€ __tests__/              # Test files
└── tmp/                    # Temporary files directory

🀝 Contributing

Contributions are welcome! Please read our Contributing Guide before submitting a Pull Request.

By participating in this project, you agree to abide by our Code of Conduct.

πŸ”’ Security

For security vulnerabilities, please see our Security Policy. Do not open a public issue for security concerns.

πŸ“„ License

This project is licensed under the ISC License - see the LICENSE file for details.
