πŸ•ΈοΈ Scrapereq

A powerful and flexible web scraping API built with Express.js and Puppeteer


Website β€’ API Documentation β€’ Installation β€’ Features β€’ Contributing

πŸ“‹ Overview

Scrapereq is a RESTful API service that allows you to perform web scraping operations by defining a series of steps executed by a headless browser. It provides a clean and secure way to extract data from websites with advanced features like proxy support, customizable scraping speeds, robust validation, and error handling.

✨ Features

Core Capabilities

  • πŸ”„ Step-Based Scraping: Define your scraping workflow as a series of steps (navigate, click, wait, setViewport, etc.)
  • ⚑ Speed Control: Multiple speed modes (TURBO, FAST, NORMAL, SLOW, SLOWEST, CRAWL, STEALTH)
  • πŸ” Selector Support: Extract data using CSS, XPath, or full page HTML selectors
  • βœ… Enhanced Validation: Comprehensive request validation with clear error messages

Security & Reliability

  • πŸ” Built-in Security: Basic authentication, helmet protection, and CORS configuration
  • 🌐 Enhanced Proxy Support: Advanced proxy configuration with authentication and multiple proxy rotation
  • πŸ›‘οΈ Error Handling: Consistent JSON error responses with contextual details and optional stack traces for debugging
  • πŸ’ͺ Browser Resilience: Automatic disconnection detection and resource management

Advanced Features

  • πŸ“Έ Screenshot Capabilities: Capture success and error screenshots with configurable options
  • πŸ“Š API Monitoring: Detailed health check endpoint with system information
  • πŸ“ Swagger Documentation: Interactive API documentation with detailed request/response examples
  • πŸ”§ System Controls: Application shutdown and OS restart endpoints
  • πŸ’Ύ Persistent Storage: Configurable screenshot directory for persistent storage across deployments
  • 🧹 Automatic Cleanup: Automated cleanup of old screenshot files
  • πŸ“ˆ Performance Metrics: Track and analyze scraping performance with detailed metrics
  • πŸ” Retry Mechanism: Intelligent retry functionality for handling transient errors
  • πŸ› οΈ CLI Utilities: User-friendly command-line interface for development and deployment

πŸ› οΈ Tech Stack

  • πŸ“¦ Node.js: JavaScript runtime
  • πŸš€ Express.js v5.1.0: Web application framework
  • πŸ€– Puppeteer v24.8.0: Headless Chrome browser automation
  • 🧩 Puppeteer-Extra v3.3.6: Plugin system for Puppeteer
  • ⏺️ @puppeteer/replay v3.1.1: Record and replay browser interactions
  • βœ… Joi v17.13.3: Request validation
  • πŸ“ Morgan: HTTP request logging
  • πŸ›‘οΈ Helmet v8.1.0: Security middleware
  • πŸ“š Swagger-JSDoc v6.2.8: API documentation generation
  • 🌐 Swagger-UI-Express v5.0.1: Interactive API documentation
  • 🌐 CORS: Cross-Origin Resource Sharing support
  • βš™οΈ dotenv v16.5.0: Environment configuration
  • πŸ§ͺ Jest & Supertest: Testing framework

πŸš€ Installation

  1. Clone the repository:

    git clone https://github.com/erdinccurebal/scrapereq.git
    cd scrapereq
  2. Install dependencies:

    npm install
  3. Create a configuration file:

    Create a .env file in the root directory based on the following template:

    # Server Configuration
    PORT=3000
    HOST=localhost
    NODE_ENV=development
    WEB_ADDRESS=http://localhost:3000
    
    # Authentication
    AUTH_USERNAME=admin
    AUTH_PASSWORD=secretpassword
    
    # Puppeteer Configuration
    CHROME_PATH=/path/to/chrome # Optional custom Chrome path
    
    # File Storage
    TMP_DIR=/path/to/persistent/directory # Optional: defaults to ./tmp
    
    # Browser Concurrency
    MAX_CONCURRENT_BROWSERS=2 # Number of concurrent browser instances
    
    # Rate Limiting
    RATE_LIMIT_WINDOW_MS=900000 # 15 minutes in milliseconds
    RATE_LIMIT_MAX_REQUESTS=100 # Maximum requests per window
    
    # Proxy Configuration (Optional)
    SCRAPE_PROXY_BYPASS_CODE=your_secure_password # Password to bypass proxy requirement
  4. Start the application:

    npm start
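
How the values in the .env template above get picked up can be sketched as a small configuration loader. The variable names match the template; the defaults and object shape here are illustrative, not the project's actual config.js:

```javascript
// Illustrative config loader for the .env template above.
// Variable names come from the template; defaults are assumptions.
const config = {
  port: Number(process.env.PORT || 3000),
  host: process.env.HOST || 'localhost',
  nodeEnv: process.env.NODE_ENV || 'development',
  maxConcurrentBrowsers: Number(process.env.MAX_CONCURRENT_BROWSERS || 2),
};

// Derive the public address when WEB_ADDRESS is not set explicitly.
config.webAddress =
  process.env.WEB_ADDRESS || `http://${config.host}:${config.port}`;

console.log(config.webAddress);
```

In practice dotenv (listed in the tech stack) would populate `process.env` from the .env file before this code runs.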

🐳 Docker Deployment

You can run the application with Docker:

# Build the Docker image
npm run docker:build

# Run the container
npm run docker:run

Or use the provided docker-compose.yml:

docker-compose up -d

πŸ”Œ API Endpoints

The API provides the following main endpoints:

πŸ” Health Check

GET /api/app/health

Returns detailed system information and checks if all components are working correctly.
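
Since the API is protected by basic authentication (see Features), each request needs an Authorization header. A minimal sketch of building one in Node, using the example credentials from the .env template (the endpoint URL assumes the default port):

```javascript
// Build an HTTP Basic Authorization header for the API.
// Username/password here are the example values from the .env template.
function basicAuthHeader(username, password) {
  const token = Buffer.from(`${username}:${password}`).toString('base64');
  return `Basic ${token}`;
}

// Example: attach the header to a health-check request. fetch is built
// into Node.js 18+; the network call is left commented out here.
const headers = { Authorization: basicAuthHeader('admin', 'secretpassword') };
// fetch('http://localhost:3000/api/app/health', { headers })
//   .then((res) => res.json())
//   .then(console.log);
```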

πŸ•ΈοΈ Scraper

POST /api/scrape/start

Main endpoint for web scraping operations. Configure your scraping workflow with a detailed JSON structure.

Example Request:

{
  "proxy": {
    "bypassCode": "your_secure_password",
    "auth": {
      "enabled": true,
      "username": "proxyuser",
      "password": "proxypass"
    },
    "servers": [
      {
        "server": "proxy1.example.com",
        "port": 8080
      },
      {
        "server": "proxy2.example.com",
        "port": 8081
      }
    ]
  },
  "record": {
    "title": "Google Search Example",
    "speedMode": "NORMAL",
    "timeoutMode": "NORMAL",
    "steps": [
      {
        "type": "navigate",
        "url": "https://www.google.com"
      },
      {
        "type": "wait",
        "value": "1000"
      },
      {
        "type": "setViewport",
        "width": 1366,
        "height": 768
      },
      {
        "type": "click",
        "selectors": [["#L2AGLb"]]
      },
      {
        "type": "change",
        "selectors": [["input[name='q']"]],
        "value": "web scraping api"
      },
      {
        "type": "click",
        "selectors": [["input[name='btnK']"]]
      },
      {
        "type": "waitForElement",
        "selectors": [["#search"]]
      }
    ]
  },
  "capture": {
    "selectors": [
      {
        "key": "search_results",
        "type": "CSS",
        "value": "#search"
      },
      {
        "key": "page_title",
        "type": "CSS",
        "value": "title"
      }
    ]
  },
  "headers": {
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
  },
  "output": {
    "screenshots": {
      "onError": true,
      "onSuccess": true
    },
    "responseType": "JSON"
  }
}
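
A stripped-down version of the request above can be assembled and sent from Node like this. The endpoint path and field names are taken from this README; the URL, selector, and title are placeholder data:

```javascript
// Minimal scrape request using the fields shown in the example above.
// Endpoint path and field names come from the README; the rest is
// placeholder data for illustration.
function buildScrapeRequest(url, selector) {
  return {
    record: {
      title: 'Minimal example',
      speedMode: 'NORMAL',
      steps: [{ type: 'navigate', url }],
    },
    capture: {
      selectors: [{ key: 'result', type: 'CSS', value: selector }],
    },
    output: { responseType: 'JSON' },
  };
}

const body = buildScrapeRequest('https://example.com', 'title');
// fetch('http://localhost:3000/api/scrape/start', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(body),
// }).then((res) => res.json()).then(console.log);
```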

πŸ§ͺ Test Endpoint

POST /api/scrape/test

Runs a predefined scraping test using a fixed configuration. This endpoint is useful for:

  • Testing if the scraping service is working correctly
  • Checking proxy connectivity
  • Validating browser functionality

The test endpoint uses a predefined configuration from constants.js with a sample scrape request that checks your IP address using a proxied connection.

πŸ”§ System Management

POST /api/app/shutdown

Safely shuts down the application.

POST /api/os/restart

Initiates an operating system restart (requires appropriate permissions).

πŸ“š Documentation

For complete API documentation, visit the Swagger UI endpoint after starting the application:

http://localhost:3000/api/docs

πŸ” Selector Types

Data can be extracted using different selector methods:

  • CSS: Standard CSS selectors
  • XPATH: XPath expressions
  • FULL: Retrieves the full page HTML content
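
One capture entry per type might look like the sketch below. The key/type/value shape follows the capture examples in this README; the specific selectors are illustrative, and the assumption that FULL ignores its value is mine:

```javascript
// Illustrative capture selectors, one per supported type.
const SELECTOR_TYPES = ['CSS', 'XPATH', 'FULL'];

const selectors = [
  { key: 'heading', type: 'CSS', value: 'h1' },
  { key: 'first_link', type: 'XPATH', value: '//a[1]/@href' },
  { key: 'page', type: 'FULL', value: '' }, // value presumably unused for FULL
];

// Simple client-side sanity check before sending a request.
function isValidSelector(s) {
  return SELECTOR_TYPES.includes(s.type) && typeof s.key === 'string';
}

console.log(selectors.every(isValidSelector)); // true
```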

πŸ”„ Response Types

The scraper supports multiple response formats:

  • JSON: Returns structured JSON with data and metadata
  • RAW: Returns raw content from the first selector
  • NONE: No response content (useful for headless operations)
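
A sketch of how a client could interpret the three response types; the exact server-side shapes may differ in detail:

```javascript
// Illustrative client-side handling of the three response types.
function shapeResponse(responseType, results) {
  switch (responseType) {
    case 'JSON': // structured data plus metadata
      return { success: true, data: results };
    case 'RAW': // raw content of the first selector only
      return Object.values(results)[0];
    case 'NONE': // no body, e.g. fire-and-forget jobs
      return null;
    default:
      throw new Error(`Unknown responseType: ${responseType}`);
  }
}

console.log(shapeResponse('RAW', { title: 'Example Domain' })); // 'Example Domain'
```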

⚠️ Error Handling

The API implements a consistent error handling pattern:

  • Standardized Format: All errors return a consistent JSON structure
  • Contextual Information: Includes error code, message, and related data
  • Debug Support: Stack traces included in development mode
  • Visual Evidence: Error screenshots for visual debugging
  • Step Identification: Clear indication of which step in the process failed
  • Proxy Errors: Detailed information about proxy-related issues

Example error response:

{
  "success": false,
  "data": {
    "message": "Failed to execute click operation on element",
    "code": "ERROR_ELEMENT_NOT_FOUND",
    "stepIndex": 3,
    "screenshotUrl": "/tmp/error-screenshot-123456.png"
  }
}
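
On the client side, that error shape can be surfaced like this. The field names (success, data.code, data.message, data.stepIndex) come from the example above; the formatting helper is illustrative:

```javascript
// Sketch of handling the error shape shown above on the client side.
function describeFailure(response) {
  if (response.success) return null;
  const { code, message, stepIndex } = response.data;
  return `Step ${stepIndex} failed with ${code}: ${message}`;
}

const sample = {
  success: false,
  data: {
    message: 'Failed to execute click operation on element',
    code: 'ERROR_ELEMENT_NOT_FOUND',
    stepIndex: 3,
  },
};
console.log(describeFailure(sample));
// → "Step 3 failed with ERROR_ELEMENT_NOT_FOUND: Failed to execute click operation on element"
```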

πŸ› οΈ CLI Startup Options

The project includes several command-line utility scripts:

# Start the application
npm start

# Run tests
npm test

# Lint code
npm run lint

# Fix linting issues
npm run lint:fix

# Format code with Prettier
npm run format

# Docker operations
npm run docker:build   # Build Docker image
npm run docker:run     # Run Docker container

πŸ“ Project Structure

.
β”œβ”€β”€ index.js                # Entry point
β”œβ”€β”€ docker-compose.yml      # Docker Compose configuration
β”œβ”€β”€ Dockerfile              # Docker configuration
β”œβ”€β”€ src/                    # Application source code
β”‚   β”œβ”€β”€ app.js              # Express app configuration
β”‚   β”œβ”€β”€ config.js           # Configuration module
β”‚   β”œβ”€β”€ constants.js        # Constants and enums
β”‚   β”œβ”€β”€ controllers/        # Request handlers
β”‚   β”‚   β”œβ”€β”€ error-handler.js # Global error handling middleware
β”‚   β”‚   └── api/            # API controllers
β”‚   β”œβ”€β”€ helpers/            # Helper functions
β”‚   β”‚   β”œβ”€β”€ browser-semaphore.js   # Browser instance management
β”‚   β”‚   β”œβ”€β”€ cleanup-screenshots.js # Screenshot cleanup utility
β”‚   β”‚   β”œβ”€β”€ do-scraping.js         # Main scraping logic
β”‚   β”‚   β”œβ”€β”€ proxies-random-get-one.js # Proxy rotation utility
β”‚   β”‚   β”œβ”€β”€ scrape-validate-req-body.js # Request validation
β”‚   β”‚   └── validators.js          # Schema validation definitions
β”‚   β”œβ”€β”€ routes/             # API route definitions
β”‚   └── utils/              # Utility middleware
β”œβ”€β”€ __tests__/              # Test files
└── tmp/                    # Temporary files directory

🀝 Contributing

Contributions are welcome! Please read our Contributing Guide before submitting a Pull Request.

By participating in this project, you agree to abide by our Code of Conduct.

πŸ”’ Security

For security vulnerabilities, please see our Security Policy. Do not open a public issue for security concerns.

πŸ“„ License

This project is licensed under the ISC License - see the LICENSE file for details.
