Website • API Documentation • Installation • Features • Contributing
Scrapereq is a RESTful API service that allows you to perform web scraping operations by defining a series of steps executed by a headless browser. It provides a clean and secure way to extract data from websites with advanced features like proxy support, customizable scraping speeds, robust validation, and error handling.
## Features

- Step-Based Scraping: Define your scraping workflow as a series of steps (navigate, click, wait, setViewport, etc.)
- Speed Control: Multiple speed modes (TURBO, FAST, NORMAL, SLOW, SLOWEST, CRAWL, STEALTH)
- Selector Support: Extract data using CSS, XPath, or full-page HTML selectors
- Enhanced Validation: Comprehensive request validation with clear error messages
- Built-in Security: Basic authentication, Helmet protection, and CORS configuration
- Enhanced Proxy Support: Advanced proxy configuration with authentication and rotation across multiple proxies
- Error Handling: Consistent JSON error responses with contextual details and optional stack traces for debugging
- Browser Resilience: Automatic disconnection detection and resource management
- Screenshot Capabilities: Capture success and error screenshots with configurable options
- API Monitoring: Detailed health check endpoint with system information
- Swagger Documentation: Interactive API documentation with detailed request/response examples
- System Controls: Application shutdown and OS restart endpoints
- Persistent Storage: Configurable screenshot directory that persists across deployments
- Automatic Cleanup: Automated cleanup of old screenshot files
- Performance Metrics: Track and analyze scraping performance with detailed metrics
- Retry Mechanism: Intelligent retry functionality for handling transient errors
- CLI Utilities: User-friendly command-line interface for development and deployment
## Tech Stack

- Node.js: JavaScript runtime
- Express.js v5.1.0: Web application framework
- Puppeteer v24.8.0: Headless Chrome browser automation
- Puppeteer-Extra v3.3.6: Plugin system for Puppeteer
- @puppeteer/replay v3.1.1: Record and replay browser interactions
- Joi v17.13.3: Request validation
- Morgan: HTTP request logging
- Helmet v8.1.0: Security middleware
- Swagger-JSDoc v6.2.8: API documentation generation
- Swagger-UI-Express v5.0.1: Interactive API documentation
- CORS: Cross-Origin Resource Sharing support
- dotenv v16.5.0: Environment configuration
- Jest & Supertest: Testing framework
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/erdinccurebal/scrapereq.git
   cd scrapereq
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Create a configuration file:

   Create a `.env` file in the root directory based on the following template:

   ```env
   # Server Configuration
   PORT=3000
   HOST=localhost
   NODE_ENV=development
   WEB_ADDRESS=http://localhost:3000

   # Authentication
   AUTH_USERNAME=admin
   AUTH_PASSWORD=secretpassword

   # Puppeteer Configuration
   CHROME_PATH=/path/to/chrome # Optional custom Chrome path

   # File Storage
   TMP_DIR=/path/to/persistent/directory # Optional: defaults to ./tmp

   # Browser Concurrency
   MAX_CONCURRENT_BROWSERS=2 # Number of concurrent browser instances

   # Rate Limiting
   RATE_LIMIT_WINDOW_MS=900000 # 15 minutes in milliseconds
   RATE_LIMIT_MAX_REQUESTS=100 # Maximum requests per window

   # Proxy Configuration (Optional)
   SCRAPE_PROXY_BYPASS_CODE=your_secure_password # Password to bypass proxy requirement
   ```

4. Start the application:

   ```bash
   npm start
   ```
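Once the server is up, a quick smoke test against the health endpoint confirms the configuration works. This is a sketch that assumes the sample `PORT`, `AUTH_USERNAME`, and `AUTH_PASSWORD` values from the `.env` template above; adjust them to match your own setup.

```shell
#!/bin/sh
# Smoke test for a freshly started Scrapereq instance (a sketch).
# Basic-auth credentials are the sample .env values and are included
# in case the health endpoint is protected in your configuration.
BASE_URL="${BASE_URL:-http://localhost:3000}"

health_check() {
  if curl -fsS -u admin:secretpassword "$BASE_URL/api/app/health"; then
    echo "health check ok"
  else
    echo "health check failed (is the server running at $BASE_URL?)"
  fi
}

health_check
```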
You can easily run the application using Docker:

```bash
# Build the Docker image
npm run docker:build

# Run the container
npm run docker:run
```

Or use the provided `docker-compose.yml`:

```bash
docker-compose up -d
```

The API provides the following main endpoints:
### `GET /api/app/health`

Returns detailed system information and checks whether all components are working correctly.

### `POST /api/scrape/start`

Main endpoint for web scraping operations. Configure your scraping workflow with a detailed JSON structure.

<details>
<summary>View example request body</summary>

```json
{
  "proxy": {
    "bypassCode": "your_secure_password",
    "auth": {
      "enabled": true,
      "username": "proxyuser",
      "password": "proxypass"
    },
    "servers": [
      {
        "server": "proxy1.example.com",
        "port": 8080
      },
      {
        "server": "proxy2.example.com",
        "port": 8081
      }
    ]
  },
  "record": {
    "title": "Google Search Example",
    "speedMode": "NORMAL",
    "timeoutMode": "NORMAL",
    "steps": [
      {
        "type": "navigate",
        "url": "https://www.google.com"
      },
      {
        "type": "wait",
        "value": "1000"
      },
      {
        "type": "setViewport",
        "width": 1366,
        "height": 768
      },
      {
        "type": "click",
        "selectors": [["#L2AGLb"]]
      },
      {
        "type": "change",
        "selectors": [["input[name='q']"]],
        "value": "web scraping api"
      },
      {
        "type": "click",
        "selectors": [["input[name='btnK']"]]
      },
      {
        "type": "waitForElement",
        "selectors": [["#search"]]
      }
    ]
  },
  "capture": {
    "selectors": [
      {
        "key": "search_results",
        "type": "CSS",
        "value": "#search"
      },
      {
        "key": "page_title",
        "type": "CSS",
        "value": "title"
      }
    ]
  },
  "headers": {
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
  },
  "output": {
    "screenshots": {
      "onError": true,
      "onSuccess": true
    },
    "responseType": "JSON"
  }
}
```

</details>

### `POST /api/scrape/test`

Runs a predefined scraping test using a fixed configuration. This endpoint is useful for:
- Testing if the scraping service is working correctly
- Checking proxy connectivity
- Validating browser functionality
The test endpoint uses a predefined configuration from `constants.js` with a sample scrape request that checks your IP address over a proxied connection.
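Invoking the test endpoint can be sketched as follows. This assumes the sample credentials from the `.env` template; replace them with your own `AUTH_USERNAME`/`AUTH_PASSWORD` values.

```shell
#!/bin/sh
# Trigger the predefined test scrape (a sketch; the credentials are the
# sample .env values, not real defaults).
BASE_URL="${BASE_URL:-http://localhost:3000}"

run_test_scrape() {
  if curl -fsS -X POST -u admin:secretpassword "$BASE_URL/api/scrape/test"; then
    echo "test scrape succeeded"
  else
    echo "test scrape failed (check server, proxy, and credentials)"
  fi
}

run_test_scrape
```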
### `POST /api/app/shutdown`

Safely shuts down the application.

### `POST /api/os/restart`

Initiates an operating system restart (requires appropriate permissions).
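A minimal call to `/api/scrape/start` might look like the sketch below. It uses only field names that appear in the example request body above; the target URL, credentials, and the omission of the `proxy` block are assumptions — depending on your configuration, a `proxy` block or the `SCRAPE_PROXY_BYPASS_CODE` may be required.

```shell
#!/bin/sh
# Minimal /api/scrape/start request (a sketch; field names come from the
# example request body above, credentials are the sample .env values).
BASE_URL="${BASE_URL:-http://localhost:3000}"

PAYLOAD='{
  "record": {
    "title": "Example Scrape",
    "speedMode": "NORMAL",
    "steps": [
      { "type": "navigate", "url": "https://example.com" },
      { "type": "waitForElement", "selectors": [["h1"]] }
    ]
  },
  "capture": {
    "selectors": [
      { "key": "heading", "type": "CSS", "value": "h1" }
    ]
  },
  "output": { "responseType": "JSON" }
}'

# Show the payload, then send it.
printf '%s\n' "$PAYLOAD"

curl -fsS -X POST -u admin:secretpassword \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  "$BASE_URL/api/scrape/start" || echo "scrape request failed"
```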
For complete API documentation, visit the Swagger UI endpoint after starting the application:

`http://localhost:3000/api/docs`
Data can be extracted using different selector methods:

| Selector Type | Usage |
|---|---|
| `CSS` | Standard CSS selectors |
| `XPATH` | XPath expressions |
| `FULL` | Retrieves the full page HTML content |
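The three selector types can be mixed in a single `capture` block. A hedged sketch, reusing the structure from the example request body above — the empty `value` for the `FULL` entry is an assumption that the field is ignored for full-page capture:

```json
"capture": {
  "selectors": [
    { "key": "title_css",   "type": "CSS",   "value": "title" },
    { "key": "title_xpath", "type": "XPATH", "value": "//title" },
    { "key": "full_page",   "type": "FULL",  "value": "" }
  ]
}
```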
The scraper supports multiple response formats:

| Type | Description |
|---|---|
| `JSON` | Returns structured JSON with data and metadata |
| `RAW` | Returns raw content from the first selector |
| `NONE` | No response content (useful for headless operations) |
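For example, to receive only the raw content of the first selector and skip success screenshots, the `output` block from the example request could be adjusted like this (a sketch based on the fields shown above):

```json
"output": {
  "screenshots": { "onError": true, "onSuccess": false },
  "responseType": "RAW"
}
```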
The API implements a consistent error handling pattern:
- Standardized Format: All errors return a consistent JSON structure
- Contextual Information: Includes error code, message, and related data
- Debug Support: Stack traces included in development mode
- Visual Evidence: Error screenshots for visual debugging
- Step Identification: Clear indication of which step in the process failed
- Proxy Errors: Detailed information about proxy-related issues
Example error response:

```json
{
  "success": false,
  "data": {
    "message": "Failed to execute click operation on element",
    "code": "ERROR_ELEMENT_NOT_FOUND",
    "stepIndex": 3,
    "screenshotUrl": "/tmp/error-screenshot-123456.png"
  }
}
```

The project includes several command-line utility scripts:
```bash
# Start the application
npm start

# Run tests
npm test

# Lint code
npm run lint

# Fix linting issues
npm run lint:fix

# Format code with Prettier
npm run format

# Docker operations
npm run docker:build # Build Docker image
npm run docker:run   # Run Docker container
```

```
.
├── index.js                          # Entry point
├── docker-compose.yml                # Docker Compose configuration
├── Dockerfile                        # Docker configuration
├── src/                              # Application source code
│   ├── app.js                        # Express app configuration
│   ├── config.js                     # Configuration module
│   ├── constants.js                  # Constants and enums
│   ├── controllers/                  # Request handlers
│   │   ├── error-handler.js          # Global error handling middleware
│   │   └── api/                      # API controllers
│   ├── helpers/                      # Helper functions
│   │   ├── browser-semaphore.js      # Browser instance management
│   │   ├── cleanup-screenshots.js    # Screenshot cleanup utility
│   │   ├── do-scraping.js            # Main scraping logic
│   │   ├── proxies-random-get-one.js # Proxy rotation utility
│   │   ├── scrape-validate-req-body.js # Request validation
│   │   └── validators.js             # Schema validation definitions
│   ├── routes/                       # API route definitions
│   └── utils/                        # Utility middleware
├── __tests__/                        # Test files
└── tmp/                              # Temporary files directory
```
Contributions are welcome! Please read our Contributing Guide before submitting a Pull Request.
By participating in this project, you agree to abide by our Code of Conduct.
For security vulnerabilities, please see our Security Policy. Do not open a public issue for security concerns.
This project is licensed under the ISC License - see the LICENSE file for details.