
ravidhu/html-data-scraper


html-data-scraper

License: MIT

A resilient, stealth-enabled web scraper built on Playwright. Scrape data from multiple URLs concurrently using declarative CSS selectors or custom JavaScript evaluation, with built-in anti-detection and automatic retries.

Features

  • Playwright-powered -- uses Chromium via Playwright for reliable, modern browser automation
  • Concurrent scraping -- distributes URLs across multiple browser tabs automatically
  • Declarative CSS selectors -- extract data using simple selector strings alongside custom functions
  • Stealth by default -- randomized user agents, viewports, and human-like delays
  • Resilient crawling -- automatic retries with exponential backoff, rate limiting, and error collection
  • TypeScript-first -- fully typed API with strict null checks
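
To make "automatic retries with exponential backoff" concrete, here is a minimal sketch of the general technique. It is not taken from html-data-scraper's source: the names `backoffDelayMs` and `withRetries`, the 500 ms base delay, and the 10 s cap are all assumptions chosen for illustration.

```typescript
// Hypothetical helpers for illustration; not part of html-data-scraper's API.

/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 10000): number {
    return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

/** Runs `task`, retrying up to `retries` extra times with a growing pause in between. */
async function withRetries<T>(
    task: () => Promise<T>,
    retries: number,
    baseMs = 500,
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await task();
        } catch (error) {
            lastError = error;
            if (attempt < retries) {
                // With the default base: wait 500 ms, then 1000 ms, then 2000 ms, ...
                const delay = backoffDelayMs(attempt, baseMs);
                await new Promise<void>((resolve) => setTimeout(resolve, delay));
            }
        }
    }
    // Every attempt failed: surface the last error to the caller.
    throw lastError;
}
```

Doubling the delay gives a struggling server progressively more room to recover, and the cap keeps a long retry chain from stalling the whole batch.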

Quick start

npm install html-data-scraper

import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
    'https://en.wikipedia.org/wiki/Web_scraping',
], {
    onEvaluateForEachUrl: {
        heading: '#firstHeading',                  // CSS selector -> textContent
        title: () => document.title,               // function -> page.evaluate()
    },
});

console.log(results[0].evaluates);
// { heading: 'Web scraping', title: 'Web scraping - Wikipedia' }

await browserInstance.close();
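
The comments in the example above say a string is treated as a CSS selector whose textContent is read, while a function is passed to page.evaluate(). One way such a dispatch could look is sketched below. This is an assumption about the shape of the logic, not the library's code: `PageLike`, `Extractor`, and `runExtractors` are hypothetical names, and `PageLike` stands in for the two Playwright `Page` methods involved so the sketch stays self-contained.

```typescript
// Minimal stand-in for the two Playwright Page methods used below.
interface PageLike {
    textContent(selector: string): Promise<string | null>;
    evaluate<R>(fn: () => R): Promise<R>;
}

// A string is read as a CSS selector; a function is evaluated in the page.
type Extractor = string | (() => unknown);

// Hypothetical helper: resolves every named extractor against a single page.
async function runExtractors(
    page: PageLike,
    extractors: Record<string, Extractor>,
): Promise<Record<string, unknown>> {
    const evaluates: Record<string, unknown> = {};
    for (const name of Object.keys(extractors)) {
        const extractor = extractors[name];
        evaluates[name] = typeof extractor === 'string'
            ? await page.textContent(extractor) // CSS selector -> textContent
            : await page.evaluate(extractor);   // function -> runs in the page
    }
    return evaluates;
}
```

With real Playwright, a function passed to page.evaluate() is serialized and executed inside the browser page, so it cannot rely on variables captured from the surrounding Node.js scope.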

Documentation

Guide              Description
Getting Started    Installation, basic usage, and first scraper
API Reference      Full API documentation with all types and options
Stealth            Anti-detection features and configuration
Resilience         Retries, rate limiting, and error handling
Migration from v1  Upgrading from v1 (Puppeteer) to v2 (Playwright)
Contributing       Development setup and contribution guidelines

Examples

Ready-to-run example projects in the examples/ folder:

Example            Description
Wikipedia Scraper  Scrape structured data from 6 Wikipedia articles across 3 concurrent tabs using CSS selectors, functions, route interception, and progress tracking
News Monitor       Monitor headlines from 5 international news sites with stealth, rate limiting, retries, screenshots, and graceful error handling

cd examples/wikipedia-scraper
npm install && npx playwright install chromium
npm start
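
The Wikipedia example spreads 6 articles across 3 concurrent tabs. The README does not say how the library assigns URLs to tabs, so the following is only one plausible scheme, a round-robin split. `distributeUrls` is a hypothetical name.

```typescript
// Hypothetical helper: assigns URLs to at most `tabCount` buckets in round-robin order.
function distributeUrls(urls: string[], tabCount: number): string[][] {
    if (!Number.isInteger(tabCount) || tabCount < 1) {
        throw new RangeError('tabCount must be a positive integer');
    }
    // Never create more buckets than there are URLs to fill them.
    const bucketCount = Math.min(tabCount, urls.length);
    const buckets: string[][] = Array.from({ length: bucketCount }, (): string[] => []);
    urls.forEach((url, index) => {
        buckets[index % bucketCount].push(url);
    });
    return buckets;
}
```

For 6 URLs and 3 tabs this yields three buckets of two URLs each: the first tab gets URLs 1 and 4, the second gets 2 and 5, and the third gets 3 and 6.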

Example: scrape with error tolerance

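import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
    'https://example.com/page-1',
    'https://example.com/page-2',
    'https://this-will-fail.invalid',
], {
    onEvaluateForEachUrl: {
        title: 'h1',                // CSS selector -> textContent
    },
    resilience: {
        retries: 2,                 // retry each failing URL up to 2 times
        continueOnError: true,      // keep going when a URL still fails
    },
});

// Failed URLs have an error field instead of crashing the batch
for (const result of results) {
    if (result.error) {
        console.error(`Failed: ${result.url} - ${result.error}`);
    } else {
        console.log(`${result.url}: ${result.evaluates?.title}`);
    }
}

// Close the browser once all results have been processed
await browserInstance.close();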
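
When a batch mixes successes and failures, it can help to separate the two groups before reporting. The sketch below assumes a result shape limited to the fields the example above reads (`url`, `evaluates`, `error`). The `ScrapeResult` interface and `partitionResults` helper are hypothetical and are not exported by the library.

```typescript
// Assumed result shape, limited to the fields used in the example above.
interface ScrapeResult {
    url: string;
    evaluates?: Record<string, unknown>;
    error?: unknown;
}

// Hypothetical helper: separates successful results from failed ones.
function partitionResults(
    results: ScrapeResult[],
): { succeeded: ScrapeResult[]; failed: ScrapeResult[] } {
    const succeeded: ScrapeResult[] = [];
    const failed: ScrapeResult[] = [];
    for (const result of results) {
        // A truthy `error` field marks a URL that failed after all retries.
        (result.error ? failed : succeeded).push(result);
    }
    return { succeeded, failed };
}
```

The `failed` list can then be logged or fed into a later run, while `succeeded` goes on to normal processing.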

License

MIT -- Ravidhu Dissanayake
