html-data-scraper

A resilient, stealth-enabled web scraper built on Playwright. Scrape data from multiple URLs concurrently using declarative CSS selectors or custom JavaScript evaluation, with built-in anti-detection and automatic retries.

Features

Playwright-powered -- uses Chromium via Playwright for reliable, modern browser automation
Concurrent scraping -- distributes URLs across multiple browser tabs automatically
Declarative CSS selectors -- extract data using simple selector strings alongside custom functions
Stealth by default -- randomized user agents, viewports, and human-like delays
Resilient crawling -- automatic retries with exponential backoff, rate limiting, and error collection
TypeScript-first -- fully typed API with strict null checks
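
To make the "concurrent scraping" feature concrete, here is a minimal sketch of one way a URL list could be spread across a fixed number of tabs. It illustrates a round-robin split only; the function name `distributeUrls` and the `tabCount` parameter are hypothetical and are not taken from html-data-scraper's API, whose actual distribution strategy may differ.

```typescript
// Hypothetical helper for illustration; not part of html-data-scraper.
// Splits `urls` round-robin into at most `tabCount` buckets, one bucket per tab.
function distributeUrls(urls: string[], tabCount: number): string[][] {
    // Never create more buckets than URLs, and always create at least one.
    const tabs = Math.max(1, Math.min(tabCount, urls.length));
    const buckets: string[][] = Array.from({ length: tabs }, () => []);

    urls.forEach((url, index) => {
        buckets[index % tabs].push(url);
    });

    return buckets;
}

// Six URLs across three tabs gives two URLs per tab.
const exampleBuckets = distributeUrls(
    ['a', 'b', 'c', 'd', 'e', 'f'].map((p) => `https://example.com/${p}`),
    3,
);
console.log(exampleBuckets.map((bucket) => bucket.length)); // [ 2, 2, 2 ]
```

A round-robin split keeps bucket sizes within one URL of each other, which is usually enough when pages take roughly similar time to load.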

Quick start

npm install html-data-scraper

import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
    'https://en.wikipedia.org/wiki/Web_scraping',
], {
    onEvaluateForEachUrl: {
        heading: '#firstHeading',                  // CSS selector -> textContent
        title: () => document.title,               // function -> page.evaluate()
    },
});

console.log(results[0].evaluates);
// { heading: 'Web scraping', title: 'Web scraping - Wikipedia' }

await browserInstance.close();

Documentation

Guide	Description
Getting Started	Installation, basic usage, and first scraper
API Reference	Full API documentation with all types and options
Stealth	Anti-detection features and configuration
Resilience	Retries, rate limiting, and error handling
Migration from v1	Upgrading from v1 (Puppeteer) to v2 (Playwright)
Contributing	Development setup and contribution guidelines

Examples

Ready-to-run example projects in the examples/ folder:

Example	Description
Wikipedia Scraper	Scrape structured data from 6 Wikipedia articles across 3 concurrent tabs using CSS selectors, functions, route interception, and progress tracking
News Monitor	Monitor headlines from 5 international news sites with stealth, rate limiting, retries, screenshots, and graceful error handling

cd examples/wikipedia-scraper
npm install && npx playwright install chromium
npm start

Example: scrape with error tolerance

const { results } = await htmlDataScraper([
    'https://example.com/page-1',
    'https://example.com/page-2',
    'https://this-will-fail.invalid',
], {
    onEvaluateForEachUrl: {
        title: 'h1',
    },
    resilience: {
        retries: 2,
        continueOnError: true,
    },
});

// Failed URLs have an error field instead of crashing the batch
for (const result of results) {
    if (result.error) {
        console.error(`Failed: ${result.url} - ${result.error}`);
    } else {
        console.log(`${result.url}: ${result.evaluates?.title}`);
    }
}
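
For readers unfamiliar with the retry behaviour configured above, the following is a generic sketch of retrying with exponential backoff. It is not the library's implementation: `backoffDelay`, `withRetries`, and the 500 ms base and 8000 ms cap are assumptions chosen for illustration.

```typescript
// Hypothetical helpers for illustration; not part of html-data-scraper.

// Delay before retrying after failed attempt `attempt` (0-based):
// baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 8000): number {
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `task` once, then retries up to `retries` more times,
// waiting an exponentially growing delay between attempts.
async function withRetries<T>(
    task: () => Promise<T>,
    retries = 2,
    baseMs = 500,
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await task();
        } catch (error) {
            lastError = error;
            if (attempt < retries) {
                const waitMs = backoffDelay(attempt, baseMs);
                await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
            }
        }
    }
    // Every attempt failed: surface the last error to the caller.
    throw lastError;
}
```

With `retries: 2` and these assumed defaults, a failing task would be attempted three times in total, with waits of 500 ms and then 1000 ms between attempts.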

License

MIT -- Ravidhu Dissanayake
