A resilient, stealth-enabled web scraper built on Playwright. Scrape data from multiple URLs concurrently using declarative CSS selectors or custom JavaScript evaluation, with built-in anti-detection and automatic retries.
- Playwright-powered -- uses Chromium via Playwright for reliable, modern browser automation
- Concurrent scraping -- distributes URLs across multiple browser tabs automatically
- Declarative CSS selectors -- extract data using simple selector strings alongside custom functions
- Stealth by default -- randomized user agents, viewports, and human-like delays
- Resilient crawling -- automatic retries with exponential backoff, rate limiting, and error collection
- TypeScript-first -- fully typed API with strict null checks
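The retry behaviour described above follows the usual exponential backoff pattern: each failed attempt waits roughly twice as long as the last before retrying. A minimal sketch of such a delay schedule (the base delay and cap here are illustrative values, not the library's actual defaults):

```typescript
// Illustrative backoff schedule: the delay doubles per attempt, up to a cap.
// baseMs and capMs are made-up numbers, not html-data-scraper's defaults.
function backoffDelay(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Attempts 0..4 wait 500, 1000, 2000, 4000, 8000 ms respectively.
console.log([0, 1, 2, 3, 4].map((a) => backoffDelay(a)));
```

Capping the delay keeps a long retry chain from stalling the whole batch on one slow or dead host.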
```bash
npm install html-data-scraper
```

```js
import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
  'https://en.wikipedia.org/wiki/Web_scraping',
], {
  onEvaluateForEachUrl: {
    heading: '#firstHeading',    // CSS selector -> textContent
    title: () => document.title, // function -> page.evaluate()
  },
});

console.log(results[0].evaluates);
// { heading: 'Web scraping', title: 'Web scraping - Wikipedia' }

await browserInstance.close();
```

| Guide | Description |
|---|---|
| Getting Started | Installation, basic usage, and first scraper |
| API Reference | Full API documentation with all types and options |
| Stealth | Anti-detection features and configuration |
| Resilience | Retries, rate limiting, and error handling |
| Migration from v1 | Upgrading from v1 (Puppeteer) to v2 (Playwright) |
| Contributing | Development setup and contribution guidelines |
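The result objects used throughout this README expose `url`, `evaluates`, and (on failure) `error` fields. A rough TypeScript sketch of that shape, inferred from the examples here rather than taken from the library's published declarations (check the package's own `.d.ts` for the authoritative types):

```typescript
// Hypothetical result shape inferred from the README examples.
interface ScrapeResult {
  url: string;
  evaluates?: Record<string, unknown>; // one entry per onEvaluateForEachUrl key
  error?: string;                      // present when the URL failed
}

// Small helper: split a finished batch into successes and failures.
function partitionResults(results: ScrapeResult[]) {
  const ok = results.filter((r) => !r.error);
  const failed = results.filter((r) => r.error);
  return { ok, failed };
}

const { ok, failed } = partitionResults([
  { url: 'https://example.com/page-1', evaluates: { title: 'Example' } },
  { url: 'https://this-will-fail.invalid', error: 'navigation failed' },
]);
console.log(ok.length, failed.length); // 1 1
```

Typing your handling code against a shape like this keeps `continueOnError` batches easy to post-process.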
Ready-to-run example projects in the examples/ folder:
| Example | Description |
|---|---|
| Wikipedia Scraper | Scrape structured data from 6 Wikipedia articles across 3 concurrent tabs using CSS selectors, functions, route interception, and progress tracking |
| News Monitor | Monitor headlines from 5 international news sites with stealth, rate limiting, retries, screenshots, and graceful error handling |
```bash
cd examples/wikipedia-scraper
npm install && npx playwright install chromium
npm start
```

```js
const { results } = await htmlDataScraper([
  'https://example.com/page-1',
  'https://example.com/page-2',
  'https://this-will-fail.invalid',
], {
  onEvaluateForEachUrl: {
    title: 'h1',
  },
  resilience: {
    retries: 2,
    continueOnError: true,
  },
});

// Failed URLs have an error field instead of crashing the batch
for (const result of results) {
  if (result.error) {
    console.error(`Failed: ${result.url} - ${result.error}`);
  } else {
    console.log(`${result.url}: ${result.evaluates?.title}`);
  }
}
```

MIT -- Ravidhu Dissanayake