- Node.js 18 or later
- npm or yarn
```bash
npm install html-data-scraper
```

This installs Playwright as a dependency. On first use, Playwright will download a Chromium browser automatically. You can also install it explicitly:

```bash
npx playwright install chromium
```

```js
import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
  'https://en.wikipedia.org/wiki/Web_scraping',
]);

// By default, pageData contains the full HTML content of the page
console.log(results[0].pageData.substring(0, 100));

await browserInstance.close();
```

Instead of writing JavaScript evaluation functions, you can pass a CSS selector string to extract the textContent of the first matching element:
```js
const { results, browserInstance } = await htmlDataScraper([
  'https://en.wikipedia.org/wiki/Web_scraping',
], {
  onEvaluateForEachUrl: {
    heading: '#firstHeading',
    firstParagraph: '.mw-parser-output > p:not(.mw-empty-elt)',
  },
});

console.log(results[0].evaluates);
// {
//   heading: 'Web scraping',
//   firstParagraph: 'Web scraping, web harvesting, or web data extraction...'
// }

await browserInstance.close();
```

You can mix CSS selector strings with custom evaluation functions in the same config:
```js
const { results, browserInstance } = await htmlDataScraper([
  'https://en.wikipedia.org/wiki/Web_scraping',
], {
  onEvaluateForEachUrl: {
    heading: '#firstHeading',                               // CSS selector
    linkCount: () => document.querySelectorAll('a').length, // function
  },
});

console.log(results[0].evaluates);
// { heading: 'Web scraping', linkCount: 342 }

await browserInstance.close();
```

The library automatically distributes URLs across multiple browser tabs (one tab per CPU core, minus one). You can also set the concurrency manually:
```js
const { results, browserInstance } = await htmlDataScraper([
  'https://example.com/page-1',
  'https://example.com/page-2',
  'https://example.com/page-3',
  'https://example.com/page-4',
  'https://example.com/page-5',
], {
  maxSimultaneousPages: 3,
  onEvaluateForEachUrl: {
    title: 'h1',
  },
  onProgress: (done, total, pageIndex) => {
    console.log(`Page ${pageIndex}: ${done}/${total}`);
  },
});

// results is an array of 5 PageResult objects, in the same order as the input URLs
await browserInstance.close();
```

Use the `beforeGoToUrl` and `onPageLoadedForEachUrl` hooks for advanced operations such as screenshots:
```js
const { results, browserInstance } = await htmlDataScraper([
  'https://example.com',
], {
  beforeGoToUrl: async (page) => {
    await page.setViewportSize({ width: 1280, height: 720 });
  },
  onPageLoadedForEachUrl: async (page, url) => {
    await page.screenshot({ path: 'screenshot.png' });
    return null;
  },
});

await browserInstance.close();
```

Use `onRoute` to intercept and modify network requests (e.g., block images for faster scraping):
```js
const { results, browserInstance } = await htmlDataScraper([
  'https://example.com',
], {
  onRoute: (route, request) => {
    if (request.resourceType() === 'image') {
      route.abort();
    } else {
      route.continue();
    }
  },
});

await browserInstance.close();
```

For complete, runnable projects that combine multiple features, see the `examples/` folder:
- Wikipedia Scraper -- CSS selectors, functions, concurrent tabs, route interception, and progress tracking
- News Monitor -- stealth, rate limiting, retries, screenshots, and graceful error collection across 5 news sites
- API Reference -- full documentation of all options and types
- Stealth -- configure anti-detection features
- Resilience -- retries, rate limiting, and error handling
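
The options shown above can also be combined in a single call. The sketch below builds one such configuration using only the options documented on this page (the selector names and values are placeholders, not part of the library's API); it constructs the options object without launching a browser:

```js
// Hypothetical combined configuration; selectors and values are illustrative only.
const options = {
  maxSimultaneousPages: 2,
  onEvaluateForEachUrl: {
    title: 'h1',                                            // CSS selector shorthand
    linkCount: () => document.querySelectorAll('a').length, // runs in the page context
  },
  onRoute: (route, request) => {
    // Skip image requests to speed up scraping
    if (request.resourceType() === 'image') route.abort();
    else route.continue();
  },
  onProgress: (done, total, pageIndex) => {
    console.log(`Page ${pageIndex}: ${done}/${total}`);
  },
};

console.log(Object.keys(options));
```

Passing `options` as the second argument to `htmlDataScraper`, along with a list of URLs, would apply selector extraction, request filtering, manual concurrency, and progress reporting in one run.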