Getting Started

Prerequisites

  • Node.js 18 or later
  • npm or yarn
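To confirm your runtime meets the requirement, you can check Node's built-in process.version; this is a general-purpose sketch, not part of the library:

```javascript
// Parse the major version from e.g. 'v18.19.0' and fail fast if too old
const major = Number(process.version.slice(1).split('.')[0]);
if (major < 18) {
    console.error(`Node.js 18 or later required, found ${process.version}`);
    process.exit(1);
}
console.log(`Node.js ${process.version} detected`);
```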

Installation

npm install html-data-scraper

This installs Playwright as a dependency. On first use, Playwright will download a Chromium browser automatically. You can also install it explicitly:

npx playwright install chromium

Your first scraper

import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
    'https://en.wikipedia.org/wiki/Web_scraping',
]);

// By default, pageData contains the full HTML content of the page
console.log(results[0].pageData.substring(0, 100));

await browserInstance.close();
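The browser stays open until you close it, so if processing the results can throw, it is worth wrapping the work in a try/finally. The withBrowser helper and stub scraper below are illustrative sketches of that general pattern, not part of html-data-scraper:

```javascript
// General cleanup pattern: close the browser even if result handling throws.
// withBrowser is a hypothetical helper, not part of html-data-scraper.
async function withBrowser(scrape, use) {
    const { results, browserInstance } = await scrape();
    try {
        return await use(results);
    } finally {
        await browserInstance.close();
    }
}

// Demo with a stub scraper; with the real library you would pass
// () => htmlDataScraper(urls, options) as the first argument instead.
const stubScrape = async () => ({
    results: [{ pageData: '<html>…</html>' }],
    browserInstance: { close: async () => console.log('browser closed') },
});

withBrowser(stubScrape, (results) => results[0].pageData)
    .then((data) => console.log(data));
```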

Extracting data with CSS selectors

Instead of writing JavaScript evaluation functions, pass a CSS selector string to extract the textContent of the first matching element:

const { results, browserInstance } = await htmlDataScraper([
    'https://en.wikipedia.org/wiki/Web_scraping',
], {
    onEvaluateForEachUrl: {
        heading: '#firstHeading',
        firstParagraph: '.mw-parser-output > p:not(.mw-empty-elt)',
    },
});

console.log(results[0].evaluates);
// {
//     heading: 'Web scraping',
//     firstParagraph: 'Web scraping, web harvesting, or web data extraction...'
// }

await browserInstance.close();
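A selector string behaves roughly like the evaluation function below. This is an illustrative sketch of the shorthand's behavior, not the library's actual implementation:

```javascript
// Roughly what a selector string expands to: textContent of the
// first matching element, or null when nothing matches (a sketch).
const selectorToFn = (selector) => () =>
    document.querySelector(selector)?.textContent ?? null;

// The real library runs such functions inside the page; here we stub
// `document` in Node purely to demonstrate the behavior.
globalThis.document = {
    querySelector: (sel) =>
        sel === '#firstHeading' ? { textContent: 'Web scraping' } : null,
};

console.log(selectorToFn('#firstHeading')()); // 'Web scraping'
console.log(selectorToFn('.missing')());      // null
```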

Mixing selectors and functions

You can mix CSS selector strings with custom evaluation functions in the same config:

const { results, browserInstance } = await htmlDataScraper([
    'https://en.wikipedia.org/wiki/Web_scraping',
], {
    onEvaluateForEachUrl: {
        heading: '#firstHeading',                       // CSS selector
        linkCount: () => document.querySelectorAll('a').length,  // function
    },
});

console.log(results[0].evaluates);
// { heading: 'Web scraping', linkCount: 342 }

await browserInstance.close();

Scraping multiple URLs concurrently

The library automatically distributes URLs across multiple browser tabs, by default opening one tab per CPU core minus one. You can override this with the maxSimultaneousPages option:

const { results, browserInstance } = await htmlDataScraper([
    'https://example.com/page-1',
    'https://example.com/page-2',
    'https://example.com/page-3',
    'https://example.com/page-4',
    'https://example.com/page-5',
], {
    maxSimultaneousPages: 3,
    onEvaluateForEachUrl: {
        title: 'h1',
    },
    onProgress: (done, total, pageIndex) => {
        console.log(`Page ${pageIndex}: ${done}/${total}`);
    },
});

// results is an array of 5 PageResult objects, in order
await browserInstance.close();
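The default tab count corresponds roughly to the CPU count minus one. A sketch of that computation using Node's built-in os module (the library's exact logic may differ):

```javascript
import os from 'node:os';

// Default concurrency sketch: one tab per CPU core minus one,
// with a floor of 1 so single-core machines still get a tab.
const defaultTabs = Math.max(os.cpus().length - 1, 1);
console.log(`Would open ${defaultTabs} tab(s) by default`);
```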

Taking screenshots

Use the beforeGoToUrl and onPageLoadedForEachUrl hooks for advanced operations like screenshots:

const { results, browserInstance } = await htmlDataScraper([
    'https://example.com',
], {
    beforeGoToUrl: async (page) => {
        await page.setViewportSize({ width: 1280, height: 720 });
    },
    onPageLoadedForEachUrl: async (page, url) => {
        await page.screenshot({ path: 'screenshot.png' });
        return null;
    },
});

await browserInstance.close();
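When scraping several URLs, a fixed path like screenshot.png would be overwritten by each page. One way to avoid that is to derive the filename from the URL; the helper below is a hypothetical sketch, not part of the library:

```javascript
// Turn a URL into a filesystem-safe screenshot name, so concurrent
// pages write to distinct files instead of clobbering screenshot.png.
const screenshotName = (url) =>
    url.replace(/^https?:\/\//, '').replace(/[^a-zA-Z0-9._-]/g, '_') + '.png';

console.log(screenshotName('https://example.com/page-1'));
// 'example.com_page-1.png'
```

Inside onPageLoadedForEachUrl you would then call page.screenshot({ path: screenshotName(url) }).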

Intercepting requests

Use onRoute to intercept and modify network requests (e.g., block images for faster scraping):

const { results, browserInstance } = await htmlDataScraper([
    'https://example.com',
], {
    onRoute: (route, request) => {
        if (request.resourceType() === 'image') {
            route.abort();
        } else {
            route.continue();
        }
    },
});

await browserInstance.close();
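The same pattern extends to other resource types; a small predicate keeps the routing callback readable. The blocklist below is an illustrative choice, not a library default:

```javascript
// Resource types to skip; images, fonts, and stylesheets are common
// choices when only the HTML matters (an illustrative blocklist).
const BLOCKED_TYPES = new Set(['image', 'font', 'stylesheet']);
const shouldBlock = (resourceType) => BLOCKED_TYPES.has(resourceType);

console.log(shouldBlock('image'));    // true
console.log(shouldBlock('document')); // false
```

In onRoute you would then write: shouldBlock(request.resourceType()) ? route.abort() : route.continue().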

Full examples

For complete, runnable projects that combine multiple features, see the examples/ folder:

  • Wikipedia Scraper -- CSS selectors, functions, concurrent tabs, route interception, and progress tracking
  • News Monitor -- stealth, rate limiting, retries, screenshots, and graceful error collection across 5 news sites

Next steps

  • API Reference -- full documentation of all options and types
  • Stealth -- configure anti-detection features
  • Resilience -- retries, rate limiting, and error handling