Getting Started

Prerequisites

  • Node.js 18 or later
  • npm or yarn
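To confirm your runtime meets the requirement, you can check Node's built-in process.version; this is a general-purpose sketch, not part of the library:

```javascript
// Parse the major version from e.g. 'v18.19.0' and fail fast if too old
const major = Number(process.version.slice(1).split('.')[0]);
if (major < 18) {
    console.error(`Node.js 18 or later required, found ${process.version}`);
    process.exit(1);
}
console.log(`Node.js ${process.version} detected`);
```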

Installation

npm install html-data-scraper

This installs Playwright as a dependency. On first use, Playwright will download a Chromium browser automatically. You can also install it explicitly:

npx playwright install chromium

Your first scraper

import htmlDataScraper from 'html-data-scraper';

const { results, browserInstance } = await htmlDataScraper([
    'https://en.wikipedia.org/wiki/Web_scraping',
]);

// By default, pageData contains the full HTML content of the page
console.log(results[0].pageData.substring(0, 100));

await browserInstance.close();
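The browser stays open until you close it, so if processing the results can throw, it is worth wrapping the work in a try/finally. The withBrowser helper and stub scraper below are illustrative sketches of that general pattern, not part of html-data-scraper:

```javascript
// General cleanup pattern: close the browser even if result handling throws.
// withBrowser is a hypothetical helper, not part of html-data-scraper.
async function withBrowser(scrape, use) {
    const { results, browserInstance } = await scrape();
    try {
        return await use(results);
    } finally {
        await browserInstance.close();
    }
}

// Demo with a stub scraper; with the real library you would pass
// () => htmlDataScraper(urls, options) as the first argument instead.
const stubScrape = async () => ({
    results: [{ pageData: '<html>…</html>' }],
    browserInstance: { close: async () => console.log('browser closed') },
});

withBrowser(stubScrape, (results) => results[0].pageData)
    .then((data) => console.log(data));
```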

Extracting data with CSS selectors

Instead of writing JavaScript evaluation functions, pass a CSS selector string to extract the textContent of the first matching element:

const { results, browserInstance } = await htmlDataScraper([
    'https://en.wikipedia.org/wiki/Web_scraping',
], {
    onEvaluateForEachUrl: {
        heading: '#firstHeading',
        firstParagraph: '.mw-parser-output > p:not(.mw-empty-elt)',
    },
});

console.log(results[0].evaluates);
// {
//     heading: 'Web scraping',
//     firstParagraph: 'Web scraping, web harvesting, or web data extraction...'
// }

await browserInstance.close();
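A selector string behaves roughly like the evaluation function below. This is an illustrative sketch of the shorthand's behavior, not the library's actual implementation:

```javascript
// Roughly what a selector string expands to: textContent of the
// first matching element, or null when nothing matches (a sketch).
const selectorToFn = (selector) => () =>
    document.querySelector(selector)?.textContent ?? null;

// The real library runs such functions inside the page; here we stub
// `document` in Node purely to demonstrate the behavior.
globalThis.document = {
    querySelector: (sel) =>
        sel === '#firstHeading' ? { textContent: 'Web scraping' } : null,
};

console.log(selectorToFn('#firstHeading')()); // 'Web scraping'
console.log(selectorToFn('.missing')());      // null
```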

Mixing selectors and functions

You can mix CSS selector strings with custom evaluation functions in the same config:

const { results, browserInstance } = await htmlDataScraper([
    'https://en.wikipedia.org/wiki/Web_scraping',
], {
    onEvaluateForEachUrl: {
        heading: '#firstHeading',                       // CSS selector
        linkCount: () => document.querySelectorAll('a').length,  // function
    },
});

console.log(results[0].evaluates);
// { heading: 'Web scraping', linkCount: 342 }

await browserInstance.close();

Scraping multiple URLs concurrently

The library automatically distributes URLs across multiple browser tabs, by default opening one tab per CPU core minus one. You can override this with the maxSimultaneousPages option:

const { results, browserInstance } = await htmlDataScraper([
    'https://example.com/page-1',
    'https://example.com/page-2',
    'https://example.com/page-3',
    'https://example.com/page-4',
    'https://example.com/page-5',
], {
    maxSimultaneousPages: 3,
    onEvaluateForEachUrl: {
        title: 'h1',
    },
    onProgress: (done, total, pageIndex) => {
        console.log(`Page ${pageIndex}: ${done}/${total}`);
    },
});

// results is an array of 5 PageResult objects, in order
await browserInstance.close();
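The default tab count corresponds roughly to the CPU count minus one. A sketch of that computation using Node's built-in os module (the library's exact logic may differ):

```javascript
import os from 'node:os';

// Default concurrency sketch: one tab per CPU core minus one,
// with a floor of 1 so single-core machines still get a tab.
const defaultTabs = Math.max(os.cpus().length - 1, 1);
console.log(`Would open ${defaultTabs} tab(s) by default`);
```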

Taking screenshots

Use the beforeGoToUrl and onPageLoadedForEachUrl hooks for advanced operations like screenshots:

const { results, browserInstance } = await htmlDataScraper([
    'https://example.com',
], {
    beforeGoToUrl: async (page) => {
        await page.setViewportSize({ width: 1280, height: 720 });
    },
    onPageLoadedForEachUrl: async (page, url) => {
        await page.screenshot({ path: 'screenshot.png' });
        return null;
    },
});

await browserInstance.close();
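When scraping several URLs, a fixed path like screenshot.png would be overwritten by each page. One way to avoid that is to derive the filename from the URL; the helper below is a hypothetical sketch, not part of the library:

```javascript
// Turn a URL into a filesystem-safe screenshot name, so concurrent
// pages write to distinct files instead of clobbering screenshot.png.
const screenshotName = (url) =>
    url.replace(/^https?:\/\//, '').replace(/[^a-zA-Z0-9._-]/g, '_') + '.png';

console.log(screenshotName('https://example.com/page-1'));
// 'example.com_page-1.png'
```

Inside onPageLoadedForEachUrl you would then call page.screenshot({ path: screenshotName(url) }).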

Intercepting requests

Use onRoute to intercept and modify network requests (e.g., block images for faster scraping):

const { results, browserInstance } = await htmlDataScraper([
    'https://example.com',
], {
    onRoute: (route, request) => {
        if (request.resourceType() === 'image') {
            route.abort();
        } else {
            route.continue();
        }
    },
});

await browserInstance.close();
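The same pattern extends to other resource types; a small predicate keeps the routing callback readable. The blocklist below is an illustrative choice, not a library default:

```javascript
// Resource types to skip; images, fonts, and stylesheets are common
// choices when only the HTML matters (an illustrative blocklist).
const BLOCKED_TYPES = new Set(['image', 'font', 'stylesheet']);
const shouldBlock = (resourceType) => BLOCKED_TYPES.has(resourceType);

console.log(shouldBlock('image'));    // true
console.log(shouldBlock('document')); // false
```

In onRoute you would then write: shouldBlock(request.resourceType()) ? route.abort() : route.continue().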

Full examples

For complete, runnable projects that combine multiple features, see the examples/ folder:

  • Wikipedia Scraper -- CSS selectors, functions, concurrent tabs, route interception, and progress tracking
  • News Monitor -- stealth, rate limiting, retries, screenshots, and graceful error collection across 5 news sites

Next steps

  • API Reference -- full documentation of all options and types
  • Stealth -- configure anti-detection features
  • Resilience -- retries, rate limiting, and error handling