
API Reference

htmlDataScraper(urls, configurations?, customBrowser?)

The main entry point for scraping.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `urls` | `string[]` | required | Array of URLs to scrape |
| `configurations` | `CustomConfigurations \| null` | `null` | Configuration options (see below) |
| `customBrowser` | `Browser \| null` | `null` | A Playwright `Browser` instance to reuse |

Returns

```typescript
Promise<{ results: PageResult[]; browserInstance: Browser }>
```

  • results -- an array of PageResult objects, one per URL, in the same order as the input
  • browserInstance -- the Playwright Browser instance used (close it when done)
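Putting the pieces together, a minimal call might look like this (the URLs are placeholders, and the defaults described below are assumed):

```typescript
import { htmlDataScraper } from 'html-data-scraper';

// Scrape two pages with the default configuration.
const { results, browserInstance } = await htmlDataScraper([
    'https://example.com/',
    'https://example.org/',
]);

for (const result of results) {
    // With no hooks configured, pageData holds the full HTML string.
    console.log(result.url, typeof result.pageData);
}

// The browser is returned so it can be reused; close it when done.
await browserInstance.close();
```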

Errors

| Error message | Cause |
| --- | --- |
| `urlListIsEmpty` | Empty `urls` array was provided |
| `maxSimultaneousPagesNotSet` | `maxSimultaneousPages` was explicitly set to `0` or a falsy value |

CustomConfigurations

All fields are optional.

Core options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxSimultaneousPages` | `number` | CPU cores - 1 | Number of concurrent browser tabs |
| `additionalWaitSeconds` | `number` | `1` | Extra wait time after page load (seconds) |

Playwright options

```typescript
playwrightOptions?: {
    launch?: LaunchOptions;
    context?: BrowserContextOptions;
    navigationWaitUntil?: 'load' | 'domcontentloaded' | 'networkidle' | 'commit';
}
```

| Option | Default | Description |
| --- | --- | --- |
| `launch` | `{ headless: true }` | Passed to `chromium.launch()`. See Playwright `LaunchOptions`. |
| `context` | `{}` | Passed to `browser.newContext()`. See Playwright `BrowserContextOptions`. |
| `navigationWaitUntil` | `'networkidle'` | When to consider navigation complete: `'load'`, `'domcontentloaded'`, `'networkidle'`, or `'commit'`. |
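For example, a configuration that runs a headed browser with a custom locale and a faster navigation condition could look like this (the option values are illustrative):

```typescript
const configurations = {
    playwrightOptions: {
        launch: { headless: false },              // show the browser window
        context: { locale: 'en-US' },             // forwarded to browser.newContext()
        navigationWaitUntil: 'domcontentloaded',  // don't wait for network idle
    },
};
```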

Stealth options

See Stealth documentation for details.

```typescript
stealth?: StealthOptions;
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | `boolean` | `true` | Enable stealth features |
| `randomizeUserAgent` | `boolean` | `true` | Use a random user agent per context |
| `randomizeViewport` | `boolean` | `true` | Use a random common viewport size |
| `humanLikeDelays` | `boolean` | `true` | Add random delays between navigations |
| `userAgents` | `string[]` | built-in list | Custom user agent strings to pick from |
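A configuration that keeps stealth on but trades human-like pacing for speed and supplies its own user-agent pool might look like this sketch (the user-agent string is just an example value):

```typescript
const configurations = {
    stealth: {
        enabled: true,
        randomizeUserAgent: true,   // pick from the list below per context
        humanLikeDelays: false,     // skip random delays between navigations
        userAgents: [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        ],
    },
};
```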

Resilience options

See Resilience documentation for details.

```typescript
resilience?: ResilienceOptions;
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `retries` | `number` | `2` | Number of retry attempts per URL |
| `retryDelayMs` | `number` | `1000` | Base delay for exponential backoff (ms) |
| `timeoutMs` | `number` | `30000` | Navigation timeout per page (ms) |
| `continueOnError` | `boolean` | `true` | Collect errors instead of throwing |
| `rateLimitPerSecond` | `number` | `0` | Max requests per second (0 = unlimited) |
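A resilience configuration tuned for a slow, flaky target might look like this (the numbers are illustrative, not recommendations):

```typescript
const configurations = {
    resilience: {
        retries: 3,               // up to 3 retry attempts per URL
        retryDelayMs: 500,        // base delay for exponential backoff
        timeoutMs: 15000,         // fail a navigation after 15s
        continueOnError: true,    // record failures in PageResult.error
        rateLimitPerSecond: 2,    // be polite: at most 2 requests per second
    },
};
```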

Hooks

beforeGoToUrl

```typescript
beforeGoToUrl?: (page: Page) => Promise<void>;
```

Called before each URL is loaded. Use it to configure the page (e.g., set viewport, inject cookies).
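For instance, a hook that injects a session cookie before navigation could be sketched as follows (the cookie name, value, and domain are hypothetical):

```typescript
import type { Page } from 'playwright';

const configurations = {
    beforeGoToUrl: async (page: Page): Promise<void> => {
        // Inject a (hypothetical) session cookie into the page's context.
        await page.context().addCookies([
            { name: 'session', value: 'abc123', domain: 'example.com', path: '/' },
        ]);
    },
};
```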

onRoute

```typescript
onRoute?: (route: Route, request: Request) => void;
```

Called for every network request the page makes. Use it to block resources, modify headers, or mock responses. Replaces Puppeteer's setRequestInterception pattern.

You must call exactly one of `route.continue()`, `route.abort()`, or `route.fulfill()` for each intercepted request.

onPageLoadedForEachUrl

```typescript
onPageLoadedForEachUrl?: (page: Page, currentUrl: string) => any;
```

Called after each page finishes loading. The return value is stored in PageResult.pageData. If not set, pageData defaults to the full HTML content via page.content().
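For example, to store a small summary object instead of the full HTML, a hook might look like this sketch:

```typescript
import type { Page } from 'playwright';

const configurations = {
    onPageLoadedForEachUrl: async (page: Page, currentUrl: string) => {
        const html = await page.content();
        // The returned object becomes PageResult.pageData for this URL.
        return {
            url: currentUrl,
            title: await page.title(),
            htmlLength: html.length,
        };
    },
};
```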

onProgress

```typescript
onProgress?: (resultNumber: number, totalNumber: number, internalPageIndex: number) => void;
```

Called after each URL finishes processing within a page tab.

  • resultNumber -- number of URLs processed so far in this tab (1-indexed)
  • totalNumber -- total URLs assigned to this tab
  • internalPageIndex -- the tab index (0-indexed)
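A simple per-tab progress logger could be built on these arguments like this (the `formatProgress` helper is ours, not part of the API):

```typescript
// Format a per-tab progress line, e.g. "tab 0: 2/4 (50%)".
function formatProgress(resultNumber: number, totalNumber: number, internalPageIndex: number): string {
    const pct = Math.round((resultNumber / totalNumber) * 100);
    return `tab ${internalPageIndex}: ${resultNumber}/${totalNumber} (${pct}%)`;
}

// Wire it into the configuration:
const configurations = {
    onProgress: (done: number, total: number, tab: number): void => {
        console.log(formatProgress(done, total, tab));
    },
};
```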

Data extraction

onEvaluateForEachUrl

```typescript
onEvaluateForEachUrl?: {
    [key: string]: string | (() => any);
};
```

Define named extractors that run against each page. Each key becomes a field in PageResult.evaluates.

Two value types:

| Type | Behavior | Example |
| --- | --- | --- |
| `string` | CSS selector: returns `textContent` of the first match | `'h1'`, `'.price'`, `'#title'` |
| `() => any` | Function: runs via `page.evaluate()` in the browser context | `() => document.title` |

```typescript
onEvaluateForEachUrl: {
    heading: 'h1',                              // CSS selector
    price: '.product-price',                    // CSS selector
    linkCount: () => document.links.length,     // function
}
```

PageResult

```typescript
interface PageResult {
    url: string;
    pageData: any;
    evaluates: null | { [k: string]: any };
    error?: string | null;
}
```

| Field | Type | Description |
| --- | --- | --- |
| `url` | `string` | The URL that was scraped |
| `pageData` | `any` | Full HTML content, or the return value of `onPageLoadedForEachUrl` |
| `evaluates` | `object \| null` | Results of `onEvaluateForEachUrl` extractors, keyed by name |
| `error` | `string \| undefined` | Error message if the URL failed (only when `continueOnError: true`) |
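When running with `continueOnError: true`, a small helper can separate successes from failures (the helper is ours, not part of the library; the interface is repeated so the snippet stands alone):

```typescript
interface PageResult {
    url: string;
    pageData: any;
    evaluates: null | { [k: string]: any };
    error?: string | null;
}

// Split results into successes and failures for post-processing.
function partitionResults(results: PageResult[]): { ok: PageResult[]; failed: PageResult[] } {
    return {
        ok: results.filter(r => !r.error),
        failed: results.filter(r => !!r.error),
    };
}
```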

initBrowser(configuration)

Lower-level export for managing browser instances manually.

```typescript
import { initBrowser } from 'html-data-scraper';

const browser = await initBrowser({
    playwrightOptions: {
        launch: { headless: false },
    },
});
```

Returns a cached Browser instance: calling it again with the same configuration returns the same instance, while calling it with a different configuration creates a new one.
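The instance can then be handed back to `htmlDataScraper` via its `customBrowser` parameter, so repeated runs share one browser (a sketch; the URLs are placeholders):

```typescript
import { htmlDataScraper, initBrowser } from 'html-data-scraper';

const browser = await initBrowser({});

// Both runs reuse the same browser instance instead of launching twice.
const first = await htmlDataScraper(['https://example.com/'], null, browser);
const second = await htmlDataScraper(['https://example.org/'], null, browser);

await browser.close();
```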


Exported types

All types are exported from the main module:

```typescript
import type {
    CustomConfigurations,
    StealthOptions,
    ResilienceOptions,
    EvaluateValue,
    OnProgressFunction,
    OnPageLoadedFunction,
    OnRouteHandler,
    PageResult,
} from 'html-data-scraper';
```

Full examples

For complete, runnable projects that demonstrate the API in practice, see the examples/ folder:

  • Wikipedia Scraper -- uses onEvaluateForEachUrl (mixed selectors and functions), onRoute, onProgress, and maxSimultaneousPages
  • News Monitor -- uses stealth, resilience (rate limiting, retries, continueOnError), onPageLoadedForEachUrl, onRoute, and beforeGoToUrl