
API Reference

htmlDataScraper(urls, configurations?, customBrowser?)

The main entry point for scraping.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `urls` | `string[]` | required | Array of URLs to scrape |
| `configurations` | `CustomConfigurations \| null` | `null` | Configuration options (see below) |
| `customBrowser` | `Browser \| null` | `null` | A Playwright `Browser` instance to reuse |

Returns

```typescript
Promise<{ results: PageResult[]; browserInstance: Browser }>
```

  • results -- an array of PageResult objects, one per URL, in the same order as the input
  • browserInstance -- the Playwright Browser instance used (close it when done)
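Putting the pieces together, a minimal call might look like this (the URLs are placeholders, and the defaults described below are assumed):

```typescript
import { htmlDataScraper } from 'html-data-scraper';

// Scrape two pages with the default configuration.
const { results, browserInstance } = await htmlDataScraper([
    'https://example.com/',
    'https://example.org/',
]);

for (const result of results) {
    // With no hooks configured, pageData holds the full HTML string.
    console.log(result.url, typeof result.pageData);
}

// The browser is returned so it can be reused; close it when done.
await browserInstance.close();
```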

Errors

| Error message | Cause |
| --- | --- |
| `urlListIsEmpty` | Empty `urls` array was provided |
| `maxSimultaneousPagesNotSet` | `maxSimultaneousPages` was explicitly set to `0` or a falsy value |

CustomConfigurations

All fields are optional.

Core options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxSimultaneousPages` | `number` | CPU cores - 1 | Number of concurrent browser tabs |
| `additionalWaitSeconds` | `number` | `1` | Extra wait time after page load (seconds) |

Playwright options

```typescript
playwrightOptions?: {
    launch?: LaunchOptions;
    context?: BrowserContextOptions;
    navigationWaitUntil?: 'load' | 'domcontentloaded' | 'networkidle' | 'commit';
}
```

| Option | Default | Description |
| --- | --- | --- |
| `launch` | `{ headless: true }` | Passed to `chromium.launch()`. See Playwright `LaunchOptions`. |
| `context` | `{}` | Passed to `browser.newContext()`. See Playwright `BrowserContextOptions`. |
| `navigationWaitUntil` | `'networkidle'` | When to consider navigation complete: `'load'`, `'domcontentloaded'`, `'networkidle'`, or `'commit'`. |
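For example, a configuration that runs a headed browser with a custom locale and a faster navigation condition could look like this (the option values are illustrative):

```typescript
const configurations = {
    playwrightOptions: {
        launch: { headless: false },              // show the browser window
        context: { locale: 'en-US' },             // forwarded to browser.newContext()
        navigationWaitUntil: 'domcontentloaded',  // don't wait for network idle
    },
};
```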

Stealth options

See Stealth documentation for details.

```typescript
stealth?: StealthOptions;
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | `boolean` | `true` | Enable stealth features |
| `randomizeUserAgent` | `boolean` | `true` | Use a random user agent per context |
| `randomizeViewport` | `boolean` | `true` | Use a random common viewport size |
| `humanLikeDelays` | `boolean` | `true` | Add random delays between navigations |
| `userAgents` | `string[]` | built-in list | Custom user agent strings to pick from |
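A configuration that keeps stealth on but trades human-like pacing for speed and supplies its own user-agent pool might look like this sketch (the user-agent string is just an example value):

```typescript
const configurations = {
    stealth: {
        enabled: true,
        randomizeUserAgent: true,   // pick from the list below per context
        humanLikeDelays: false,     // skip random delays between navigations
        userAgents: [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        ],
    },
};
```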

Resilience options

See Resilience documentation for details.

```typescript
resilience?: ResilienceOptions;
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `retries` | `number` | `2` | Number of retry attempts per URL |
| `retryDelayMs` | `number` | `1000` | Base delay for exponential backoff (ms) |
| `timeoutMs` | `number` | `30000` | Navigation timeout per page (ms) |
| `continueOnError` | `boolean` | `true` | Collect errors instead of throwing |
| `rateLimitPerSecond` | `number` | `0` | Max requests per second (0 = unlimited) |
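A resilience configuration tuned for a slow, flaky target might look like this (the numbers are illustrative, not recommendations):

```typescript
const configurations = {
    resilience: {
        retries: 3,               // up to 3 retry attempts per URL
        retryDelayMs: 500,        // base delay for exponential backoff
        timeoutMs: 15000,         // fail a navigation after 15s
        continueOnError: true,    // record failures in PageResult.error
        rateLimitPerSecond: 2,    // be polite: at most 2 requests per second
    },
};
```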

Hooks

beforeGoToUrl

```typescript
beforeGoToUrl?: (page: Page) => Promise<void>;
```

Called before each URL is loaded. Use it to configure the page (e.g., set viewport, inject cookies).
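For instance, a hook that injects a session cookie before navigation could be sketched as follows (the cookie name, value, and domain are hypothetical):

```typescript
import type { Page } from 'playwright';

const configurations = {
    beforeGoToUrl: async (page: Page): Promise<void> => {
        // Inject a (hypothetical) session cookie into the page's context.
        await page.context().addCookies([
            { name: 'session', value: 'abc123', domain: 'example.com', path: '/' },
        ]);
    },
};
```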

onRoute

```typescript
onRoute?: (route: Route, request: Request) => void;
```

Called for every network request the page makes. Use it to block resources, modify headers, or mock responses. Replaces Puppeteer's setRequestInterception pattern.

You must call exactly one of `route.continue()`, `route.abort()`, or `route.fulfill()` for each intercepted request.

onPageLoadedForEachUrl

```typescript
onPageLoadedForEachUrl?: (page: Page, currentUrl: string) => any;
```

Called after each page finishes loading. The return value is stored in PageResult.pageData. If not set, pageData defaults to the full HTML content via page.content().
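For example, to store a small summary object instead of the full HTML, a hook might look like this sketch:

```typescript
import type { Page } from 'playwright';

const configurations = {
    onPageLoadedForEachUrl: async (page: Page, currentUrl: string) => {
        const html = await page.content();
        // The returned object becomes PageResult.pageData for this URL.
        return {
            url: currentUrl,
            title: await page.title(),
            htmlLength: html.length,
        };
    },
};
```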

onProgress

```typescript
onProgress?: (resultNumber: number, totalNumber: number, internalPageIndex: number) => void;
```

Called after each URL finishes processing within a page tab.

  • resultNumber -- number of URLs processed so far in this tab (1-indexed)
  • totalNumber -- total URLs assigned to this tab
  • internalPageIndex -- the tab index (0-indexed)
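A simple per-tab progress logger could be built on these arguments like this (the `formatProgress` helper is ours, not part of the API):

```typescript
// Format a per-tab progress line, e.g. "tab 0: 2/4 (50%)".
function formatProgress(resultNumber: number, totalNumber: number, internalPageIndex: number): string {
    const pct = Math.round((resultNumber / totalNumber) * 100);
    return `tab ${internalPageIndex}: ${resultNumber}/${totalNumber} (${pct}%)`;
}

// Wire it into the configuration:
const configurations = {
    onProgress: (done: number, total: number, tab: number): void => {
        console.log(formatProgress(done, total, tab));
    },
};
```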

Data extraction

onEvaluateForEachUrl

```typescript
onEvaluateForEachUrl?: {
    [key: string]: string | (() => any);
};
```

Define named extractors that run against each page. Each key becomes a field in PageResult.evaluates.

Two value types:

| Type | Behavior | Example |
| --- | --- | --- |
| `string` | CSS selector: returns `textContent` of the first match | `'h1'`, `'.price'`, `'#title'` |
| `() => any` | Function: runs via `page.evaluate()` in the browser context | `() => document.title` |

```typescript
onEvaluateForEachUrl: {
    heading: 'h1',                              // CSS selector
    price: '.product-price',                    // CSS selector
    linkCount: () => document.links.length,     // function
}
```

PageResult

```typescript
interface PageResult {
    url: string;
    pageData: any;
    evaluates: null | { [k: string]: any };
    error?: string | null;
}
```

| Field | Type | Description |
| --- | --- | --- |
| `url` | `string` | The URL that was scraped |
| `pageData` | `any` | Full HTML content, or the return value of `onPageLoadedForEachUrl` |
| `evaluates` | `object \| null` | Results of `onEvaluateForEachUrl` extractors, keyed by name |
| `error` | `string \| undefined` | Error message if the URL failed (only when `continueOnError: true`) |
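When running with `continueOnError: true`, a small helper can separate successes from failures (the helper is ours, not part of the library; the interface is repeated so the snippet stands alone):

```typescript
interface PageResult {
    url: string;
    pageData: any;
    evaluates: null | { [k: string]: any };
    error?: string | null;
}

// Split results into successes and failures for post-processing.
function partitionResults(results: PageResult[]): { ok: PageResult[]; failed: PageResult[] } {
    return {
        ok: results.filter(r => !r.error),
        failed: results.filter(r => !!r.error),
    };
}
```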

initBrowser(configuration)

Lower-level export for managing browser instances manually.

```typescript
import { initBrowser } from 'html-data-scraper';

const browser = await initBrowser({
    playwrightOptions: {
        launch: { headless: false },
    },
});
```

Returns a cached Browser instance: calling it again with the same configuration returns the same instance, while calling it with a different configuration creates a new one.
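The instance can then be handed back to `htmlDataScraper` via its `customBrowser` parameter, so repeated runs share one browser (a sketch; the URLs are placeholders):

```typescript
import { htmlDataScraper, initBrowser } from 'html-data-scraper';

const browser = await initBrowser({});

// Both runs reuse the same browser instance instead of launching twice.
const first = await htmlDataScraper(['https://example.com/'], null, browser);
const second = await htmlDataScraper(['https://example.org/'], null, browser);

await browser.close();
```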


Exported types

All types are exported from the main module:

```typescript
import type {
    CustomConfigurations,
    StealthOptions,
    ResilienceOptions,
    EvaluateValue,
    OnProgressFunction,
    OnPageLoadedFunction,
    OnRouteHandler,
    PageResult,
} from 'html-data-scraper';
```

Full examples

For complete, runnable projects that demonstrate the API in practice, see the examples/ folder:

  • Wikipedia Scraper -- uses onEvaluateForEachUrl (mixed selectors and functions), onRoute, onProgress, and maxSimultaneousPages
  • News Monitor -- uses stealth, resilience (rate limiting, retries, continueOnError), onPageLoadedForEachUrl, onRoute, and beforeGoToUrl