The main entry point for scraping.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `urls` | `string[]` | required | Array of URLs to scrape |
| `configurations` | `CustomConfigurations \| null` | `null` | Configuration options (see below) |
| `customBrowser` | `Browser \| null` | `null` | A Playwright `Browser` instance to reuse |
Returns `Promise<{ results: PageResult[]; browserInstance: Browser }>`.

- `results` -- an array of `PageResult` objects, one per URL, in the same order as the input
- `browserInstance` -- the Playwright `Browser` instance used (close it when done)
| Error message | Cause |
|---|---|
| `urlListIsEmpty` | An empty `urls` array was provided |
| `maxSimultaneousPagesNotSet` | `maxSimultaneousPages` was explicitly set to `0` or another falsy value |
All fields are optional.
| Option | Type | Default | Description |
|---|---|---|---|
| `maxSimultaneousPages` | `number` | CPU cores - 1 | Number of concurrent browser tabs |
| `additionalWaitSeconds` | `number` | `1` | Extra wait time after page load (seconds) |
```typescript
playwrightOptions?: {
  launch?: LaunchOptions;
  context?: BrowserContextOptions;
  navigationWaitUntil?: 'load' | 'domcontentloaded' | 'networkidle' | 'commit';
}
```

| Option | Default | Description |
|---|---|---|
| `launch` | `{ headless: true }` | Passed to `chromium.launch()`. See Playwright `LaunchOptions`. |
| `context` | `{}` | Passed to `browser.newContext()`. See Playwright `BrowserContextOptions`. |
| `navigationWaitUntil` | `'networkidle'` | When to consider navigation complete. Options: `'load'`, `'domcontentloaded'`, `'networkidle'`, `'commit'`. |
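As a sketch, a `configurations` object tuning only the Playwright layer might look like the following. The option names come from the table above; the specific values are illustrative, not recommendations:

```typescript
// Illustrative playwrightOptions block: headful debugging, a fixed
// locale/viewport context, and an earlier navigation-complete signal.
const configurations = {
  playwrightOptions: {
    // Passed to chromium.launch(); default is { headless: true }
    launch: { headless: false },
    // Passed to browser.newContext()
    context: { locale: 'en-US', viewport: { width: 1280, height: 720 } },
    // Resolve navigation at DOM parse time instead of waiting for the
    // network to go idle (the 'networkidle' default)
    navigationWaitUntil: 'domcontentloaded' as const,
  },
};
```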
```typescript
stealth?: StealthOptions;
```

See the Stealth documentation for details.

| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | `boolean` | `true` | Enable stealth features |
| `randomizeUserAgent` | `boolean` | `true` | Use a random user agent per context |
| `randomizeViewport` | `boolean` | `true` | Use a random common viewport size |
| `humanLikeDelays` | `boolean` | `true` | Add random delays between navigations |
| `userAgents` | `string[]` | built-in list | Custom user agent strings to pick from |
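For example, a stealth block that keeps a fixed viewport but rotates through a custom user-agent pool could be sketched like this (option names from the table above; the user-agent strings are illustrative):

```typescript
// Illustrative stealth block: disable viewport randomization and
// supply a small custom user-agent pool instead of the built-in list.
const stealth = {
  enabled: true,
  randomizeUserAgent: true,
  randomizeViewport: false,
  humanLikeDelays: true,
  userAgents: [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36',
  ],
};
```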
```typescript
resilience?: ResilienceOptions;
```

See the Resilience documentation for details.

| Option | Type | Default | Description |
|---|---|---|---|
| `retries` | `number` | `2` | Number of retry attempts per URL |
| `retryDelayMs` | `number` | `1000` | Base delay for exponential backoff (ms) |
| `timeoutMs` | `number` | `30000` | Navigation timeout per page (ms) |
| `continueOnError` | `boolean` | `true` | Collect errors instead of throwing |
| `rateLimitPerSecond` | `number` | `0` | Max requests per second (`0` = unlimited) |
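A hedged sketch of a "polite crawl" resilience profile, using the option names from the table above (the values are illustrative):

```typescript
// Illustrative resilience block for a large, rate-limited crawl.
const resilience = {
  retries: 3,            // retry each failing URL up to 3 times
  retryDelayMs: 500,     // base delay for the exponential backoff
  timeoutMs: 15000,      // give up on a navigation after 15 s
  continueOnError: true, // record failures in PageResult.error instead of throwing
  rateLimitPerSecond: 2, // at most 2 requests per second overall
};
```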
```typescript
beforeGoToUrl?: (page: Page) => Promise<void>;
```

Called before each URL is loaded. Use it to configure the page (e.g., set viewport, inject cookies).
```typescript
onRoute?: (route: Route, request: Request) => void;
```

Called for every network request the page makes. Use it to block resources, modify headers, or mock responses. Replaces Puppeteer's `setRequestInterception` pattern.

You must call `route.continue()`, `route.abort()`, or `route.fulfill()` for each request.
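A sketch of a handler that aborts heavy resources and forwards everything else. The two structural types below stand in for Playwright's `Route` and `Request` so the snippet is self-contained; `request.resourceType()`, `route.abort()`, and `route.continue()` mirror the real Playwright methods of the same names:

```typescript
// Stand-ins for Playwright's Route and Request, reduced to the
// members this handler touches.
type MinimalRequest = { resourceType: () => string };
type MinimalRoute = { abort: () => void; continue: () => void };

const BLOCKED = new Set(['image', 'font', 'media']);

function onRoute(route: MinimalRoute, request: MinimalRequest): void {
  if (BLOCKED.has(request.resourceType())) {
    route.abort();    // drop images, fonts, and media streams
  } else {
    route.continue(); // forward everything else unchanged
  }
}
```

Blocking images and fonts this way is a common trick to cut bandwidth and speed up large crawls.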
```typescript
onPageLoadedForEachUrl?: (page: Page, currentUrl: string) => any;
```

Called after each page finishes loading. The return value is stored in `PageResult.pageData`. If not set, `pageData` defaults to the full HTML content via `page.content()`.
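For instance, a callback that stores a compact summary instead of the full HTML could be sketched as follows (`MinimalPage` is a stand-in for Playwright's `Page`, mirroring its real `title()` and `content()` methods, so the snippet is self-contained):

```typescript
type MinimalPage = {
  title: () => Promise<string>;
  content: () => Promise<string>;
};

// The returned object becomes PageResult.pageData for that URL.
async function onPageLoadedForEachUrl(page: MinimalPage, currentUrl: string) {
  const title = await page.title();
  const html = await page.content();
  return { url: currentUrl, title, htmlLength: html.length };
}
```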
```typescript
onProgress?: (resultNumber: number, totalNumber: number, internalPageIndex: number) => void;
```

Called after each URL finishes processing within a page tab.

- `resultNumber` -- number of URLs processed so far in this tab (1-indexed)
- `totalNumber` -- total URLs assigned to this tab
- `internalPageIndex` -- the tab index (0-indexed)
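A sketch of a progress reporter built on these three arguments (`formatProgress` is a local helper, not part of the library):

```typescript
// Local helper: render one progress line per finished URL.
function formatProgress(
  resultNumber: number,
  totalNumber: number,
  internalPageIndex: number,
): string {
  const pct = Math.round((resultNumber / totalNumber) * 100);
  return `tab ${internalPageIndex}: ${resultNumber}/${totalNumber} (${pct}%)`;
}

// Wired into the configuration:
const onProgress = (done: number, total: number, tab: number): void => {
  console.log(formatProgress(done, total, tab));
};
```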
```typescript
onEvaluateForEachUrl?: {
  [key: string]: string | (() => any);
};
```

Define named extractors that run against each page. Each key becomes a field in `PageResult.evaluates`.

Two value types:

| Type | Behavior | Example |
|---|---|---|
| `string` | CSS selector -- returns `textContent` of the first match | `'h1'`, `'.price'`, `'#title'` |
| `() => any` | Function -- runs via `page.evaluate()` in the browser context | `() => document.title` |

```typescript
onEvaluateForEachUrl: {
  heading: 'h1',                          // CSS selector
  price: '.product-price',                // CSS selector
  linkCount: () => document.links.length, // function
}
```

```typescript
interface PageResult {
  url: string;
  pageData: any;
  evaluates: null | { [k: string]: any };
  error?: string | null;
}
```

| Field | Type | Description |
|---|---|---|
| `url` | `string` | The URL that was scraped |
| `pageData` | `any` | Full HTML content, or the return value of `onPageLoadedForEachUrl` |
| `evaluates` | `object \| null` | Results of `onEvaluateForEachUrl` extractors, keyed by name |
| `error` | `string \| undefined` | Error message if the URL failed (only when `continueOnError: true`) |
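With `continueOnError: true`, it is often useful to split a finished run into successes and failures. A minimal sketch over the `PageResult` shape above (`partitionResults` is a hypothetical helper, not a library export):

```typescript
interface PageResult {
  url: string;
  pageData: any;
  evaluates: null | { [k: string]: any };
  error?: string | null;
}

// Hypothetical helper: separate clean results from failed URLs.
function partitionResults(results: PageResult[]) {
  const ok = results.filter((r) => !r.error);
  const failed = results.filter((r) => Boolean(r.error));
  return { ok, failed };
}
```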
Lower-level export for managing browser instances manually.

```typescript
import { initBrowser } from 'html-data-scraper';

const browser = await initBrowser({
  playwrightOptions: {
    launch: { headless: false },
  },
});
```

Returns a cached `Browser` instance. Calling again with the same config returns the same instance; calling with a different config creates a new one.
All types are exported from the main module:

```typescript
import type {
  CustomConfigurations,
  StealthOptions,
  ResilienceOptions,
  EvaluateValue,
  OnProgressFunction,
  OnPageLoadedFunction,
  OnRouteHandler,
  PageResult,
} from 'html-data-scraper';
```

For complete, runnable projects that demonstrate the API in practice, see the `examples/` folder:

- Wikipedia Scraper -- uses `onEvaluateForEachUrl` (mixed selectors and functions), `onRoute`, `onProgress`, and `maxSimultaneousPages`
- News Monitor -- uses `stealth`, `resilience` (rate limiting, retries, `continueOnError`), `onPageLoadedForEachUrl`, `onRoute`, and `beforeGoToUrl`