A TypeScript-based web scraper that fetches page HTML content using Puppeteer with stealth mode. Outputs scraped pages to a dataset for downstream use (e.g. AI search, indexing).
- Puppeteer + Stealth — Uses `puppeteer-extra` with the stealth plugin to reduce bot detection
- Parallel scraping — Concurrent requests (up to 10) via `p-limit`
- Configurable targets — Define URLs to scrape in `resources.ts`
- Dataset output — Saves HTML files to `dataset/pageContent/` and an index at `dataset/index.json`
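The parallel-scraping feature can be pictured with a small hand-rolled limiter. The project itself uses `p-limit`; the `limit` helper below is only an illustration of the pattern it provides (at most N tasks in flight, the rest queued), not code from this repo.

```typescript
// Minimal concurrency limiter sketching what p-limit does:
// at most `max` tasks run at once; the rest wait their turn.
function limit(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  return async <T>(task: () => Promise<T>): Promise<T> => {
    if (active >= max) await new Promise<void>((resolve) => queue.push(resolve));
    active++;
    try {
      return await task();
    } finally {
      active--;
      queue.shift()?.(); // wake one waiting task, if any
    }
  };
}

// Demo: start 25 tasks, but never let more than 10 run concurrently.
const run = limit(10);
let inFlight = 0;
let peak = 0;
const tasks = Array.from({ length: 25 }, (_, i) =>
  run(async () => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await new Promise((r) => setTimeout(r, 5)); // stand-in for a page fetch
    inFlight--;
    return i;
  })
);
const results = await Promise.all(tasks);
console.log(peak, results.length); // peak stays within the cap of 10
```

In the real script, the task passed to the limiter would be the Puppeteer page fetch for one resource.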
```
├── resources.ts        # URLs to scrape (name + url)
├── scripts/
│   └── scrape.ts       # Main scraping script
├── dataset/
│   ├── index.json      # Metadata + file paths for each scraped page
│   └── pageContent/    # HTML files (one per resource)
└── package.json
```
```
pnpm install
pnpm add-chrome   # Install Chromium for Puppeteer
pnpm scrape
```

The script will:

- Read URLs from `resources.ts`
- Fetch each page's HTML
- Save HTML to `dataset/pageContent/<name>.html`
- Write `dataset/index.json` with metadata and file paths
Failed pages are recorded with `pageContent: null` and metadata only.
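The exact `index.json` schema isn't spelled out above. A plausible entry shape, consistent with the success/failure behavior described (field names and the `toEntry` helper are hypothetical, not taken from the actual script), might look like:

```typescript
// Hypothetical shape of one dataset/index.json entry.
interface IndexEntry {
  name: string;
  url: string;
  scrapedAt: string;          // ISO timestamp of the scrape attempt
  pageContent: string | null; // relative path to the HTML file, or null on failure
}

// Illustrative helper: build an entry, passing null when the fetch failed.
function toEntry(name: string, url: string, filePath: string | null): IndexEntry {
  return {
    name,
    url,
    scrapedAt: new Date().toISOString(),
    pageContent: filePath, // null => metadata only, no HTML saved
  };
}

const ok = toEntry("Google", "https://www.google.com", "dataset/pageContent/Google.html");
const failed = toEntry("Broken", "https://example.invalid", null);
console.log(ok.pageContent, failed.pageContent);
```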
Edit `resources.ts` to add or change URLs:

```typescript
export const resources = [
  { name: "Google", url: "https://www.google.com" },
  { name: "YouTube", url: "https://www.youtube.com" },
];
```

- TypeScript — `tsx` for execution
- Puppeteer — Headless Chrome automation
- puppeteer-extra-plugin-stealth — Evade common bot detection
- p-limit — Concurrency control