akashkmt/web-scraping

Web Scraping

A TypeScript-based web scraper that fetches page HTML content using Puppeteer with stealth mode. Outputs scraped pages to a dataset for downstream use (e.g. AI search, indexing).

Features

  • Puppeteer + Stealth — Uses puppeteer-extra with the stealth plugin to reduce bot detection
  • Parallel scraping — Concurrent requests (up to 10) via p-limit
  • Configurable targets — Define URLs to scrape in resources.ts
  • Dataset output — Saves HTML files to dataset/pageContent/ and an index at dataset/index.json
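
The concurrency cap works the way p-limit's limiter does. Below is a self-contained sketch of that pattern (hand-rolled here so the example needs no dependencies; the actual script would call pLimit(10) from the p-limit package):

```typescript
// Minimal limiter illustrating what p-limit does: at most `max`
// tasks run at once; the rest wait in a FIFO queue.
function limit(max: number) {
  let active = 0;
  const queue: (() => void)[] = [];
  const next = () => {
    active--;
    queue.shift()?.();
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        task().then(resolve, reject).finally(next);
      };
      active < max ? run() : queue.push(run);
    });
}

// Hypothetical usage mirroring the scraper: map every resource
// through the limiter so no more than 10 fetches run at a time.
const limit10 = limit(10);
const demoResources = [{ name: "Google", url: "https://www.google.com" }];
const demoResults = Promise.all(
  demoResources.map((r) => limit10(async () => ({ name: r.name, ok: true })))
);
```

With the real library, `limit(10)` is simply replaced by `pLimit(10)`; the call shape is the same.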

Project Structure

├── resources.ts       # URLs to scrape (name + url)
├── scripts/
│   └── scrape.ts      # Main scraping script
├── dataset/
│   ├── index.json     # Metadata + file paths for each scraped page
│   └── pageContent/   # HTML files (one per resource)
└── package.json

Setup

pnpm install
pnpm add-chrome   # Install Chromium for Puppeteer

Usage

pnpm scrape

The script will:

  1. Read URLs from resources.ts
  2. Fetch each page’s HTML
  3. Save HTML to dataset/pageContent/<name>.html
  4. Write dataset/index.json with metadata and file paths
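
The four steps above can be sketched as follows. This is a simplified illustration, not the script itself: the Puppeteer fetch is stubbed with a placeholder so the sketch is self-contained, and the index-entry field names (beyond pageContent) are assumptions rather than the script's actual schema:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

type Resource = { name: string; url: string };
type IndexEntry = { name: string; url: string; pageContent: string | null };

// Stub for the Puppeteer step: the real script opens the page with
// puppeteer-extra (stealth plugin enabled) and returns page.content().
async function fetchHtml(url: string): Promise<string> {
  return `<html><body>stub for ${url}</body></html>`;
}

async function scrape(resources: Resource[], outDir: string): Promise<IndexEntry[]> {
  mkdirSync(join(outDir, "pageContent"), { recursive: true });
  const index: IndexEntry[] = [];
  for (const r of resources) {
    try {
      const html = await fetchHtml(r.url);
      const file = join(outDir, "pageContent", `${r.name}.html`);
      writeFileSync(file, html);
      index.push({ name: r.name, url: r.url, pageContent: file });
    } catch {
      // Failed pages keep their metadata but record no content.
      index.push({ name: r.name, url: r.url, pageContent: null });
    }
  }
  writeFileSync(join(outDir, "index.json"), JSON.stringify(index, null, 2));
  return index;
}
```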

Failed pages are recorded with pageContent: null and metadata only.
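
Purely for illustration (field names other than pageContent are assumptions), a successful entry and a failed entry in dataset/index.json might look like:

```json
[
  {
    "name": "Google",
    "url": "https://www.google.com",
    "pageContent": "dataset/pageContent/Google.html"
  },
  {
    "name": "YouTube",
    "url": "https://www.youtube.com",
    "pageContent": null
  }
]
```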

Configuration

Edit resources.ts to add or change URLs:

export const resources = [
  { name: "Google", url: "https://www.google.com" },
  { name: "YouTube", url: "https://www.youtube.com" },
];

Tech Stack

  • TypeScript — tsx for execution
  • Puppeteer — Headless Chrome automation
  • puppeteer-extra-plugin-stealth — Evade common bot detection
  • p-limit — Concurrency control
