akashkmt/web-scraping

Web Scraping

A TypeScript-based web scraper that fetches page HTML content using Puppeteer with stealth mode. Outputs scraped pages to a dataset for downstream use (e.g. AI search, indexing).

Features

  • Puppeteer + Stealth — Uses puppeteer-extra with the stealth plugin to reduce bot detection
  • Parallel scraping — Concurrent requests (up to 10) via p-limit
  • Configurable targets — Define URLs to scrape in resources.ts
  • Dataset output — Saves HTML files to dataset/pageContent/ and an index at dataset/index.json
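
The concurrency cap works the way p-limit's limiter does. Below is a self-contained sketch of that pattern (hand-rolled here so the example needs no dependencies; the actual script would call pLimit(10) from the p-limit package):

```typescript
// Minimal limiter illustrating what p-limit does: at most `max`
// tasks run at once; the rest wait in a FIFO queue.
function limit(max: number) {
  let active = 0;
  const queue: (() => void)[] = [];
  const next = () => {
    active--;
    queue.shift()?.();
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        task().then(resolve, reject).finally(next);
      };
      active < max ? run() : queue.push(run);
    });
}

// Hypothetical usage mirroring the scraper: map every resource
// through the limiter so no more than 10 fetches run at a time.
const limit10 = limit(10);
const demoResources = [{ name: "Google", url: "https://www.google.com" }];
const demoResults = Promise.all(
  demoResources.map((r) => limit10(async () => ({ name: r.name, ok: true })))
);
```

With the real library, `limit(10)` is simply replaced by `pLimit(10)`; the call shape is the same.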

Project Structure

├── resources.ts       # URLs to scrape (name + url)
├── scripts/
│   └── scrape.ts      # Main scraping script
├── dataset/
│   ├── index.json     # Metadata + file paths for each scraped page
│   └── pageContent/   # HTML files (one per resource)
└── package.json

Setup

pnpm install
pnpm add-chrome   # Install Chromium for Puppeteer

Usage

pnpm scrape

The script will:

  1. Read URLs from resources.ts
  2. Fetch each page’s HTML
  3. Save HTML to dataset/pageContent/<name>.html
  4. Write dataset/index.json with metadata and file paths
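
The four steps above can be sketched as follows. This is a simplified illustration, not the script itself: the Puppeteer fetch is stubbed with a placeholder so the sketch is self-contained, and the index-entry field names (beyond pageContent) are assumptions rather than the script's actual schema:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

type Resource = { name: string; url: string };
type IndexEntry = { name: string; url: string; pageContent: string | null };

// Stub for the Puppeteer step: the real script opens the page with
// puppeteer-extra (stealth plugin enabled) and returns page.content().
async function fetchHtml(url: string): Promise<string> {
  return `<html><body>stub for ${url}</body></html>`;
}

async function scrape(resources: Resource[], outDir: string): Promise<IndexEntry[]> {
  mkdirSync(join(outDir, "pageContent"), { recursive: true });
  const index: IndexEntry[] = [];
  for (const r of resources) {
    try {
      const html = await fetchHtml(r.url);
      const file = join(outDir, "pageContent", `${r.name}.html`);
      writeFileSync(file, html);
      index.push({ name: r.name, url: r.url, pageContent: file });
    } catch {
      // Failed pages keep their metadata but record no content.
      index.push({ name: r.name, url: r.url, pageContent: null });
    }
  }
  writeFileSync(join(outDir, "index.json"), JSON.stringify(index, null, 2));
  return index;
}
```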

Failed pages are recorded with pageContent: null and metadata only.
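
Purely for illustration (field names other than pageContent are assumptions), a successful entry and a failed entry in dataset/index.json might look like:

```json
[
  {
    "name": "Google",
    "url": "https://www.google.com",
    "pageContent": "dataset/pageContent/Google.html"
  },
  {
    "name": "YouTube",
    "url": "https://www.youtube.com",
    "pageContent": null
  }
]
```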

Configuration

Edit resources.ts to add or change URLs:

export const resources = [
  { name: "Google", url: "https://www.google.com" },
  { name: "YouTube", url: "https://www.youtube.com" },
];

Tech Stack

  • TypeScript — tsx for execution
  • Puppeteer — Headless Chrome automation
  • puppeteer-extra-plugin-stealth — Evade common bot detection
  • p-limit — Concurrency control
