🔍 Web Scraper Pro

A modern full-stack web scraping application built with Next.js, designed to extract, analyze, and export website data effortlessly.
It offers server-side HTML parsing, intelligent data extraction, and multi-format export capabilities — all within a beautiful and responsive interface.



✨ Features

  • 🌐 Universal Web Scraping – Extract structured data from any publicly accessible website
  • 📊 Structured Data Extraction – Parse headings, paragraphs, links, images, tables, and metadata automatically
  • 💾 Multi-Format Export – Download data as JSON, CSV, or Excel files
  • 🎯 Intelligent URL Resolution – Automatically converts relative URLs into absolute paths (see the sketch after this list)
  • ⚡ Real-Time Processing – Instant feedback with progress indicators and loading states
  • 🎨 Modern UI – Responsive, minimal, and dark-mode ready (built with TailwindCSS)
  • 🛡️ Ethical Scraping – Built-in rate limiting and User-Agent rotation
  • 📱 Mobile Friendly – Works seamlessly on all screen sizes
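
As a quick illustration of the URL resolution feature above, relative links and image paths can be normalized with the built-in URL constructor. This is a minimal sketch under assumed names, not the project's actual implementation:

// Illustrative helper: resolve a possibly-relative href against the page URL.
// The function name is an assumption, not taken from the repository.
function toAbsoluteUrl(href: string, baseUrl: string): string {
  try {
    return new URL(href, baseUrl).toString();
  } catch {
    return href; // leave malformed values untouched
  }
}

// toAbsoluteUrl("/about", "https://example.com/blog") -> "https://example.com/about"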

🛠️ Tech Stack

Frontend

  • Next.js 14 – React framework with App Router
  • TypeScript – Type-safe development
  • TailwindCSS – Utility-first CSS framework
  • Lucide React – Icon system
  • Shadcn/ui – Reusable UI components

Backend

  • Next.js API Routes – Serverless endpoints
  • Cheerio – Fast HTML parser
  • XLSX – Excel file generator
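
To give a feel for how these pieces fit together, below is a rough sketch of a Cheerio-based App Router handler in the spirit of app/api/scrape/route.ts. It is a hedged outline, not the repository's actual code; the User-Agent string is an arbitrary example, and the fields mirror the API response documented further down.

// Hypothetical outline of app/api/scrape/route.ts (not the actual source).
import * as cheerio from "cheerio";

export async function POST(request: Request) {
  const { url } = await request.json();

  const res = await fetch(url, {
    headers: { "User-Agent": "WebScraperPro/1.0 (educational project)" },
  });
  if (!res.ok) {
    return Response.json(
      { error: `Failed to scrape website: HTTP ${res.status}` },
      { status: 500 }
    );
  }

  const $ = cheerio.load(await res.text());

  return Response.json({
    url,
    title: $("title").first().text(),
    description: $('meta[name="description"]').attr("content") ?? "",
    headings: $("h1, h2, h3").map((_, el) => $(el).text().trim()).get(),
    paragraphs: $("p").map((_, el) => $(el).text().trim()).get(),
    links: $("a[href]")
      .map((_, el) => ({ text: $(el).text().trim(), href: $(el).attr("href") }))
      .get(),
    images: $("img[src]").map((_, el) => $(el).attr("src")).get(),
  });
}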

🚀 Getting Started

Prerequisites

  • Node.js 18+
  • npm, pnpm, or yarn

Installation

# Clone repository
git clone https://github.com/yourusername/web-scraper-pro.git
cd web-scraper-pro

# Install dependencies
pnpm install
# or
npm install

# Run development server
pnpm run dev
# or
npm run dev

Then, open http://localhost:3000 in your browser.


📖 Usage

  1. Enter a website URL (e.g., https://example.com)
  2. Click "Scrape Website"
  3. Wait for completion and view organized data
  4. Export results as JSON, CSV, or Excel
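
For the Excel option, the export presumably goes through the xlsx package listed in the tech stack. A minimal sketch of turning one scraped table into a workbook buffer (the helper name and data shape are illustrative):

// Illustrative only: convert one scraped table into an .xlsx workbook buffer.
import * as XLSX from "xlsx";

interface ScrapedTable {
  headers: string[];
  rows: string[][];
}

function tableToXlsx(table: ScrapedTable): Buffer {
  const sheet = XLSX.utils.aoa_to_sheet([table.headers, ...table.rows]);
  const workbook = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(workbook, sheet, "Scraped Data");
  return XLSX.write(workbook, { type: "buffer", bookType: "xlsx" });
}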

🔧 API Documentation

POST /api/scrape

Scrapes a website and returns structured data.

Request Body

{ "url": "https://example.com" }

Response

{
  "url": "https://example.com",
  "title": "Example Domain",
  "description": "Example website description",
  "headings": ["Heading 1", "Heading 2"],
  "paragraphs": ["Paragraph text..."],
  "links": [{ "text": "Link text", "href": "https://example.com/link" }],
  "images": ["https://example.com/image.jpg"],
  "tables": [{ "headers": ["Col 1", "Col 2"], "rows": [["Data 1", "Data 2"]] }]
}
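
For reference, the response above corresponds to a TypeScript shape roughly like the following (the interface name is illustrative and the exact types in the codebase may differ):

// Illustrative response type derived from the example above.
interface ScrapeResult {
  url: string;
  title: string;
  description: string;
  headings: string[];
  paragraphs: string[];
  links: { text: string; href: string }[];
  images: string[];
  tables: { headers: string[]; rows: string[][] }[];
}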

Error Response

{ "error": "Failed to scrape website: HTTP 404" }

📂 Project Structure

web-scraper-pro/
├── app/
│   ├── api/
│   │   └── scrape/
│   │       └── route.ts        # API endpoint
│   ├── layout.tsx              # Root layout
│   └── page.tsx                # Home page
├── components/
│   ├── ui/                     # UI components
│   ├── data-display.tsx        # Data visualization
│   ├── footer.tsx              # Footer
│   └── url-form.tsx            # URL input form
├── lib/
│   └── utils.ts                # Utility functions
├── public/                     # Static assets
├── package.json
├── tailwind.config.ts
├── tsconfig.json
└── README.md

⚖️ Legal & Ethical Considerations

This project is for educational purposes only. Please follow ethical scraping practices:

✅ Scrape only public data
✅ Respect robots.txt and site Terms of Service
✅ Implement rate limiting
❌ Do not scrape personal/sensitive data
❌ Do not bypass authentication or paywalls
❌ Do not republish copyrighted content

Disclaimer: You are responsible for ensuring compliance with all applicable laws.


🔐 Best Practices Implemented

  • User-Agent headers for scraper requests (see the sketch below)
  • Graceful error handling
  • Configurable rate limiting
  • Content size protection
  • URL validation and normalization
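
The snippet below sketches how a couple of these points (an identifying User-Agent and content size protection) might look in a fetch wrapper. It is an assumption-based illustration rather than the project's actual implementation; the limits and header values are arbitrary.

// Illustrative fetch wrapper: identifying User-Agent plus a response size cap.
const MAX_CONTENT_BYTES = 5 * 1024 * 1024; // 5 MB cap (arbitrary example)

async function fetchHtml(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: { "User-Agent": "WebScraperPro/1.0 (educational project)" },
    signal: AbortSignal.timeout(10_000), // fail fast instead of hanging
  });
  if (!res.ok) throw new Error(`Failed to scrape website: HTTP ${res.status}`);

  const length = Number(res.headers.get("content-length") ?? 0);
  if (length > MAX_CONTENT_BYTES) throw new Error("Response too large");

  return res.text();
}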

📝 License

Licensed under the MIT License. See the LICENSE file for details.


👨‍💻 Author

Sandip Singha


🙏 Acknowledgments


If you found this project helpful, please give it a star!
