Skip to content

WebScraperAI is a powerful tool that enables users to perform question-answering on website content using web scraping and retrieval-augmented generation (RAG) with LlamaIndex. It supports multiple LLMs, including OpenAI GPT-3.5, GPT-4, Gemini Pro, Gemini Ultra, and DeepSeek.

Notifications You must be signed in to change notification settings

tejas-130704/WebScraperAI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WebScraperAI

Overview

WebScraperAI is a web-based tool that allows users to perform question-answering on a given website URL. It supports multiple LLMs and has two modes of operation:

Preview

Screenshot 2025-02-23 130842

Screenshot 2025-02-23 132439

Screenshot 2025-02-23 134013

Screenshot 2025-02-25 092021

Screenshot 2025-02-25 092727

Screenshot 2025-02-25 093013

  1. Page-Specific Q&A: Extracts information only from the given webpage.
  2. Deep Analysis Q&A: Extracts information from the given page and all its linked pages (use cautiously, as it may take a long time for large websites).

The project uses BeautifulSoup for web scraping and a RAG pipeline in LlamaIndex and HuggingFace to enhance response accuracy. Supported LLMs include:

  • OpenAI GPT-3.5
  • OpenAI GPT-4
  • Gemini Pro
  • Gemini Ultra
  • DeepSeek
  • Groq

Features

  • Extracts and analyzes website content for Q&A.
  • Offers two modes: specific page analysis and deep analysis.
  • Supports multiple LLMs for flexibility.
  • Built with Streamlit for an interactive UI.

Installation

Follow these steps to set up the project on your local machine:

1. Clone the Repository

git clone https://github.com/tejas-130704/WebScraperAI.git
cd WebScraperAI

2. Create a Virtual Environment

python -m venv venv

3. Activate the Virtual Environment

  • Windows:
    venv\Scripts\activate
  • Mac/Linux:
    source venv/bin/activate

4. Install Dependencies

pip install -r requirements.txt

Usage

1. Run the Streamlit App

streamlit run app.py

2. Enter Details

  • Select Model: Choose an LLM for processing.
  • Enter API Key: Provide the API key for the selected LLM.
  • Enter Website URL: Input the URL to analyze.
  • Choose Deep Analysis (Optional): Check this box if you want to analyze linked pages.

3. Click Load Website & LLM to start the process.

  • After processing, enter a question related to the webpage and click Ask Question.

Caution ⚠️

  • Use Deep Analysis Only for Limited Scope Websites: Avoid using it on large websites like Wikipedia, as the high number of linked pages may cause extreme delays or failures.
  • Respect Website Policies: Some sites may have anti-scraping policies. Always ensure compliance.
  • API Limits: LLM responses are subject to API limits and costs depending on the provider.

Future Enhancements

  • Implement caching to improve deep analysis speed.
  • Add support for multi-threaded scraping.
  • Introduce a ranking system for LLM performance comparison.

Contributing

Pull requests are welcome! If you find any issues, feel free to open an issue in the repository.

About

WebScraperAI is a powerful tool that enables users to perform question-answering on website content using web scraping and retrieval-augmented generation (RAG) with LlamaIndex. It supports multiple LLMs, including OpenAI GPT-3.5, GPT-4, Gemini Pro, Gemini Ultra, and DeepSeek.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages