WebScraperAI

Overview

WebScraperAI is a web-based tool for asking questions about the content of a website, given its URL. It supports multiple LLMs and has two modes of operation:

Page-Specific Q&A: Extracts information only from the given webpage.
Deep Analysis Q&A: Extracts information from the given page and all its linked pages (use cautiously, as it may take a long time for large websites).
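
The difference between the two modes can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses the standard library's `html.parser` instead of BeautifulSoup to stay dependency-free, and the names `extract_links`, `collect_pages`, `fetch`, and `max_pages` are hypothetical. Restricting links to the same host and capping the page count are assumptions made here to keep the example bounded.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free links that stay on base_url's host."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    found = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Assumption: only same-host links are followed.
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def collect_pages(start_url, fetch, deep=False, max_pages=20):
    """Return {url: html}. `fetch` is a hypothetical callable: url -> html."""
    pages = {start_url: fetch(start_url)}
    if deep:
        # Deep analysis: also load the pages linked from the start page.
        for link in extract_links(pages[start_url], start_url):
            if len(pages) >= max_pages:
                break
            if link not in pages:
                pages[link] = fetch(link)
    return pages
```

With `deep=False` only the start page is loaded; with `deep=True` the number of requests grows with the number of links on that page, which is why large sites take much longer.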

The project uses BeautifulSoup for web scraping and a retrieval-augmented generation (RAG) pipeline built with LlamaIndex and HuggingFace to improve response accuracy. Supported LLMs include:

OpenAI GPT-3.5
OpenAI GPT-4
Gemini Pro
Gemini Ultra
DeepSeek
Groq
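
To illustrate the retrieval step of a RAG pipeline, here is a toy sketch. It is not how LlamaIndex or HuggingFace work internally: it swaps embedding similarity for simple word overlap so that it runs with the standard library alone, and every function name in it (`split_into_chunks`, `retrieve`, `build_prompt`) is hypothetical.

```python
import re


def split_into_chunks(text, max_words=50):
    """Split scraped text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=2):
    """Rank chunks by how many question words they share (a stand-in for
    embedding similarity) and return the top_k best matches."""
    question_tokens = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(question_tokens & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks, top_k=2):
    """Assemble the prompt that would be sent to the selected LLM."""
    context = "\n\n".join(retrieve(question, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The idea is that the LLM sees only the few chunks most relevant to the question rather than the whole site, which keeps prompts short and answers tied to the scraped content.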

Features

Extracts and analyzes website content for Q&A.
Offers two modes: specific page analysis and deep analysis.
Supports multiple LLMs for flexibility.
Built with Streamlit for an interactive UI.

Installation

Follow these steps to set up the project on your local machine:

1. Clone the Repository

```
git clone https://github.com/tejas-130704/WebScraperAI.git
cd WebScraperAI
```

2. Create a Virtual Environment

```
python -m venv venv
```

3. Activate the Virtual Environment

Windows:
```
venv\Scripts\activate
```
Mac/Linux:
```
source venv/bin/activate
```

4. Install Dependencies

```
pip install -r requirements.txt
```

Usage

1. Run the Streamlit App

```
streamlit run app.py
```

2. Enter Details

Select Model: Choose an LLM for processing.
Enter API Key: Provide the API key for the selected LLM.
Enter Website URL: Input the URL to analyze.
Choose Deep Analysis (Optional): Check this box if you want to analyze linked pages.

3. Click Load Website & LLM to start the process.

After processing, enter a question related to the webpage and click Ask Question.

Caution ⚠️

Use Deep Analysis Only for Limited-Scope Websites: Avoid using it on large websites like Wikipedia, where the high number of linked pages may cause extreme delays or failures.
Respect Website Policies: Some sites prohibit or restrict scraping. Check a site's terms of service and robots.txt before analyzing it.
API Limits: LLM responses are subject to API limits and costs depending on the provider.
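
One way to check a site's scraping rules before loading it is Python's built-in `urllib.robotparser`. The sketch below is an illustration only and is not part of the project as described; the helper name `is_allowed` is hypothetical, and it takes the robots.txt content as a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/docs/intro"))
print(is_allowed(rules, "https://example.com/private/data"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file directly. Note that robots.txt covers crawler access only; a site's terms of service may impose further restrictions.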

Future Enhancements

Implement caching to improve deep analysis speed.
Add support for multi-threaded scraping.
Introduce a ranking system for LLM performance comparison.
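
As a rough idea of what the first two items could look like, here is a minimal sketch that combines an in-memory cache with a thread pool. Nothing here exists in the project yet; `CachedFetcher`, `fetch_all`, and the `fetch` callable are hypothetical names, and a real implementation would likely also need cache expiry and rate limiting.

```python
from concurrent.futures import ThreadPoolExecutor


class CachedFetcher:
    """Wraps a fetch callable (url -> html) and remembers earlier results."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}

    def get(self, url):
        # Two threads asking for the same uncached URL at the same moment
        # may both fetch it; the result is still correct, just not deduplicated.
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]


def fetch_all(urls, fetcher, max_workers=4):
    """Fetch several URLs concurrently and return {url: html}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetcher.get, urls)))
```

Because page downloads spend most of their time waiting on the network, threads can overlap those waits, and the cache avoids re-downloading a page when the same site is analyzed again in one session.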

Contributing

Pull requests are welcome! If you find any issues, feel free to open an issue in the repository.
