WebScraperAI is a web-based tool that allows users to perform question-answering on a given website URL. It supports multiple LLMs and has two modes of operation:
- Page-Specific Q&A: Extracts information only from the given webpage.
- Deep Analysis Q&A: Extracts information from the given page and all its linked pages (use cautiously, as it may take a long time for large websites).
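Deep Analysis mode first has to collect every link on the starting page before scraping each one. A minimal sketch of that step with BeautifulSoup (function and variable names here are illustrative, not the project's actual code; same-domain filtering is an assumption to keep the crawl bounded):

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def collect_links(html: str, base_url: str) -> set[str]:
    """Return absolute URLs of all same-domain links found in the page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    base_domain = urlparse(base_url).netloc
    links = set()
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])
        # Keep only http(s) links on the same domain to bound the crawl.
        if urlparse(url).scheme in ("http", "https") and urlparse(url).netloc == base_domain:
            links.add(url.split("#")[0])  # drop in-page fragments
    return links
```

Each collected URL would then be fetched and scraped the same way as the starting page, which is why link-heavy sites make this mode slow.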
The project uses BeautifulSoup for web scraping and a retrieval-augmented generation (RAG) pipeline built with LlamaIndex and HuggingFace to improve response accuracy. Supported LLMs include:
- OpenAI GPT-3.5
- OpenAI GPT-4
- Gemini Pro
- Gemini Ultra
- DeepSeek
- Groq
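The core idea of the RAG pipeline is to retrieve only the chunks of scraped text most relevant to the question and pass those to the chosen LLM as context. The project does this with LlamaIndex and HuggingFace embeddings; the dependency-free sketch below substitutes a simple word-overlap score for real embeddings just to show the retrieval step (all names are illustrative):

```python
def chunk_text(text: str, size: int = 40) -> list[str]:
    """Split scraped text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(chunks: list[str], question: str, top_k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the question (stand-in for embedding similarity)."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]
```

A real pipeline embeds the chunks, stores them in a vector index, and prepends the retrieved chunks to the LLM prompt before asking the question.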
Key features:
- Extracts and analyzes website content for Q&A.
- Offers two modes: specific page analysis and deep analysis.
- Supports multiple LLMs for flexibility.
- Built with Streamlit for an interactive UI.
Follow these steps to set up the project on your local machine:
1. Clone the repository:
```bash
git clone https://github.com/tejas-130704/WebScraperAI.git
cd WebScraperAI
```
2. Create a virtual environment:
```bash
python -m venv venv
```
3. Activate it:
- Windows:
```bash
venv\Scripts\activate
```
- Mac/Linux:
```bash
source venv/bin/activate
```
4. Install the dependencies:
```bash
pip install -r requirements.txt
```
5. Run the app:
```bash
streamlit run app.py
```
Then use the app as follows:
- Select Model: Choose an LLM for processing.
- Enter API Key: Provide the API key for the selected LLM.
- Enter Website URL: Input the URL to analyze.
- Choose Deep Analysis (Optional): Check this box if you want to analyze linked pages.
- After processing, enter a question related to the webpage and click Ask Question.
- Use Deep Analysis Only for Limited Scope Websites: Avoid using it on large websites like Wikipedia, as the high number of linked pages may cause extreme delays or failures.
- Respect Website Policies: Some sites may have anti-scraping policies. Always ensure compliance.
- API Limits: LLM responses are subject to API limits and costs depending on the provider.
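The policy note above can be made concrete: Python's standard library ships `urllib.robotparser` for checking a site's robots.txt before fetching a page. Whether WebScraperAI performs this check is not stated; this is a hedged sketch of how a scraper could:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check a URL against robots.txt rules (the rules are normally fetched from /robots.txt)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

In practice you would point the parser at the live file with `parser.set_url("https://site/robots.txt")` followed by `parser.read()` instead of passing the rules as a string.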
- Implement caching to improve deep analysis speed.
- Add support for multi-threaded scraping.
- Introduce a ranking system for LLM performance comparison.
Pull requests are welcome! If you find any issues, feel free to open an issue in the repository.





