DataAnalystBot 🤖

DataAnalystBot is an interactive, AI-powered assistant designed to help users with all things data analysis. It leverages advanced retrieval-augmented generation (RAG) techniques, a custom vector database, and a conversational interface to provide expert guidance on data cleaning, visualization, statistics, machine learning, and popular tools like Python, SQL, Excel, and more.


🚀 Features

  • Conversational AI: Chat with an LLM (Llama 3/4 via Groq) about any data analysis topic.
  • Multi-File Upload & Analysis: Upload and analyze images (charts, screenshots), CSV/Excel files, and PDFs simultaneously. The bot uses all provided files as context for your question via the /multi-upload endpoint.
  • Data Cleaning & Analysis Endpoints: Use /analyze-data for full AI-powered analysis (cleaning, stats, insights, visualizations) and /clean-data for fast, quota-free cleaning and summary.
  • Modern GUI: Redesigned Streamlit interface with tabs for chat and data upload, sidebar controls, recent chat management, and raw data preview.
  • Image Understanding: Upload images and ask questions about them. The bot uses a multimodal LLM to analyze and respond, then grounds the answer using your chat history and knowledge base.
  • CSV Data Analysis: Upload a CSV file and ask questions about its content. The bot uses the CSV content as context for the LLM, providing data-aware answers.
  • PDF Data Analysis: Upload a PDF file and ask questions about its content. The bot extracts text from the PDF and uses it as context for the LLM, enabling document-aware responses.
  • File Caching: Uploaded CSV, image, and PDF data are cached for each session, enabling fast, context-aware follow-up questions without re-uploading or re-processing files.
  • Image Upload Rate Limiting: Each user can upload up to 3 images every 6 hours. If the limit is reached, only text, CSV, or PDF questions are allowed until the window resets.
  • Image Display in Chat: Uploaded images are shown inline with your messages for easy reference.
  • Retrieval-Augmented Generation (RAG): Answers are grounded in a curated, chunked knowledge base from top data science sources.
  • Session Memory: Each user session maintains its own chat history for context-aware conversations.
  • Recent Chats: All conversations are saved and can be resumed from the sidebar.
  • Custom Vector Database: Fast, semantic search over chunked documents using FAISS and HuggingFace embeddings.
  • Modern UI: Built with Streamlit for a clean, interactive chat experience.
  • Extensible Scrapers: Easily add new data sources with modular web scrapers.
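As a sketch of the file-caching behavior described above (illustrative names only, not the project's actual code), parsed uploads can be keyed by session and content hash so follow-up questions skip re-parsing:

```python
import hashlib

class SessionFileCache:
    """Illustrative per-session cache: parsed file content keyed by content hash."""

    def __init__(self):
        self._store = {}  # (session_id, file_hash) -> parsed content

    @staticmethod
    def file_key(raw: bytes) -> str:
        return hashlib.sha256(raw).hexdigest()

    def get_or_parse(self, session_id: str, raw: bytes, parse):
        key = (session_id, self.file_key(raw))
        if key not in self._store:
            self._store[key] = parse(raw)  # expensive parse runs only once
        return self._store[key]

# Usage: repeated questions about the same upload reuse the cached parse.
cache = SessionFileCache()
calls = []

def parse_csv(raw: bytes) -> str:
    calls.append(1)  # count how many times parsing actually runs
    return raw.decode()

first = cache.get_or_parse("sess-1", b"a,b\n1,2", parse_csv)
second = cache.get_or_parse("sess-1", b"a,b\n1,2", parse_csv)
```

Hashing the raw bytes means re-uploading an identical file also hits the cache, which matches the "without re-uploading or re-processing" behavior described above.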

📸 Screenshots

Chat UI example: chatting with DataAnalystBot about Power BI for data analysis.


πŸ—οΈ Architecture Overview

```mermaid
flowchart TD
    subgraph "👤 User Interface"
        A[👤 User] -->|📤 Uploads Files & Asks Questions| B[🖥️ Streamlit Web App]
    end

    subgraph "🔄 Processing Layer"
        B -->|📡 Sends Request| C[⚡ FastAPI Server]
        C -->|💾 Stores Uploads| J[📁 File Storage]
        C -->|🔍 Retrieves Context| E[🗄️ Vector Database]
        C -->|🧠 Generates Answer| D[🤖 AI Model - Groq]
    end

    subgraph "💾 Data Storage"
        E[🗄️ FAISS Vector Database]
        F[🔤 HuggingFace Embeddings]
        G[💭 Session Memory]
        I[⚡ Cache Storage]
        K[💬 Chat History]
        H[🕷️ Web Scrapers]
    end

    %% Data Flow
    E --> F
    H -->|📊 Adds Scraped Data| E
    C -->|💾 Saves Session| G
    C -->|⚡ Caches Results| I
    C -->|💬 Stores Chats| K

    %% Response Flow
    D -->|✅ AI Response| C
    C -->|📋 Final Answer| B
    B -->|📺 Shows Result| A

    class A,B userStyle
    class C,D,J processStyle
    class E,F,G,H,I,K storageStyle
```

📚 Data Sources

All articles are scraped, chunked into ~500-character segments, and stored in data/data.jsonl for efficient retrieval.
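A minimal sketch of the chunking step, assuming JSONL records with source and text fields (the actual field names in data/data.jsonl may differ):

```python
import json
import textwrap

def chunk_article(url: str, text: str, width: int = 500) -> list:
    """Split scraped article text into ~width-character chunks,
    serializing each chunk as one JSONL line."""
    return [json.dumps({"source": url, "text": chunk})
            for chunk in textwrap.wrap(text, width=width)]

# Toy input: ~1700 characters of text yields several <=500-char chunks.
lines = chunk_article("https://example.com/post", "data analysis " * 120)
```

textwrap.wrap breaks on whitespace, so chunks never exceed the width unless a single token is longer than 500 characters.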


πŸ› οΈ Tech Stack


⚡ Quickstart

1. Clone the Repository

```shell
git clone https://github.com/Lokesh-DataScience/Data-Analyst-Expert-Bot.git
cd Data-Analyst-Expert-Bot
```

2. Install Dependencies

```shell
python -m venv .venv
.venv\Scripts\activate        # On Windows
# source .venv/bin/activate   # On macOS/Linux
pip install -r requirements.txt
```

3. Set Up Environment Variables

Create a .env file in the root directory:

```
GROQ_API_KEY=your_groq_api_key
LANGSMITH_API_KEY=your_langsmith_api_key
```

4. Scrape and Prepare Data

Run the scrapers in the scrapers/ folder to populate data/data.jsonl with chunked content:

```shell
python scrapers/gfg_scraper.py
python scrapers/pointtech_scraper.py
python scrapers/towardsdatascience_scrapper.py
```

5. Build the Vector Database

```shell
python vector_db/faiss_db.py
```

6. Start the Backend API

```shell
uvicorn api.main:app --reload
```

7. Launch the Streamlit Frontend

```shell
streamlit run streamlit_app/app.py
```

💬 Usage

  • Open http://localhost:8501 in your browser.
  • Ask questions about data analysis, tools, or techniques.
  • To analyze an image: Upload a jpg, jpeg, or png file and enter your question. The bot will analyze the image and respond.
  • To analyze a CSV: Upload a CSV file and ask a question about its content. The bot will use the CSV data as context for its answer.
  • To analyze a PDF: Upload a PDF file and ask a question about its content. The bot will use the PDF text as context for its answer.
  • Note: You can upload up to 3 images every 6 hours. If you reach the limit, you can still ask text questions.
  • Resume conversations: Select any recent chat from the sidebar to continue where you left off.
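The 3-images-per-6-hours limit can be modeled as a sliding-window counter; this is a minimal sketch, not the project's actual implementation:

```python
import time
from collections import defaultdict, deque

class ImageRateLimiter:
    """Allow at most `limit` image uploads per user within a sliding
    `window` of seconds (defaults match the README: 3 per 6 hours)."""

    def __init__(self, limit=3, window=6 * 3600):
        self.limit = limit
        self.window = window
        self._uploads = defaultdict(deque)  # user_id -> upload timestamps

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        stamps = self._uploads[user_id]
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()  # forget uploads older than the window
        if len(stamps) < self.limit:
            stamps.append(now)
            return True
        return False  # over the limit: only text/CSV/PDF questions allowed

# Three uploads succeed, the fourth is blocked, and the window
# frees up again once the oldest uploads age out.
limiter = ImageRateLimiter()
results = [limiter.allow("user-1", now=t) for t in (0, 60, 120, 180)]
after_reset = limiter.allow("user-1", now=6 * 3600 + 61)
```

A sliding window like this resets gradually (each upload expires 6 hours after it happened) rather than all at once; a fixed-window counter would be an equally plausible reading of the limit.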

🧩 Project Structure

```text
DataAnalystBot/
│
├── api/                  # FastAPI backend
│   └── main.py
├── chains/               # RAG chain construction
│   └── rag_chain.py
├── data/                 # Chunked knowledge base (JSONL)
│   └── data.jsonl
├── loaders/              # Data loading utilities
│   ├── load_data.py
│   ├── load_csv.py
│   └── load_pdf.py
├── memory/               # Session memory management
│   └── session_memory.py
├── scrapers/             # Web scrapers for sources
│   ├── gfg_scraper.py
│   ├── pointtech_scraper.py
│   └── towardsdatascience_scrapper.py
├── streamlit_app/        # Streamlit UI
│   ├── components/
│   ├── config/
│   ├── styles/
│   ├── utils/
│   └── app.py
├── vector_db/            # Vector DB creation/loading
│   └── faiss_db.py
├── requirements.txt
└── README.md
```

πŸ“ Customization

  • Add new sources: Write a new scraper in scrapers/, chunk the content, and append to data/data.jsonl.
  • Change chunk size: Adjust the width parameter in the textwrap.wrap(..., width=500) calls in the scrapers.
  • Swap LLM or embeddings: Update model names in chains/rag_chain.py or vector_db/faiss_db.py.
  • Switch between full analysis and fast cleaning: Use /analyze-data for AI-powered insights, or /clean-data for quick cleaning and stats.
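For intuition about what the FAISS index provides, here is retrieval reduced to cosine similarity over embedding vectors, with NumPy and toy 2-D vectors standing in for FAISS and the HuggingFace embedding model:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]  # sort by descending similarity

# Toy 2-D "embeddings"; real ones come from a sentence-embedding model
# and live in hundreds of dimensions.
doc_vecs = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.9, 0.1]])
hits = top_k(np.array([1.0, 0.2]), doc_vecs)
```

FAISS does the same nearest-neighbor ranking but with index structures that scale to millions of vectors; swapping the embedding model only requires that documents and queries are embedded by the same model.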

πŸ›‘οΈ Security & Privacy

  • All chat history is stored in memory per session and is not persisted.
  • API keys are loaded from .env and never exposed to the frontend.

🤝 Contributing

Pull requests, issues, and feature suggestions are welcome!
Please open an issue or submit a PR.


📄 License

MIT License. See LICENSE for details.


πŸ™ Acknowledgements


Happy Analyzing! 🚀
