DataAnalystBot is an interactive, AI-powered assistant designed to help users with all things data analysis. It leverages advanced retrieval-augmented generation (RAG) techniques, a custom vector database, and a conversational interface to provide expert guidance on data cleaning, visualization, statistics, machine learning, and popular tools like Python, SQL, Excel, and more.
- Conversational AI: Chat with an LLM (Llama 3/4 via Groq) about any data analysis topic.
- Multi-File Upload & Analysis: Upload and analyze images (charts, screenshots), CSV/Excel files, and PDFs simultaneously. The bot uses all provided files as context for your question via the `/multi-upload` endpoint.
- Data Cleaning & Analysis Endpoints: Use `/analyze-data` for full AI-powered analysis (cleaning, stats, insights, visualizations) and `/clean-data` for fast, quota-free cleaning and summary.
- Modern GUI: Redesigned Streamlit interface with tabs for chat and data upload, sidebar controls, recent chat management, and raw data preview.
- Image Understanding: Upload images and ask questions about them. The bot uses a multimodal LLM to analyze and respond, then grounds the answer using your chat history and knowledge base.
- CSV Data Analysis: Upload a CSV file and ask questions about its content. The bot uses the CSV content as context for the LLM, providing data-aware answers.
- PDF Data Analysis: Upload a PDF file and ask questions about its content. The bot extracts text from the PDF and uses it as context for the LLM, enabling document-aware responses.
- File Caching: Uploaded CSV, image, and PDF data are cached for each session, enabling fast, context-aware follow-up questions without re-uploading or re-processing files.
- Image Upload Rate Limiting: Each user can upload up to 3 images every 6 hours. If the limit is reached, only text, CSV, or PDF questions are allowed until the window resets.
- Image Display in Chat: Uploaded images are shown inline with your messages for easy reference.
- Retrieval-Augmented Generation (RAG): Answers are grounded in a curated, chunked knowledge base from top data science sources.
- Session Memory: Each user session maintains its own chat history for context-aware conversations.
- Recent Chats: All conversations are saved and can be resumed from the sidebar.
- Custom Vector Database: Fast, semantic search over chunked documents using FAISS and HuggingFace embeddings.
- Modern UI: Built with Streamlit for a clean, interactive chat experience.
- Extensible Scrapers: Easily add new data sources with modular web scrapers.
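As a client-side sketch of how the `/multi-upload` endpoint might be called: the helper below builds the multipart payload for a question plus several files. The form-field names (`question`, `files`) and the URL are assumptions, not confirmed by the API code.

```python
# Hedged sketch: build (data, files) arguments for a multipart POST to
# /multi-upload. Field names "question" and "files" are assumptions.

def build_multi_upload_payload(question, named_blobs):
    """named_blobs: list of (filename, bytes) pairs, e.g. a CSV plus a chart image."""
    data = {"question": question}
    files = [("files", (name, blob)) for name, blob in named_blobs]
    return data, files

# Usage with the requests library (not executed here):
# data, files = build_multi_upload_payload(
#     "What does this chart say about sales?",
#     [("sales.csv", open("sales.csv", "rb").read()),
#      ("chart.png", open("chart.png", "rb").read())],
# )
# requests.post("http://localhost:8000/multi-upload", data=data, files=files)
```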
Chat with DataAnalystBot about Power BI for data analysis!
```mermaid
flowchart TD
    subgraph "User Interface"
        A[User] -->|Uploads Files & Asks Questions| B[Streamlit Web App]
    end
    subgraph "Processing Layer"
        B -->|Sends Request| C[FastAPI Server]
        C -->|Stores Uploads| J[File Storage]
        C -->|Retrieves Context| E[Vector Database]
        C -->|Generates Answer| D[AI Model - Groq]
    end
    subgraph "Data Storage"
        E[FAISS Vector Database]
        F[HuggingFace Embeddings]
        G[Session Memory]
        I[Cache Storage]
        K[Chat History]
        H[Web Scrapers]
    end
    %% Data Flow
    E --> F
    H -->|Adds Scraped Data| E
    C -->|Saves Session| G
    C -->|Caches Results| I
    C -->|Stores Chats| K
    %% Response Flow
    D -->|AI Response| C
    C -->|Final Answer| B
    B -->|Shows Result| A
    class A,B userStyle
    class C,D,J processStyle
    class E,F,G,H,I,K storageStyle
```
All articles are scraped, chunked (~500 characters), and stored in `data/data.jsonl` for efficient retrieval.
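The scrape-and-chunk step can be sketched as follows, assuming `textwrap.wrap` with a 500-character width (as the customization notes suggest); the JSONL field names here are illustrative assumptions, not the repository's actual schema:

```python
import json
import textwrap

def chunk_article(text, source, width=500):
    """Split one article into ~500-character chunks, one record per chunk.

    The "source"/"text" field names are assumptions for illustration.
    """
    return [{"source": source, "text": chunk}
            for chunk in textwrap.wrap(text, width=width)]

def append_jsonl(records, path="data/data.jsonl"):
    # Each record becomes one line of the JSONL knowledge base.
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```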
- Frontend: Streamlit
- Backend: FastAPI
- LLM: Groq Llama 3 & Multimodal Llama 4
- Vector DB: FAISS
- Embeddings: HuggingFace Transformers
- Web Scraping: Selenium
- Session Memory: In-memory per-session chat history
- Caching: DiskCache and Streamlit cache for fast file and context retrieval
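As a rough illustration of the in-memory, per-session chat history described above (class and method names are assumptions, not the actual code in `memory/session_memory.py`):

```python
from collections import defaultdict

class SessionMemory:
    """Per-session chat history, kept only in process memory (not persisted)."""

    def __init__(self):
        self._histories = defaultdict(list)  # session_id -> [(role, text), ...]

    def add(self, session_id, role, text):
        self._histories[session_id].append((role, text))

    def history(self, session_id):
        # Return a copy so callers cannot mutate the stored history.
        return list(self._histories[session_id])
```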
```shell
git clone https://github.com/Lokesh-DataScience/Data-Analyst-Expert-Bot.git
cd DataAnalystBot
```

Create a virtual environment and install dependencies:

```shell
python -m venv .venv
.venv\Scripts\activate        # On Windows
# source .venv/bin/activate   # On macOS/Linux
pip install -r requirements.txt
```

Create a `.env` file in the root directory:

```
GROQ_API_KEY=your_groq_api_key
LANGSMITH_API_KEY=your_langsmith_api_key
```

Run the scrapers in the `scrapers/` folder to populate `data/data.jsonl` with chunked content:

```shell
python scrapers/gfg_scraper.py
python scrapers/pointtech_scraper.py
python scrapers/towardsdatascience_scrapper.py
```

Build the vector database:

```shell
python vector_db/faiss_db.py
```

Start the FastAPI backend:

```shell
uvicorn api.main:app --reload
```

Launch the Streamlit app:

```shell
streamlit run streamlit_app/app.py
```

- Open http://localhost:8501 in your browser.
- Ask questions about data analysis, tools, or techniques.
- To analyze an image: Upload a jpg, jpeg, or png file and enter your question. The bot will analyze the image and respond.
- To analyze a CSV: Upload a CSV file and ask a question about its content. The bot will use the CSV data as context for its answer.
- To analyze a PDF: Upload a PDF file and ask a question about its content. The bot will use the PDF text as context for its answer.
- Note: You can upload up to 3 images every 6 hours. If you reach the limit, you can still ask text questions.
- Resume conversations: Select any recent chat from the sidebar to continue where you left off.
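The upload limit above (3 images per 6-hour window) amounts to a sliding-window rate limiter. The sketch below shows one way such a limiter can work; it is an illustration, not the project's actual implementation.

```python
import time

WINDOW_SECONDS = 6 * 60 * 60  # 6-hour window
MAX_UPLOADS = 3               # images allowed per window

def allow_upload(timestamps, now=None):
    """timestamps: mutable list of one user's prior upload times (epoch seconds)."""
    now = time.time() if now is None else now
    # Drop uploads that have fallen outside the 6-hour window.
    timestamps[:] = [t for t in timestamps if now - t < WINDOW_SECONDS]
    if len(timestamps) < MAX_UPLOADS:
        timestamps.append(now)
        return True
    return False
```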
```
DataAnalystBot/
│
├── api/                  # FastAPI backend
│   └── main.py
├── chains/               # RAG chain construction
│   └── rag_chain.py
├── data/                 # Chunked knowledge base (JSONL)
│   └── data.jsonl
├── loaders/              # Data loading utilities
│   ├── load_data.py
│   ├── load_csv.py
│   └── load_pdf.py
├── memory/               # Session memory management
│   └── session_memory.py
├── scrapers/             # Web scrapers for sources
│   ├── gfg_scraper.py
│   ├── pointtech_scraper.py
│   └── towardsdatascience_scrapper.py
├── streamlit_app/        # Streamlit UI
│   ├── components/
│   ├── config/
│   ├── styles/
│   ├── utils/
│   └── app.py
├── vector_db/            # Vector DB creation/loading
│   └── faiss_db.py
├── requirements.txt
└── README.md
```
- Add new sources: Write a new scraper in `scrapers/`, chunk the content, and append to `data/data.jsonl`.
- Change chunk size: Adjust the `textwrap.wrap(..., width=500)` call in the scrapers.
- Swap LLM or embeddings: Update model names in `chains/rag_chain.py` or `vector_db/faiss_db.py`.
- Switch between full analysis and fast cleaning: Use `/analyze-data` for AI-powered insights, or `/clean-data` for quick cleaning and stats.
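A model swap usually comes down to changing a couple of string constants. The names and defaults below are assumptions about what `chains/rag_chain.py` and `vector_db/faiss_db.py` might contain, shown only to illustrate the kind of edit involved:

```python
# Illustrative only: constant names and defaults are assumptions,
# not the repository's actual code.
LLM_MODEL = "llama-3.3-70b-versatile"                       # Groq chat model id
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # HuggingFace embeddings

# Example swap to a different embedding model:
# EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
```

Note that changing the embedding model requires rebuilding the FAISS index, since stored vectors from different models are not comparable.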
- All chat history is stored in memory per session and is not persisted.
- API keys are loaded from `.env` and never exposed to the frontend.
Pull requests, issues, and feature suggestions are welcome!
Please open an issue or submit a PR.
MIT License. See LICENSE for details.
Happy Analyzing!