This project implements a Retrieval-Augmented Generation (RAG) pipeline for querying unstructured PDF documents (research papers from arXiv). The bot summarizes AI/ML research papers in response to user queries, combining embeddings, vector search, and a large language model to return context-aware answers in real time.

Note: the bot only knows about the limited set of papers that have been ingested into the vector store.
## Features

- Document Ingestion (`core/data_loader.py`): load and chunk PDF documents.
- Embeddings (`core/embedding_manager.py`): generate 384-dimensional sentence embeddings with `all-MiniLM-L6-v2`.
- Vector Store (`core/vector_store.py`): store and search embeddings using ChromaDB (HNSW indexing).
- Retriever (`core/retriever.py`): fetch relevant context for queries.
- Pipeline (`pipelines/rag_pipeline.py`): combine the retriever with an LLM (Google's `gemma2-9b-it`) for RAG responses.
- Streamlit UI (`main.py`): simple, interactive interface for querying documents.
- Configurable (`config.py`): centralized settings for model, database, and pipeline options.
- Experiments (`notebooks/rag_pipeline.ipynb`): notebook with experiments and benchmarks.
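To illustrate the ingestion step, here is a minimal sketch of fixed-size chunking with overlap; the function name, chunk size, and overlap values are assumptions for illustration, not the actual `core/data_loader.py` implementation:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap,
    so context spanning a boundary isn't lost between chunks."""
    chunks = []
    step = chunk_size - overlap  # each chunk starts `overlap` chars before the previous one ends
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# A 1200-character document yields 3 chunks: [0:500], [450:950], [900:1200]
pieces = chunk_text("x" * 1200)
print(len(pieces))  # 3
```

Overlap is a common default in RAG loaders because sentences cut at a chunk boundary otherwise become unretrievable.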
## Installation

This project uses uv for Python package management. Make sure you have uv installed first:

```
pip install uv
```

Clone the repo and install dependencies:

```
git clone https://github.com/<your-username>/<repo-name>.git
cd <repo-name>
uv sync
```

## Build the Database (one-time setup)
- Upload your PDFs to the `data/pdf_files` directory.
- Then run:

```
python main.py --build
```

## API Setup
- Get an API key for the `gemma2-9b-it` model from the Groq console.
- Create a `.env` file in the project root and assign your API key to `GROQ_API_KEY`.
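For example, the `.env` file needs just one line (placeholder value shown, not a real key):

```
GROQ_API_KEY=your_groq_api_key_here
```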
## Run the App

Start the Streamlit app locally:

```
streamlit run main.py
```

Type your query about a research paper and get context-aware answers.
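Under the hood, each query follows a retrieve-then-generate flow. The sketch below uses pure-Python stand-ins to show the idea; the real project uses ChromaDB's HNSW index, 384-dim MiniLM embeddings, and the Groq API, so every name and the toy 3-dim vectors here are illustrative only:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], store: list[tuple[str, list[float]]], top_k: int = 2) -> list[str]:
    """Rank stored (text, embedding) pairs by similarity to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble the context-stuffed prompt that would be sent to the LLM."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Toy store with 3-dim "embeddings"
store = [
    ("Paper A discusses transformers.", [1.0, 0.0, 0.0]),
    ("Paper B discusses CNNs.",         [0.0, 1.0, 0.0]),
    ("Paper C discusses RL.",           [0.0, 0.0, 1.0]),
]
chunks = retrieve([0.9, 0.1, 0.0], store, top_k=1)
prompt = build_prompt("What does Paper A cover?", chunks)
```

The LLM never sees the whole corpus: only the top-k chunks are stuffed into the prompt, which is what keeps the answers grounded in the ingested papers.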
## Project Structure

```
.
├── index_evaluation/              # Similarity-search techniques benchmarking
│   └── vector_store_interface.py  # Common interface for benchmarking different ANN techniques
├── core/                          # Core components
│   ├── data_loader.py             # PDF loading + chunking
│   ├── embedding_manager.py       # Embedding generation
│   ├── retriever.py               # Context retrieval
│   └── vector_store.py            # ChromaDB integration
│
├── data/                          # Input and storage
│   ├── pdf_files/                 # Source documents
│   └── vector_store/              # Persisted ChromaDB index
│
├── notebooks/
│   └── rag_pipeline.ipynb         # Experiments & benchmarks
│
├── pipelines/
│   └── rag_pipeline.py            # Full RAG pipeline logic
│
├── config.py                      # Global configs
├── main.py                        # Streamlit entry point
├── pyproject.toml                 # uv dependencies
├── requirements.txt               # pip fallback
├── uv.lock                        # uv lock file
├── .gitignore
└── README.md
```
## Roadmap

- Benchmark the retrieval strategies and integrate the best one into the Q&A bot.
## References

- https://www.youtube.com/watch?v=fZM3oX4xEyg&list=PLZoTAELRMXVM8Pf4U67L4UuDRgV4TNX9D
- https://www.singlestore.com/blog/a-guide-to-retrieval-augmented-generation-rag/
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- https://python.langchain.com/docs/introduction/
- https://console.groq.com/docs/