This project implements a Retrieval-Augmented Generation (RAG) pipeline for querying unstructured PDF documents (research papers from arXiv). The bot summarizes AI/ML research papers in response to user queries, combining embeddings, vector search, and a large language model to return context-aware answers in real time.

Note: the bot only knows about the limited set of papers that have been ingested into the vector store.
## Features

- Document Ingestion (`core/data_loader.py`): load and chunk PDF documents.
- Embeddings (`core/embedding_manager.py`): generate 384-dimensional sentence embeddings with `all-MiniLM-L6-v2`.
- Vector Store (`core/vector_store.py`): store and search embeddings using ChromaDB (HNSW indexing).
- Retriever (`core/retriever.py`): fetch relevant context for queries.
- Pipeline (`pipelines/rag_pipeline.py`): combine the retriever with an LLM (Google's `gemma2-9b-it`) for RAG responses.
- Streamlit UI (`main.py`): simple, interactive interface for querying documents.
- Configurable (`config.py`): centralized settings for model, database, and pipeline options.
- Experiments (`notebooks/rag_pipeline.ipynb`): notebook with experiments and benchmarks.
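To illustrate the ingestion step, here is a minimal sketch of fixed-size chunking with overlap; the function name, chunk size, and overlap values are assumptions for illustration, not the actual `core/data_loader.py` implementation:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap,
    so context spanning a boundary isn't lost between chunks."""
    chunks = []
    step = chunk_size - overlap  # each chunk starts `overlap` chars before the previous one ends
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# A 1200-character document yields 3 chunks: [0:500], [450:950], [900:1200]
pieces = chunk_text("x" * 1200)
print(len(pieces))  # 3
```

Overlap is a common default in RAG loaders because sentences cut at a chunk boundary otherwise become unretrievable.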
## Installation

This project uses uv for Python package management. Make sure you have uv installed first:

```
pip install uv
```

Clone the repo and install dependencies:

```
git clone https://github.com/<your-username>/<repo-name>.git
cd <repo-name>
uv sync
```

## Build the Database (one-time setup)
- Upload your PDFs to the `data/pdf_files` directory.
- Then run:

```
python main.py --build
```

## API Setup
- Get an API key for the `gemma2-9b-it` model from the Groq console.
- Create a `.env` file in the project root and assign your API key to `GROQ_API_KEY`.
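For example, the `.env` file needs just one line (placeholder value shown, not a real key):

```
GROQ_API_KEY=your_groq_api_key_here
```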
## Run the App

Start the Streamlit app locally:

```
streamlit run main.py
```

Type your query about a research paper and get context-aware answers.
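Under the hood, each query follows a retrieve-then-generate flow. The sketch below uses pure-Python stand-ins to show the idea; the real project uses ChromaDB's HNSW index, 384-dim MiniLM embeddings, and the Groq API, so every name and the toy 3-dim vectors here are illustrative only:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], store: list[tuple[str, list[float]]], top_k: int = 2) -> list[str]:
    """Rank stored (text, embedding) pairs by similarity to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble the context-stuffed prompt that would be sent to the LLM."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Toy store with 3-dim "embeddings"
store = [
    ("Paper A discusses transformers.", [1.0, 0.0, 0.0]),
    ("Paper B discusses CNNs.",         [0.0, 1.0, 0.0]),
    ("Paper C discusses RL.",           [0.0, 0.0, 1.0]),
]
chunks = retrieve([0.9, 0.1, 0.0], store, top_k=1)
prompt = build_prompt("What does Paper A cover?", chunks)
```

The LLM never sees the whole corpus: only the top-k chunks are stuffed into the prompt, which is what keeps the answers grounded in the ingested papers.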
## Project Structure

```
.
├── index_evaluation/              # Similarity-search techniques benchmarking
│   └── vector_store_interface.py  # Common interface for benchmarking different ANN techniques
├── core/                          # Core components
│   ├── data_loader.py             # PDF loading + chunking
│   ├── embedding_manager.py       # Embedding generation
│   ├── retriever.py               # Context retrieval
│   └── vector_store.py            # ChromaDB integration
│
├── data/                          # Input and storage
│   ├── pdf_files/                 # Source documents
│   └── vector_store/              # Persisted ChromaDB index
│
├── notebooks/
│   └── rag_pipeline.ipynb         # Experiments & benchmarks
│
├── pipelines/
│   └── rag_pipeline.py            # Full RAG pipeline logic
│
├── config.py                      # Global configs
├── main.py                        # Streamlit entry point
├── pyproject.toml                 # uv dependencies
├── requirements.txt               # pip fallback
├── uv.lock                        # uv lock file
├── .gitignore
└── README.md
```
## Roadmap

- Benchmark the retrieval strategies and integrate the best one into the Q&A bot.
## References

- https://www.youtube.com/watch?v=fZM3oX4xEyg&list=PLZoTAELRMXVM8Pf4U67L4UuDRgV4TNX9D
- https://www.singlestore.com/blog/a-guide-to-retrieval-augmented-generation-rag/
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- https://python.langchain.com/docs/introduction/
- https://console.groq.com/docs/