A modular Retrieval-Augmented Generation (RAG) experimentation framework focused on benchmarking lexical, semantic, hybrid, and reranked retrieval strategies using standard Information Retrieval metrics.
This repository is designed as a foundation for building a future Agentic RAG system, starting with rigorous retrieval evaluation.
RAG-Systems-Lab/
- data/ — PDF documents used to build the RAG knowledge base
- assets/ — Evaluation screenshots and retrieval comparisons
- main.ipynb — Retrieval pipeline and benchmarking logic
- requirements.txt — Project dependencies
- README.md — Documentation
The goal of this project is to:
- Compare multiple retrieval strategies
- Evaluate using Recall@1, Recall@5, and MRR
- Analyze ranking weaknesses
- Improve top-1 accuracy with reranking
- Prepare architecture for future agentic extensions
- BM25 — keyword-based lexical ranking
- Vector — dense embedding-based semantic search
- Hybrid — BM25 combined with vector similarity
- Hybrid + Reranker — hybrid retrieval followed by neural reranking
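A minimal sketch of how hybrid fusion can work (illustrative only, not the notebook's exact code): min-max normalize the BM25 and vector scores per query, then blend them with a mixing weight; `alpha` and the chunk ids here are hypothetical.

```python
# Hypothetical hybrid score fusion: normalize each retriever's scores,
# then take a weighted sum over the union of retrieved chunks.

def normalize(scores):
    """Min-max normalize a {chunk_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_scores(bm25, vector, alpha=0.5):
    """Blend normalized lexical and semantic scores (alpha weights BM25)."""
    b, v = normalize(bm25), normalize(vector)
    docs = set(b) | set(v)
    return {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0) for d in docs}

# Toy scores: BM25 and vector search each return their own candidate set.
bm25 = {"chunk_a": 12.0, "chunk_b": 7.5, "chunk_c": 3.0}
vector = {"chunk_b": 0.91, "chunk_c": 0.88, "chunk_d": 0.42}
ranked = sorted(hybrid_scores(bm25, vector).items(), key=lambda x: -x[1])
print(ranked[0][0])  # chunk_b — strong in both retrievers
```

Blending after normalization matters because raw BM25 scores and cosine similarities live on different scales.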
Each query is mapped to a known correct document chunk.
Metrics computed:
- Recall@1 → Is the correct chunk ranked first?
- Recall@5 → Is the correct chunk within top 5?
- MRR → Mean Reciprocal Rank; rewards placing the correct chunk higher in the list
- Example retrieval test showing correct chunk detection within top results (see assets/)
- Comparison of Recall@1, Recall@5, and MRR across retrievers (see assets/)
- Harder query evaluation demonstrating ranking behavior (see assets/)
- Effect of adding reranking on Recall@1 and MRR (see assets/)
- Recall@5 alone is insufficient to judge retrieval quality.
- Vector retrieval significantly improves semantic matching.
- Hybrid search improves coverage but not always top-rank precision.
- Reranking meaningfully improves Recall@1.
- MRR reflects ranking improvements clearly.
Put your PDF files inside the data/ folder:
data/
├── document1.pdf
├── document2.pdf
└── ...
Run:
pip install -r requirements.txt
Open:
main.ipynb
Execute all cells sequentially to:
- Index PDFs
- Create embeddings
- Run BM25 / Vector / Hybrid retrieval
- Evaluate using Recall@1, Recall@5, MRR
- Compare Hybrid vs Hybrid + Reranker
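The Hybrid + Reranker step can be sketched as: retrieve a candidate pool with hybrid search, re-score the top candidates with a stronger relevance model, and sort by the new scores. Here `score_fn` stands in for the neural reranker, and `overlap_score` is a deliberately toy substitute; all names are illustrative.

```python
# Rerank step: take the top_k hybrid candidates, re-score each one against
# the query with a stronger model, and return them best-first.

def rerank(query, candidates, score_fn, top_k=10):
    """Re-score the top_k candidates with score_fn and sort descending."""
    pool = candidates[:top_k]
    return sorted(pool, key=lambda chunk: score_fn(query, chunk), reverse=True)

def overlap_score(query, chunk):
    """Toy relevance model: fraction of query terms appearing in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

candidates = [
    "the cat sat on the mat",
    "vector databases store embeddings",
    "bm25 ranks by term frequency",
]
reranked = rerank("how do vector databases work", candidates, overlap_score)
print(reranked[0])  # vector databases store embeddings
```

In the notebook the toy scorer would be replaced by a neural cross-encoder, which scores the query and chunk jointly and is what lifts Recall@1 over plain hybrid retrieval.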
You will see:
- Retrieved document chunks
- Ranking comparisons
- Metric scores
- Performance differences across retrievers
Planned agentic extensions:
- Query rewriting module
- Multi-hop retrieval
- Tool-based reasoning
- Retriever selection agent
- Self-correcting retrieval loop
This repository is structured to evolve into a fully agentic RAG system.