This project implements and evaluates a Retrieval-Augmented Generation (RAG) pipeline optimized for performance on consumer hardware, supporting multiple local language models and diverse datasets.
The goals of this project are to:
- Establish a functional RAG system using locally executable models.
- Evaluate its performance thoroughly across retrieval quality, generation quality, and latency metrics.
- Deliver a fully reproducible foundation that others can extend with further performance-improvement techniques.
The system uses LangChain to orchestrate a complete RAG pipeline with support for multiple models and datasets (a minimal code sketch of how these pieces fit together appears below):

- Datasets: Supports the ClapNQ, TriviaQA, and HotpotQA datasets from Hugging Face.
- Embedding: Uses `intfloat/e5-base-v2` for document and query embedding.
- Vector Storage: Chroma for efficient local passage indexing and retrieval, with separate indices for each dataset.
- Generation: Multiple models via Ollama, including:
  - `gemma3:1b`, `gemma3:4b`, `gemma3:12b`
  - `llama3.2:1b`, `llama3.2:3b`, `llama3.1:8b`
  - `olmo2:7b`
  - Any other models available through Ollama
- GPU Utilization: Maximizes GPU usage with the `ollama_num_gpu=999` setting to offload as many model layers as possible to the GPU.
- Evaluation: Comprehensive metrics including:
  - Retrieval: NDCG@10, Precision@10, Recall@10 (using `ranx`)
  - Generation: ROUGE scores, Unanswerable Accuracy (using `evaluate`)
  - End-to-end RAG quality: Faithfulness, Answer Relevancy (using `RAGAs` with OpenAI's `gpt-4.1-mini-2025-04-14` API)
  - Performance: Retrieval and generation latency (mean, median, P95, P99)
Note on the evaluation model: While the goal is local execution, running RAGAs evaluation with local models proved prohibitively resource-intensive on consumer hardware, so it currently uses the OpenAI API for faster and more reliable results. The core generation pipeline remains local.
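For orientation, here is a minimal, illustrative sketch of how these pieces fit together with LangChain. The actual wiring lives in `src/` (notably `embedding.py`, `vector_store.py`, `generation.py`, and `pipeline.py`) and may differ in details; the persist directory, prompt, and question below are placeholders.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# Embedding model used for both passages and queries.
embeddings = HuggingFaceEmbeddings(model_name="intfloat/e5-base-v2")

# One persisted Chroma index per dataset (directory name is a placeholder).
vectorstore = Chroma(
    persist_directory="vectorstores/clapnq_corpus",
    embedding_function=embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# num_gpu=999 asks Ollama to offload as many model layers as possible to the GPU.
llm = Ollama(model="gemma3:1b", num_gpu=999)

question = "Who wrote the novel Dune?"
docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in docs)
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```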
- Python 3.10 or higher
- Git
- Ollama (ollama.com) installed and running
- An OpenAI API key (optional, for RAGAs evaluation only)
- Tested on M3 Max with 48GB RAM, but should run (potentially slower) on less powerful machines
- At least 20GB of free disk space (for models, dependencies, and vector stores)
- All datasets are automatically downloaded from Hugging Face (see the loading snippet below):
  - ClapNQ: `PrimeQA/clapnq` and `PrimeQA/clapnq_passages`
  - TriviaQA: `trivia_qa` (config: "rc")
  - HotpotQA: `hotpot_qa` (config: "distractor")
- The original ClapNQ qrels files are included in the repository under `data/retrieval/dev/`
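If you want to inspect the raw data yourself, the datasets can be pulled directly with the `datasets` library; the repository's own loaders in `src/data_loader/` handle this automatically, so this snippet is purely illustrative.

```python
from datasets import load_dataset

clapnq = load_dataset("PrimeQA/clapnq")                    # questions and answers
clapnq_passages = load_dataset("PrimeQA/clapnq_passages")  # retrieval corpus
triviaqa = load_dataset("trivia_qa", "rc")
hotpotqa = load_dataset("hotpot_qa", "distractor")

print(triviaqa)  # inspect available splits and fields
```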
- Clone the repository:

  ```bash
  git clone https://github.com/patek1/LocalRAG.git
  cd LocalRAG
  ```

- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:
  - For development and running:

    ```bash
    pip install -r requirements.txt
    ```

  - For exact reproducibility (after a successful run), a frozen environment is provided in `requirements_frozen.txt`:

    ```bash
    pip install -r requirements_frozen.txt
    ```
- Install and set up Ollama:
  - Download and install Ollama from ollama.com.
  - Pull the model(s) you want to evaluate:

    ```bash
    ollama pull gemma3:1b
    ollama pull gemma3:12b
    ollama pull llama3.1:8b
    # ... or any other Ollama models
    ```

  - Ensure the Ollama application is running, or start the server:

    ```bash
    ollama serve &  # optional, runs in the background
    ```

  - Verify Ollama is running and the models are present:

    ```bash
    ollama list
    ```
- Set up an OpenAI API key (optional, for RAGAs evaluation):
  - Copy the example environment file:

    ```bash
    cp .env.example .env
    ```

  - Edit the `.env` file and add your OpenAI API key:

    ```bash
    OPENAI_API_KEY="sk-your-api-key-here"
    ```

  - Note: Using the OpenAI API incurs costs based on usage. If no API key is provided, the system will gracefully skip RAGAs evaluation.
- Prepare the data:
  - Process and index each dataset you want to evaluate:

    ```bash
    python scripts/create_index.py --dataset ClapNQ
    python scripts/create_index.py --dataset TriviaQA
    python scripts/create_index.py --dataset HotpotQA
    ```

  - This will:
    - Download the datasets from Hugging Face
    - Process them into standardized formats
    - Create the necessary files in `data/processed_corpora/` and `data/processed_questions/`
    - Build vector indices in `vectorstores/` (a minimal sketch of this step follows the list)
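For reference, here is a rough sketch of what the indexing step boils down to, assuming the same LangChain components as above. The real implementation in `src/indexing.py` and `scripts/create_index.py` works on the processed corpora and handles metadata and batch processing; the passages and directory below are placeholders.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="intfloat/e5-base-v2")

# Placeholder passages; the real ones come from data/processed_corpora/.
passages = [
    "Dune is a 1965 science fiction novel by Frank Herbert.",
    "The Eiffel Tower is a wrought-iron tower in Paris, France.",
]

# Embed the passages and persist a dataset-specific Chroma index.
Chroma.from_texts(
    texts=passages,
    embedding=embeddings,
    persist_directory="vectorstores/clapnq_corpus",
)
```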
Key parameters are defined in `src/config.py`:

- `Settings`: Main configuration class with system-wide parameters
- `DatasetConfig`: Dataset-specific configurations (Hugging Face IDs, field mappings, etc.)
- `FieldMapping`: Maps between raw dataset fields and standardized pipeline fields

The most important settings are now accessible via command-line arguments to the scripts.
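As a rough illustration, the configuration objects could be laid out along the following lines. The class names come from `src/config.py`, but every field shown here is hypothetical; only the embedding model, `k`, and `ollama_num_gpu` values are taken from this README.

```python
from dataclasses import dataclass, field

@dataclass
class FieldMapping:
    question_field: str   # raw dataset field holding the question text
    answer_field: str     # raw dataset field holding the gold answer(s)

@dataclass
class DatasetConfig:
    hf_dataset_id: str        # e.g. "PrimeQA/clapnq"
    hf_config: str | None     # e.g. "rc" for TriviaQA
    fields: FieldMapping

@dataclass
class Settings:
    embedding_model: str = "intfloat/e5-base-v2"
    retrieval_k: int = 10
    ollama_num_gpu: int = 999
    datasets: dict[str, DatasetConfig] = field(default_factory=dict)
```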
The LocalRAG pipeline now uses a two-step workflow:

- Index each dataset (once per dataset):

  ```bash
  source .venv/bin/activate
  python scripts/create_index.py --dataset <DATASET_NAME>
  ```

- Run evaluations with different models:

  ```bash
  source .venv/bin/activate
  python src/main.py --model <MODEL_TAG> --dataset <DATASET_NAME> [--limit N]
  ```

Arguments for `scripts/create_index.py`:

- `--dataset <DATASET_NAME>`: (Required) Specifies the dataset to process and index (e.g., "ClapNQ", "TriviaQA", "HotpotQA")
- `--reindex`: (Optional) Forces re-processing and re-indexing of the dataset, even if an index already exists
- `--debug`: (Optional) Limits the dataset size for faster processing during development/testing

Example: create or update the HotpotQA index:

```bash
python scripts/create_index.py --dataset HotpotQA --reindex
```

Arguments for `src/main.py`:

- `--model <MODEL_TAG>`: (Required) Specifies the Ollama model to use (e.g., "gemma3:1b", "gemma3:12b", "llama3.1:8b")
- `--dataset <DATASET_NAME>`: (Required) Specifies the dataset to use (must be indexed first)
- `--limit <N>`: (Optional) Process only N random questions from the dataset. Useful for quick testing and evaluation on a subset.

Example: evaluate gemma3:12b on TriviaQA with 50 questions:

```bash
python src/main.py --model gemma3:12b --dataset TriviaQA --limit 50
```

Notes:

- Ensure that you have indexed the dataset with `create_index.py` before running `main.py`
- The same `--limit N` value with the same dataset will always use the same subset of questions, for reproducibility (a sketch of one way to implement this follows these notes)
- GPU utilization is automatically maximized with the `ollama_num_gpu=999` setting
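One way such reproducible subsetting can be implemented is to draw the question IDs once with a fixed seed and cache them under `data/subsets/`. This is only a sketch; the function name, seed, and file naming are assumptions, and the actual logic in the repository may differ.

```python
import json
import random
from pathlib import Path

def get_subset_ids(dataset_name: str, question_ids: list[str], limit: int,
                   seed: int = 42) -> list[str]:
    """Return a deterministic subset of question IDs for a given dataset and limit."""
    subset_path = Path("data/subsets") / f"{dataset_name}_{limit}.json"
    if subset_path.exists():
        # Reuse the previously drawn subset so repeated runs see the same questions.
        return json.loads(subset_path.read_text())
    rng = random.Random(seed)
    # Sort first so the draw does not depend on the incoming order of IDs.
    subset = rng.sample(sorted(question_ids), k=min(limit, len(question_ids)))
    subset_path.parent.mkdir(parents=True, exist_ok=True)
    subset_path.write_text(json.dumps(subset))
    return subset
```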
- Vector Store: Dataset-specific Chroma indices created in `vectorstores/<dataset_name>_corpus/`.
- Results Files: JSON files saved in the `results/<model_tag_safe>/<dataset_name>/` directory (see the loading snippet below):
  - `<N>_quality_metrics.json`: Retrieval (NDCG@10, P@10, R@10), generation (ROUGE, Unanswerable Accuracy), and RAGAs metrics for a run with N questions.
  - `latency_metrics.json`: Mean, median, P95, and P99 latency for the retrieval and generation steps, along with a `source_subset_size` field indicating the number of questions used to generate these statistics.
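A quick way to inspect a run's output; the paths are illustrative, and `gemma3_12b` is assumed (not confirmed by the repository) to be the filesystem-safe form of the `gemma3:12b` model tag.

```python
import json
from pathlib import Path

run_dir = Path("results/gemma3_12b/TriviaQA")  # adjust to your model tag and dataset
for name in ("50_quality_metrics.json", "latency_metrics.json"):
    metrics = json.loads((run_dir / name).read_text())
    print(name)
    print(json.dumps(metrics, indent=2))
```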
The project follows a modular structure:
- `src/`: Core source code modules:
  - `config.py`: Central configuration management with dataset configs
  - `data_loader/`: Dataset loading and processing for all supported datasets
  - `embedding.py`: Embedding model with MPS acceleration
  - `vector_store.py`: Chroma vector store management
  - `indexing.py`: Corpus indexing pipeline with batch processing
  - `prompting.py`: Prompt formatting for the generator LLM
  - `generation.py`: Ollama LLM initialization with GPU optimization
  - `pipeline.py`: RAG chain orchestration
  - `eval/`: Evaluation metrics modules
  - `utils.py`: Common utilities
  - `main.py`: Main execution script with CLI argument handling
- `scripts/`: Utility scripts:
  - `create_index.py`: Dataset processing and indexing script
- `data/`: Data directory structure:
  - `processed_corpora/`: Standardized corpus files (generated)
  - `processed_questions/`: Standardized question files (generated)
  - `subsets/`: Question ID subsets for reproducible evaluation (generated)
  - Original ClapNQ files used by the ClapNQ processor
- `vectorstores/`: Chroma vector indices for corpus retrieval (generated)
- `results/`: Results organized by model and dataset (generated)
Note: Directories marked as "(generated)" are created during execution and are not included in the repository.
The evaluation pipeline calculates and reports:
- Retrieval Quality (via `ranx`; see the usage sketch after this list):
  - NDCG@10
  - Precision@10
  - Recall@10
- Answer Quality (via `evaluate` and custom logic):
  - ROUGE-1, ROUGE-2, ROUGE-L (F-measure) for answerable questions.
  - Unanswerable Accuracy: Percentage of unanswerable questions correctly identified as "unanswerable".
  - Note: For datasets without explicit unanswerable questions (TriviaQA, HotpotQA), this reports "N/A".
- RAG Quality (via `RAGAs` using OpenAI, when an API key is available):
  - Faithfulness: How factually consistent the generated answer is with the retrieved context.
  - Answer Relevancy: How relevant the generated answer is to the original question.
- Performance:
  - Retrieval latency (mean, median, P95, P99) in seconds.
  - Generation latency (mean, median, P95, P99) in seconds.
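For readers unfamiliar with these libraries, here is a toy sketch of how the retrieval and ROUGE metrics can be computed with `ranx` and `evaluate`. The qrels, run, and texts below are made up; the pipeline builds the real ones from the dataset annotations and model outputs.

```python
import evaluate as hf_evaluate
from ranx import Qrels, Run, evaluate as ranx_evaluate

# Toy relevance judgements and retriever scores for a single question.
qrels = Qrels({"q1": {"doc_a": 1}})
run = Run({"q1": {"doc_a": 0.9, "doc_b": 0.4}})
print(ranx_evaluate(qrels, run, ["ndcg@10", "precision@10", "recall@10"]))

# ROUGE on a toy prediction/reference pair.
rouge = hf_evaluate.load("rouge")
print(rouge.compute(
    predictions=["Frank Herbert wrote Dune."],
    references=["Dune was written by Frank Herbert."],
))
```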
This project provides a robust framework for evaluating various RAG configurations. Future work could include:
- Adding support for additional datasets beyond the current three
- Exploring advanced retrieval techniques
- Adding more sophisticated prompt engineering
- Implementing more comprehensive evaluation metrics
If you have questions or would like to discuss extending this project, feel free to contact me: Mischa Büchel, mischa@quantflow.tech