This project implements and evaluates a Retrieval-Augmented Generation (RAG) pipeline optimized for performance on consumer hardware, supporting multiple local language models and diverse datasets.
The goals of this project are to:
- Establish a functional RAG system using locally executable models.
- Evaluate its performance thoroughly across retrieval quality, generation quality, and latency metrics.
- Deliver a fully reproducible foundation that others can extend with further performance-improvement techniques.
The system uses LangChain to orchestrate a complete RAG pipeline with support for multiple models and datasets (a minimal code sketch of how these pieces fit together appears below):

- Datasets: Supports the ClapNQ, TriviaQA, and HotpotQA datasets from Hugging Face.
- Embedding: Uses `intfloat/e5-base-v2` for document and query embedding.
- Vector Storage: Chroma for efficient local passage indexing and retrieval, with separate indices for each dataset.
- Generation: Multiple models via Ollama, including:
  - `gemma3:1b`, `gemma3:4b`, `gemma3:12b`
  - `llama3.2:1b`, `llama3.2:3b`, `llama3.1:8b`
  - `olmo2:7b`
  - Any other models available through Ollama
- GPU Utilization: Maximizes GPU usage with the `ollama_num_gpu=999` setting to offload as many model layers as possible to the GPU.
- Evaluation: Comprehensive metrics including:
  - Retrieval: NDCG@10, Precision@10, Recall@10 (using `ranx`)
  - Generation: ROUGE scores, Unanswerable Accuracy (using `evaluate`)
  - End-to-end RAG quality: Faithfulness, Answer Relevancy (using `RAGAs` with OpenAI's `gpt-4.1-mini-2025-04-14` API)
  - Performance: Retrieval and generation latency (mean, median, P95, P99)
Note on the evaluation model: While the goal is local execution, running RAGAs evaluation with local models proved prohibitively resource-intensive on consumer hardware, so it currently uses the OpenAI API for faster and more reliable results. The core generation pipeline remains local.
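For orientation, here is a minimal, illustrative sketch of how these pieces fit together with LangChain. The actual wiring lives in `src/` (notably `embedding.py`, `vector_store.py`, `generation.py`, and `pipeline.py`) and may differ in details; the persist directory, prompt, and question below are placeholders.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# Embedding model used for both passages and queries.
embeddings = HuggingFaceEmbeddings(model_name="intfloat/e5-base-v2")

# One persisted Chroma index per dataset (directory name is a placeholder).
vectorstore = Chroma(
    persist_directory="vectorstores/clapnq_corpus",
    embedding_function=embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# num_gpu=999 asks Ollama to offload as many model layers as possible to the GPU.
llm = Ollama(model="gemma3:1b", num_gpu=999)

question = "Who wrote the novel Dune?"
docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in docs)
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```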
- Python 3.10 or higher
- Git
- Ollama (ollama.com) installed and running
- An OpenAI API key (optional, for RAGAs evaluation only)
- Tested on M3 Max with 48GB RAM, but should run (potentially slower) on less powerful machines
- At least 20GB of free disk space (for models, dependencies, and vector stores)
- All datasets are automatically downloaded from Hugging Face (see the loading snippet below):
  - ClapNQ: `PrimeQA/clapnq` and `PrimeQA/clapnq_passages`
  - TriviaQA: `trivia_qa` (config: "rc")
  - HotpotQA: `hotpot_qa` (config: "distractor")
- The original ClapNQ qrels files are included in the repository under `data/retrieval/dev/`
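If you want to inspect the raw data yourself, the datasets can be pulled directly with the `datasets` library; the repository's own loaders in `src/data_loader/` handle this automatically, so this snippet is purely illustrative.

```python
from datasets import load_dataset

clapnq = load_dataset("PrimeQA/clapnq")                    # questions and answers
clapnq_passages = load_dataset("PrimeQA/clapnq_passages")  # retrieval corpus
triviaqa = load_dataset("trivia_qa", "rc")
hotpotqa = load_dataset("hotpot_qa", "distractor")

print(triviaqa)  # inspect available splits and fields
```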
- Clone the repository:

  ```bash
  git clone https://github.com/patek1/LocalRAG.git
  cd LocalRAG
  ```

- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:
  - For development and running:

    ```bash
    pip install -r requirements.txt
    ```

  - For exact reproducibility (after a successful run), a frozen environment is provided in `requirements_frozen.txt`:

    ```bash
    pip install -r requirements_frozen.txt
    ```
- Install and set up Ollama:
  - Download and install Ollama from ollama.com.
  - Pull the model(s) you want to evaluate:

    ```bash
    ollama pull gemma3:1b
    ollama pull gemma3:12b
    ollama pull llama3.1:8b
    # ... or any other Ollama models
    ```

  - Ensure the Ollama application is running, or start the server:

    ```bash
    ollama serve &  # optional, runs in the background
    ```

  - Verify Ollama is running and the models are present:

    ```bash
    ollama list
    ```
- Set up an OpenAI API key (optional, for RAGAs evaluation):
  - Copy the example environment file:

    ```bash
    cp .env.example .env
    ```

  - Edit the `.env` file and add your OpenAI API key:

    ```bash
    OPENAI_API_KEY="sk-your-api-key-here"
    ```

  - Note: Using the OpenAI API incurs costs based on usage. If no API key is provided, the system will gracefully skip RAGAs evaluation.
- Prepare the data:
  - Process and index each dataset you want to evaluate:

    ```bash
    python scripts/create_index.py --dataset ClapNQ
    python scripts/create_index.py --dataset TriviaQA
    python scripts/create_index.py --dataset HotpotQA
    ```

  - This will:
    - Download the datasets from Hugging Face
    - Process them into standardized formats
    - Create the necessary files in `data/processed_corpora/` and `data/processed_questions/`
    - Build vector indices in `vectorstores/` (a minimal sketch of this step follows the list)
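For reference, here is a rough sketch of what the indexing step boils down to, assuming the same LangChain components as above. The real implementation in `src/indexing.py` and `scripts/create_index.py` works on the processed corpora and handles metadata and batch processing; the passages and directory below are placeholders.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="intfloat/e5-base-v2")

# Placeholder passages; the real ones come from data/processed_corpora/.
passages = [
    "Dune is a 1965 science fiction novel by Frank Herbert.",
    "The Eiffel Tower is a wrought-iron tower in Paris, France.",
]

# Embed the passages and persist a dataset-specific Chroma index.
Chroma.from_texts(
    texts=passages,
    embedding=embeddings,
    persist_directory="vectorstores/clapnq_corpus",
)
```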
Key parameters are defined in `src/config.py`:

- `Settings`: Main configuration class with system-wide parameters
- `DatasetConfig`: Dataset-specific configurations (Hugging Face IDs, field mappings, etc.)
- `FieldMapping`: Maps between raw dataset fields and standardized pipeline fields

The most important settings are now accessible via command-line arguments to the scripts.
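As a rough illustration, the configuration objects could be laid out along the following lines. The class names come from `src/config.py`, but every field shown here is hypothetical; only the embedding model, `k`, and `ollama_num_gpu` values are taken from this README.

```python
from dataclasses import dataclass, field

@dataclass
class FieldMapping:
    question_field: str   # raw dataset field holding the question text
    answer_field: str     # raw dataset field holding the gold answer(s)

@dataclass
class DatasetConfig:
    hf_dataset_id: str        # e.g. "PrimeQA/clapnq"
    hf_config: str | None     # e.g. "rc" for TriviaQA
    fields: FieldMapping

@dataclass
class Settings:
    embedding_model: str = "intfloat/e5-base-v2"
    retrieval_k: int = 10
    ollama_num_gpu: int = 999
    datasets: dict[str, DatasetConfig] = field(default_factory=dict)
```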
The LocalRAG pipeline now uses a two-step workflow:

- Index each dataset (once per dataset):

  ```bash
  source .venv/bin/activate
  python scripts/create_index.py --dataset <DATASET_NAME>
  ```

- Run evaluations with different models:

  ```bash
  source .venv/bin/activate
  python src/main.py --model <MODEL_TAG> --dataset <DATASET_NAME> [--limit N]
  ```

Arguments for `scripts/create_index.py`:

- `--dataset <DATASET_NAME>`: (Required) Specifies the dataset to process and index (e.g., "ClapNQ", "TriviaQA", "HotpotQA")
- `--reindex`: (Optional) Forces re-processing and re-indexing of the dataset, even if an index already exists
- `--debug`: (Optional) Limits the dataset size for faster processing during development/testing

Example: create or update the HotpotQA index:

```bash
python scripts/create_index.py --dataset HotpotQA --reindex
```

Arguments for `src/main.py`:

- `--model <MODEL_TAG>`: (Required) Specifies the Ollama model to use (e.g., "gemma3:1b", "gemma3:12b", "llama3.1:8b")
- `--dataset <DATASET_NAME>`: (Required) Specifies the dataset to use (must be indexed first)
- `--limit <N>`: (Optional) Process only N random questions from the dataset. Useful for quick testing and evaluation on a subset.

Example: evaluate gemma3:12b on TriviaQA with 50 questions:

```bash
python src/main.py --model gemma3:12b --dataset TriviaQA --limit 50
```

Notes:

- Ensure that you have indexed the dataset with `create_index.py` before running `main.py`
- The same `--limit N` value with the same dataset will always use the same subset of questions, for reproducibility (a sketch of one way to implement this follows these notes)
- GPU utilization is automatically maximized with the `ollama_num_gpu=999` setting
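One way such reproducible subsetting can be implemented is to draw the question IDs once with a fixed seed and cache them under `data/subsets/`. This is only a sketch; the function name, seed, and file naming are assumptions, and the actual logic in the repository may differ.

```python
import json
import random
from pathlib import Path

def get_subset_ids(dataset_name: str, question_ids: list[str], limit: int,
                   seed: int = 42) -> list[str]:
    """Return a deterministic subset of question IDs for a given dataset and limit."""
    subset_path = Path("data/subsets") / f"{dataset_name}_{limit}.json"
    if subset_path.exists():
        # Reuse the previously drawn subset so repeated runs see the same questions.
        return json.loads(subset_path.read_text())
    rng = random.Random(seed)
    # Sort first so the draw does not depend on the incoming order of IDs.
    subset = rng.sample(sorted(question_ids), k=min(limit, len(question_ids)))
    subset_path.parent.mkdir(parents=True, exist_ok=True)
    subset_path.write_text(json.dumps(subset))
    return subset
```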
- Vector Store: Dataset-specific Chroma indices created in `vectorstores/<dataset_name>_corpus/`.
- Results Files: JSON files saved in the `results/<model_tag_safe>/<dataset_name>/` directory (see the loading snippet below):
  - `<N>_quality_metrics.json`: Retrieval (NDCG@10, P@10, R@10), generation (ROUGE, Unanswerable Accuracy), and RAGAs metrics for a run with N questions.
  - `latency_metrics.json`: Mean, median, P95, and P99 latency for the retrieval and generation steps, along with a `source_subset_size` field indicating the number of questions used to generate these statistics.
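A quick way to inspect a run's output; the paths are illustrative, and `gemma3_12b` is assumed (not confirmed by the repository) to be the filesystem-safe form of the `gemma3:12b` model tag.

```python
import json
from pathlib import Path

run_dir = Path("results/gemma3_12b/TriviaQA")  # adjust to your model tag and dataset
for name in ("50_quality_metrics.json", "latency_metrics.json"):
    metrics = json.loads((run_dir / name).read_text())
    print(name)
    print(json.dumps(metrics, indent=2))
```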
The project follows a modular structure:
- `src/`: Core source code modules:
  - `config.py`: Central configuration management with dataset configs
  - `data_loader/`: Dataset loading and processing for all supported datasets
  - `embedding.py`: Embedding model with MPS acceleration
  - `vector_store.py`: Chroma vector store management
  - `indexing.py`: Corpus indexing pipeline with batch processing
  - `prompting.py`: Prompt formatting for the generator LLM
  - `generation.py`: Ollama LLM initialization with GPU optimization
  - `pipeline.py`: RAG chain orchestration
  - `eval/`: Evaluation metrics modules
  - `utils.py`: Common utilities
  - `main.py`: Main execution script with CLI argument handling
- `scripts/`: Utility scripts:
  - `create_index.py`: Dataset processing and indexing script
- `data/`: Data directory structure:
  - `processed_corpora/`: Standardized corpus files (generated)
  - `processed_questions/`: Standardized question files (generated)
  - `subsets/`: Question ID subsets for reproducible evaluation (generated)
  - Original ClapNQ files used by the ClapNQ processor
- `vectorstores/`: Chroma vector indices for corpus retrieval (generated)
- `results/`: Results organized by model and dataset (generated)
Note: Directories marked as "(generated)" are created during execution and are not included in the repository.
The evaluation pipeline calculates and reports:
- Retrieval Quality (via `ranx`; see the usage sketch after this list):
  - NDCG@10
  - Precision@10
  - Recall@10
- Answer Quality (via `evaluate` and custom logic):
  - ROUGE-1, ROUGE-2, ROUGE-L (F-measure) for answerable questions.
  - Unanswerable Accuracy: Percentage of unanswerable questions correctly identified as "unanswerable".
  - Note: For datasets without explicit unanswerable questions (TriviaQA, HotpotQA), this reports "N/A".
- RAG Quality (via `RAGAs` using OpenAI, when an API key is available):
  - Faithfulness: How factually consistent the generated answer is with the retrieved context.
  - Answer Relevancy: How relevant the generated answer is to the original question.
- Performance:
  - Retrieval latency (mean, median, P95, P99) in seconds.
  - Generation latency (mean, median, P95, P99) in seconds.
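For readers unfamiliar with these libraries, here is a toy sketch of how the retrieval and ROUGE metrics can be computed with `ranx` and `evaluate`. The qrels, run, and texts below are made up; the pipeline builds the real ones from the dataset annotations and model outputs.

```python
import evaluate as hf_evaluate
from ranx import Qrels, Run, evaluate as ranx_evaluate

# Toy relevance judgements and retriever scores for a single question.
qrels = Qrels({"q1": {"doc_a": 1}})
run = Run({"q1": {"doc_a": 0.9, "doc_b": 0.4}})
print(ranx_evaluate(qrels, run, ["ndcg@10", "precision@10", "recall@10"]))

# ROUGE on a toy prediction/reference pair.
rouge = hf_evaluate.load("rouge")
print(rouge.compute(
    predictions=["Frank Herbert wrote Dune."],
    references=["Dune was written by Frank Herbert."],
))
```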
This project provides a robust framework for evaluating various RAG configurations. Future work could include:
- Adding support for additional datasets beyond the current three
- Exploring advanced retrieval techniques
- Adding more sophisticated prompt engineering
- Implementing more comprehensive evaluation metrics
If you have questions or would like to discuss extending this project, feel free to contact me: Mischa Büchel, mischa@quantflow.tech