This repository implements several Retrieval-Augmented Generation (RAG) pipelines on diverse question answering datasets using the DSPy framework. The prompts and few-shot examples in the DSPy modules are optimized with the MIPROv2, COPRO, and BootstrapFewShot optimizers, guided by DeepEval metrics.
The RAG pipelines are built using:
- DSPy for modular pipeline design and optimization.
- Weaviate vector database for hybrid search and retrieval.
- DeepEval for comprehensive evaluation metrics.
- Confident AI for logging metrics during optimization.
Each pipeline is configured through YAML files that allow for flexible customization of language models, embedding models, and optimizer hyperparameters.
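For orientation, a config along these lines might look as follows. Every key and value below is illustrative, not the repository's actual schema; consult the real `*_config.yml` files:

```yaml
# Hypothetical config excerpt; the repository's key names may differ.
language_model:
  provider: groq
  name: llama-3.3-70b-versatile   # assumed model choice
  temperature: 0.0
embedding_model:
  name: all-MiniLM-L6-v2          # assumed SentenceTransformer model
optimizer:
  name: MIPROv2
  num_trials: 20                  # assumed hyperparameter
metrics:
  - answer_relevancy
  - faithfulness
```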
The project includes implementations for several question answering datasets:
- FreshQA (SealQA): A dynamic QA benchmark covering a diverse range of question and answer types, including questions that require world knowledge and questions with false premises that need to be debunked. SealQA builds on FreshQA with a stronger focus on reasoning.
- HotpotQA: A question answering dataset featuring natural, multi-hop questions with strong supervision for supporting facts.
- PubMedQA: A biomedical question answering dataset based on PubMed abstracts.
- TriviaQA: A reading comprehension dataset of question-answer-evidence triples, with question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents that provide high-quality distant supervision for answering the questions.
- Wikipedia: A large-scale dataset of cleaned articles from all language editions of Wikipedia, sourced from the official Wikipedia dumps.
Each pipeline follows a consistent architecture with the following components:
- Query Rewriting: The initial question is passed to the `QueryRewriter` to generate a search-optimized query by expanding it with synonyms, clarifying ambiguous terms, and removing conversational noise.
- Sub-Query Generation: The rewritten query is then passed to the `SubQueryGenerator`, which decomposes it into multiple, more specific sub-queries. This breaks multi-faceted questions down into smaller, self-contained queries that can be executed in parallel, improving retrieval coverage.
- Metadata Extraction: The `MetadataExtractor` uses an LLM to parse both the rewritten query and each sub-query, extracting structured metadata based on a predefined JSON schema. This metadata can then be used for filtering in the retriever to improve retrieval precision.
- Document Retrieval: The `WeaviateRetriever` is called for the main query and each sub-query, using the extracted metadata for filtering. It performs hybrid search, combining vector search with keyword-based filtering, and the results are aggregated into a single list of passages.
- Answer Generation: The unique retrieved passages are fed into a `dspy.ChainOfThought` module to generate a final answer and the reasoning behind it.
- Optimization: DSPy optimizers (MIPROv2, COPRO, BootstrapFewShot) automatically tune prompts and select few-shot examples by exploring the space of possible configurations and evaluating them with DeepEval metrics.
- Logging: Confident AI logs metrics during optimization.
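To make the data flow concrete, here is a minimal, hypothetical sketch of how these components could be composed as a DSPy module; the signatures and field names are illustrative, not the repository's exact interfaces:

```python
import dspy

class RAGPipeline(dspy.Module):
    """Hypothetical composition of the components described above."""

    def __init__(self, retriever):
        super().__init__()
        self.rewrite = dspy.Predict("question -> rewritten_query")
        self.decompose = dspy.Predict("rewritten_query -> sub_queries: list[str]")
        self.retriever = retriever  # callable: query -> list of passage strings
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        rewritten = self.rewrite(question=question).rewritten_query
        sub_queries = self.decompose(rewritten_query=rewritten).sub_queries
        # Retrieve for the main query and every sub-query, then deduplicate
        # while preserving order.
        passages = []
        for q in [rewritten, *sub_queries]:
            passages.extend(self.retriever(q))
        context = "\n\n".join(dict.fromkeys(passages))
        return self.answer(context=context, question=question)
```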
The project uses uv for dependency management. First, ensure uv is installed:

```bash
# Install uv (if not already installed)
pip install uv
```

Then install the project dependencies:

```bash
# Install dependencies with all extras and dev dependencies
uv sync --all-extras --dev

# Activate the virtual environment
source .venv/bin/activate
```

Create a `.env` file in the project root with the required environment variables:
```
WEAVIATE_URL=your_weaviate_cluster_url
WEAVIATE_API_KEY=your_weaviate_api_key
GROQ_API_KEY=your_groq_api_key
```

For tracing of evaluation runs, create a `.env.local` file in the project root and add your Confident AI API key:

```
API_KEY=CONFIDENT_API_KEY
```

Each dataset module includes an indexing script to process and store documents in the vector database. The indexing process:
- Loads the dataset from Hugging Face.
- Extracts metadata from each document with an LLM, following the metadata schema defined in the config file.
- Generates vector embeddings using a SentenceTransformer model.
- Stores documents, embeddings, and metadata in Weaviate.
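For orientation, the embed-and-store step of such a script could look roughly like this with the Weaviate v4 client and SentenceTransformers; the dataset slice, collection name, and property names are illustrative assumptions:

```python
import os
import weaviate
from weaviate.classes.init import Auth
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

# Illustrative dataset slice and collection name; the real scripts read
# these from the indexing config and extract metadata with an LLM first.
dataset = load_dataset("trivia_qa", "rc", split="train[:100]")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
)
collection = client.collections.get("Documents")  # hypothetical collection

for row in dataset:
    text = row["question"]  # real scripts index document passages, not questions
    collection.data.insert(
        properties={"text": text},               # plus extracted metadata fields
        vector=encoder.encode(text).tolist(),
    )
client.close()
```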
Example for FreshQA:
```bash
cd src/dspy_opt/freshqa
python freshqa_indexing.py
```

Each dataset module includes an evaluation script to test pipeline performance. The evaluation script:
- Loads the pipeline from the saved state.
- Runs predictions on the test dataset.
- Evaluates using DeepEval metrics configured in the YAML file.
- Reports aggregated scores and individual metric results.
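Conceptually, the evaluation loop pairs each prediction with a DeepEval `LLMTestCase` and aggregates metric scores, roughly as below. Metric choices and the `passages` attribute are illustrative, and DeepEval's metrics use a judge LLM configured separately:

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def evaluate(program, test_set):
    """Score a compiled DSPy program on held-out examples with DeepEval metrics."""
    metrics = [AnswerRelevancyMetric(), FaithfulnessMetric()]
    per_metric = {type(m).__name__: [] for m in metrics}
    for example in test_set:  # dspy.Example objects with .question and .answer
        pred = program(question=example.question)
        case = LLMTestCase(
            input=example.question,
            actual_output=pred.answer,
            expected_output=example.answer,
            retrieval_context=list(getattr(pred, "passages", [])),
        )
        for m in metrics:
            m.measure(case)
            per_metric[type(m).__name__].append(m.score)
    # Report the per-metric means alongside individual results as needed.
    return {name: sum(s) / len(s) for name, s in per_metric.items()}
```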
Example for FreshQA:
```bash
cd src/dspy_opt/freshqa
python freshqa_rag_evaluation.py
```

Each dataset module includes optimization scripts for different DSPy optimizers. The optimization process:
- Loads the configuration from the YAML file (e.g., `freshqa_rag_mipro_config.yml`).
- Initializes all DSPy modules (`QueryRewriter`, `SubQueryGenerator`, `MetadataExtractor`, `WeaviateRetriever`).
- Loads the training and evaluation datasets.
- Runs the optimizer to compile the pipeline with optimized prompts and few-shot examples.
- Evaluates the optimized pipeline using DeepEval metrics.
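In DSPy terms, the compile step boils down to something like the following sketch; the LM identifier, stand-in program, toy metric, and file name are all assumptions for illustration:

```python
import dspy
from dspy.teleprompt import MIPROv2

# Assumed LM; the real scripts read provider/model from the YAML config.
dspy.configure(lm=dspy.LM("groq/llama-3.3-70b-versatile"))

# Stand-in program and tiny trainset; the real scripts build the full RAG
# pipeline and load proper train/validation splits.
program = dspy.ChainOfThought("question -> answer")
trainset = [
    dspy.Example(question="Who wrote Dune?", answer="Frank Herbert").with_inputs("question"),
]

def metric(example, prediction, trace=None):
    # Toy metric; the repo plugs DeepEval metrics in here instead.
    return float(example.answer.lower() in prediction.answer.lower())

optimizer = MIPROv2(metric=metric, auto="light")
optimized = optimizer.compile(program, trainset=trainset)
optimized.save("freshqa_rag_mipro_optimized.json")
```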
Example for the FreshQA RAG pipeline optimized with MIPROv2:

```bash
cd src/dspy_opt/freshqa
python freshqa_rag_mipro.py
```

The `QueryRewriter` optimizes user queries for better retrieval performance.
- Rewrites queries to be more effective for search engines.
- Expands queries with relevant synonyms and concepts.
- Clarifies ambiguous terms and removes conversational noise.
- Maintains conciseness while preserving key entities and constraints.
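A minimal DSPy signature capturing this behavior might look like this (field names and docstring are illustrative):

```python
import dspy

class RewriteQuery(dspy.Signature):
    """Rewrite a user question into a concise, search-optimized query:
    expand with relevant synonyms, clarify ambiguous terms, and drop
    conversational noise while keeping key entities and constraints."""

    question: str = dspy.InputField(desc="original user question")
    rewritten_query: str = dspy.OutputField(desc="search-optimized query")

rewriter = dspy.Predict(RewriteQuery)
# e.g. rewriter(question="hey, who's the guy that founded SpaceX?")
```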
The `SubQueryGenerator` decomposes complex user queries into simpler, more focused sub-queries.
- Breaks down multi-faceted questions into smaller queries.
- Each sub-query addresses a distinct aspect of the original query.
- Sub-queries are self-contained for parallel search execution.
- Improves retrieval coverage for complex information needs.
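Sketched the same way, with an illustrative typed output field:

```python
import dspy

class GenerateSubQueries(dspy.Signature):
    """Decompose a complex query into self-contained sub-queries,
    each covering one distinct aspect of the information need."""

    query: str = dspy.InputField()
    sub_queries: list[str] = dspy.OutputField(desc="independent, parallel-executable sub-queries")

decomposer = dspy.Predict(GenerateSubQueries)
# e.g. decomposer(query="compare the GDP and population growth of Japan and Germany since 2000")
```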
The `MetadataExtractor` extracts structured metadata from text using a language model and a user-specified JSON schema.
- Uses LLMs with structured-output generation for metadata extraction.
- Dynamically converts JSON schema into validation structures.
- Only includes successfully extracted (non-null) fields in results.
- Extracted metadata is used for filtering during retrieval.
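The idea can be approximated as below; unlike this toy version, the real module dynamically converts the JSON schema into validation structures:

```python
import json
import dspy

extract = dspy.Predict("text, json_schema -> metadata_json")

def extract_metadata(text: str, schema: dict) -> dict:
    """Toy extraction: prompt the LM with a JSON schema, parse its JSON
    output, and keep only successfully extracted (non-null) fields."""
    result = extract(text=text, json_schema=json.dumps(schema))
    try:
        metadata = json.loads(result.metadata_json)
    except json.JSONDecodeError:
        return {}
    return {k: v for k, v in metadata.items() if v is not None}

# An illustrative schema like those defined in the indexing configs:
schema = {
    "type": "object",
    "properties": {"year": {"type": "integer"}, "topic": {"type": "string"}},
}
```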
The `WeaviateRetriever` connects to a Weaviate vector database for document retrieval.
- Performs hybrid search combining vector search with keyword-based filtering.
- Filters results based on extracted metadata.
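With the Weaviate v4 Python client, the core retrieval call looks roughly like this; the collection name, query text, and filter property are illustrative:

```python
import os
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import Filter

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
)
collection = client.collections.get("Documents")  # hypothetical collection

# Hybrid search: alpha blends vector similarity (1.0) with keyword/BM25
# scoring (0.0); the filter applies metadata extracted upstream.
response = collection.query.hybrid(
    query="effects of intermittent fasting on blood pressure",
    alpha=0.5,
    limit=5,
    filters=Filter.by_property("topic").equal("cardiology"),
)
passages = [obj.properties["text"] for obj in response.objects]
client.close()
```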
The Metrics module integrates DeepEval evaluation metrics into the DSPy optimization framework.
- Creates metric functions compatible with DSPy optimizers.
- Evaluates pipeline performance using multiple metrics:
  - Answer Relevancy: Measures how relevant the answer is to the question.
  - Faithfulness: Ensures the answer is grounded in the retrieved context.
  - Contextual Precision: Evaluates precision of retrieved context.
  - Contextual Recall: Measures recall of retrieved context.
  - Contextual Relevancy: Assesses overall relevance of retrieved passages.
- Aggregates scores across metrics for optimization objectives.
- Supports async evaluation with configurable throttling.
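A stripped-down version of such a bridge, wrapping DeepEval metrics into the `(example, prediction, trace)` callable that DSPy optimizers expect (attribute names on the example and prediction are assumptions):

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def make_dspy_metric(deepeval_metrics):
    """Wrap DeepEval metrics into the callable DSPy optimizers expect,
    returning the mean score across all metrics for one example."""
    def metric(example, prediction, trace=None):
        case = LLMTestCase(
            input=example.question,
            actual_output=prediction.answer,
            expected_output=example.answer,
            retrieval_context=list(getattr(prediction, "passages", [])),
        )
        scores = []
        for m in deepeval_metrics:
            m.measure(case)
            scores.append(m.score)
        return sum(scores) / len(scores)
    return metric

rag_metric = make_dspy_metric([AnswerRelevancyMetric(), FaithfulnessMetric()])
```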
```
src/dspy_opt/
├── utils/ # Shared reusable components
│ ├── query_rewriter.py # Query optimization module
│ ├── sub_query_generator.py # Multi-query decomposition
│ ├── metadata_extractor.py # Structured metadata extraction
│ ├── weaviate_retriever.py # Hybrid Search retriever
│ └── metrics.py # DeepEval Metrics Integration
│
├── freshqa/ # FreshQA dataset pipelines
│ ├── freshqa_indexing.py # Index documents to Weaviate
│ ├── freshqa_indexing_config.yml
│ ├── freshqa_rag_module.py # Complete RAG pipeline definition
│ ├── freshqa_rag_mipro.py # MIPRO Optimization
│ ├── freshqa_rag_mipro_config.yml
│ ├── freshqa_rag_copro.py # COPRO Optimization
│ ├── freshqa_rag_copro_config.yml
│ ├── freshqa_rag_bootstrap_few_shot.py
│ ├── freshqa_rag_bootstrap_few_shot_config.yml
│ └── freshqa_rag_evaluation.py # Evaluate optimized pipeline
│
├── hotpotqa/ # HotpotQA dataset pipelines
│ └── ... (similar structure)
│
├── triviaqa/ # TriviaQA dataset pipelines
│ └── ... (similar structure)
│
├── pubmedqa/ # PubMedQA dataset pipelines
│ └── ... (similar structure)
│
└── wikipedia/ # Wikipedia dataset pipelines
    └── ... (similar structure)
```
Please see the CONTRIBUTING.md file for detailed contribution guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
