
feat: Addition of Risk-Control Metrics for Trustworthy RAG Evaluation #2279

@AlanPonnachan

Description


Describe the Feature

This proposal introduces a cohesive suite of four interconnected, academically grounded metrics designed to evaluate a critical dimension currently missing from ragas: a RAG system's ability to control risk by knowing when to refuse to answer.

Instead of only evaluating the quality of a generated answer, these metrics evaluate the quality of the RAG system's preceding meta-decision: whether to answer a question ("keep") or to proactively abstain ("discard") when confidence is low or the context is insufficient.

The four proposed metrics are:

  • Risk: Measures the probability that an answer the system chose to provide is actually a "risky" one, i.e., an answer to a question that is unanswerable from the retrieved context. A lower score is better; it directly quantifies the system's safety.
  • Carefulness: Measures the system's ability to correctly identify and discard questions that are unanswerable from the context. This is effectively the recall of the "abstain" decision. A higher score is better.
  • Alignment: Measures the overall accuracy of the keep/discard decision-making process, providing a holistic view of the system's judgment.
  • Coverage: Measures the proportion of questions the system attempts to answer, quantifying its "helpfulness" or utility.

Together, these metrics provide developers with a "dashboard" to understand and tune the fundamental trade-off between system safety (low Risk) and helpfulness (high Coverage). A minimal sketch of how they could be computed from per-sample keep/discard decisions is shown below.
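To make the definitions concrete, here is a minimal sketch of how the four scores could be computed from per-sample answerability labels and keep/discard decisions. This is not an existing ragas API; the `Sample` representation and the `risk_control_scores` helper are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One evaluation sample for the keep/discard decision."""
    answerable: bool  # ground truth: the retrieved context supports an answer
    kept: bool        # system decision: True = answered, False = abstained


def risk_control_scores(samples: list[Sample]) -> dict[str, float]:
    """Compute Risk, Carefulness, Alignment, and Coverage for a batch.

    Hypothetical helper (not part of ragas), shown only to make the four
    definitions above concrete.
    """
    n = len(samples)
    kept = [s for s in samples if s.kept]
    unanswerable = [s for s in samples if not s.answerable]

    # Risk: fraction of answered questions that were actually unanswerable (lower is better).
    risk = sum(not s.answerable for s in kept) / len(kept) if kept else 0.0

    # Carefulness: recall of the "discard" decision on unanswerable questions (higher is better).
    carefulness = (
        sum(not s.kept for s in unanswerable) / len(unanswerable)
        if unanswerable
        else 1.0
    )

    # Alignment: overall accuracy of the keep/discard decision.
    alignment = sum(s.kept == s.answerable for s in samples) / n

    # Coverage: fraction of questions the system attempted to answer.
    coverage = len(kept) / n

    return {
        "risk": risk,
        "carefulness": carefulness,
        "alignment": alignment,
        "coverage": coverage,
    }


if __name__ == "__main__":
    # Toy batch: 3 answerable questions (2 kept, 1 wrongly discarded) and
    # 2 unanswerable questions (1 correctly discarded, 1 risky answer).
    batch = [
        Sample(answerable=True, kept=True),
        Sample(answerable=True, kept=True),
        Sample(answerable=True, kept=False),
        Sample(answerable=False, kept=False),
        Sample(answerable=False, kept=True),
    ]
    print(risk_control_scores(batch))
    # -> risk 0.33, carefulness 0.50, alignment 0.60, coverage 0.60
```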

Why is the feature important for you?

The current ragas library is excellent for the post-hoc evaluation of RAG answer quality. However, as RAG systems move from research prototypes to production-grade, enterprise, and safety-critical applications, the most dangerous failure mode is a confidently incorrect answer (hallucination). A system that safely says "I don't know" is far more valuable and trustworthy than one that guesses.

This feature is critical because it addresses this "evaluation gap" by:

  1. Shifting the Paradigm: It moves from a purely "best-effort" evaluation model (judging the quality of an attempt) to a "safety-first" model that also judges the wisdom of making the attempt at all.
  2. Enabling Trustworthy AI: It gives developers the tools to measure and optimize for reliability. One can now set concrete goals such as "Maintain Coverage above 80% while ensuring Risk remains below 5%" (a simple check of this kind is sketched after this list).
  3. Aligning with Real-World Needs: In domains like finance, legal, and medical tech, preventing the propagation of misinformation is paramount. These metrics make ragas an even more indispensable tool for developers in these high-stakes fields.
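
For instance, such a target could be enforced as a simple gate in an evaluation script. This snippet reuses the hypothetical risk_control_scores() helper sketched earlier, and the 80% / 5% thresholds are just the illustrative figures from the goal above.

```python
# Example quality gate over the batch evaluated earlier (hypothetical helper).
scores = risk_control_scores(batch)

assert scores["coverage"] >= 0.80, f"Coverage too low: {scores['coverage']:.2%}"
assert scores["risk"] <= 0.05, f"Risk too high: {scores['risk']:.2%}"
```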

Additional context

Academic Grounding
These metrics are directly derived from the peer-reviewed research paper "Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework" by Chen et al., presented at EMNLP 2024 Findings. This provides a rigorous, scientifically validated foundation for their inclusion in ragas.

Labels

enhancement (New feature or request), module-metrics (this is part of the metrics module)
