
feat: Addition of Risk-Control Metrics for Trustworthy RAG Evaluation #2279

@AlanPonnachan

Description


Describe the Feature

This proposal introduces a cohesive suite of four interconnected, academically grounded metrics designed to evaluate a critical dimension currently missing from ragas: a RAG system's ability to control risk by knowing when to refuse to answer.

Instead of only evaluating the quality of a generated answer, these metrics evaluate the quality of the RAG system's preceding meta-decision: whether to answer a question ("keep") or to proactively abstain ("discard") when confidence is low or the context is insufficient.

The four proposed metrics are:

  • Risk: Measures the probability that an answer the system chose to provide is actually a "risky" one, i.e., an answer to a question that is unanswerable from the retrieved context. A lower score is better; it directly quantifies the system's safety.
  • Carefulness: Measures the system's ability to correctly identify and discard questions that are unanswerable from the context. This is effectively the recall of the "abstain" decision. A higher score is better.
  • Alignment: Measures the overall accuracy of the keep/discard decision-making process, providing a holistic view of the system's judgment.
  • Coverage: Measures the proportion of questions the system attempts to answer, quantifying its "helpfulness" or utility.

Together, these metrics provide developers with a "dashboard" to understand and tune the fundamental trade-off between system safety (low Risk) and helpfulness (high Coverage). A minimal sketch of how they could be computed from per-sample keep/discard decisions is shown below.
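To make the definitions concrete, here is a minimal sketch of how the four scores could be computed from per-sample answerability labels and keep/discard decisions. This is not an existing ragas API; the `Sample` representation and the `risk_control_scores` helper are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One evaluation sample for the keep/discard decision."""
    answerable: bool  # ground truth: the retrieved context supports an answer
    kept: bool        # system decision: True = answered, False = abstained


def risk_control_scores(samples: list[Sample]) -> dict[str, float]:
    """Compute Risk, Carefulness, Alignment, and Coverage for a batch.

    Hypothetical helper (not part of ragas), shown only to make the four
    definitions above concrete.
    """
    n = len(samples)
    kept = [s for s in samples if s.kept]
    unanswerable = [s for s in samples if not s.answerable]

    # Risk: fraction of answered questions that were actually unanswerable (lower is better).
    risk = sum(not s.answerable for s in kept) / len(kept) if kept else 0.0

    # Carefulness: recall of the "discard" decision on unanswerable questions (higher is better).
    carefulness = (
        sum(not s.kept for s in unanswerable) / len(unanswerable)
        if unanswerable
        else 1.0
    )

    # Alignment: overall accuracy of the keep/discard decision.
    alignment = sum(s.kept == s.answerable for s in samples) / n

    # Coverage: fraction of questions the system attempted to answer.
    coverage = len(kept) / n

    return {
        "risk": risk,
        "carefulness": carefulness,
        "alignment": alignment,
        "coverage": coverage,
    }


if __name__ == "__main__":
    # Toy batch: 3 answerable questions (2 kept, 1 wrongly discarded) and
    # 2 unanswerable questions (1 correctly discarded, 1 risky answer).
    batch = [
        Sample(answerable=True, kept=True),
        Sample(answerable=True, kept=True),
        Sample(answerable=True, kept=False),
        Sample(answerable=False, kept=False),
        Sample(answerable=False, kept=True),
    ]
    print(risk_control_scores(batch))
    # -> risk 0.33, carefulness 0.50, alignment 0.60, coverage 0.60
```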

Why is the feature important for you?

The current ragas library is excellent for the post-hoc evaluation of RAG answer quality. However, as RAG systems move from research prototypes to production-grade, enterprise, and safety-critical applications, the most dangerous failure mode is a confidently incorrect answer (hallucination). A system that safely says "I don't know" is far more valuable and trustworthy than one that guesses.

This feature is critical because it addresses this "evaluation gap" by:

  1. Shifting the Paradigm: It moves from a purely "best-effort" evaluation model (judging the quality of an attempt) to a "safety-first" model that also judges the wisdom of making the attempt at all.
  2. Enabling Trustworthy AI: It gives developers the tools to measure and optimize for reliability. One can now set concrete goals such as "Maintain Coverage above 80% while ensuring Risk remains below 5%" (a simple check of this kind is sketched after this list).
  3. Aligning with Real-World Needs: In domains like finance, legal, and medical tech, preventing the propagation of misinformation is paramount. These metrics make ragas an even more indispensable tool for developers in these high-stakes fields.
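
For instance, such a target could be enforced as a simple gate in an evaluation script. This snippet reuses the hypothetical risk_control_scores() helper sketched earlier, and the 80% / 5% thresholds are just the illustrative figures from the goal above.

```python
# Example quality gate over the batch evaluated earlier (hypothetical helper).
scores = risk_control_scores(batch)

assert scores["coverage"] >= 0.80, f"Coverage too low: {scores['coverage']:.2%}"
assert scores["risk"] <= 0.05, f"Risk too high: {scores['risk']:.2%}"
```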

Additional context

Academic Grounding
These metrics are directly derived from the peer-reviewed research paper "Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework" by Chen et al., presented at EMNLP 2024 Findings. This provides a rigorous, scientifically validated foundation for their inclusion in ragas.

Labels

enhancement (New feature or request), module-metrics (this is part of the metrics module)
