Beyond RAGAS: Claim-Based Evaluation Strategies for RAG Pipelines #41
Replies: 1 comment
Whoa, this is actually a really sharp breakdown. I've been neck-deep building evaluation logic too, mostly because I got tired of vague metrics that say "looks ok?" but fail silently on logic collapse. Anyway, I won't derail with my own stuff, but I've been compiling a giant problem map of RAG failures and weird edge cases (pointer drift, incomplete logic trails, fragment mismatches, etc.).
Context
The current RAG pipeline uses RAGAS for automatic evaluation of retrieval-augmented responses. However, RAGAS focuses mainly on metrics such as faithfulness, answer relevance, and context precision, and does not explicitly model the complexity of multi-claim questions or the structured decomposition of answers.
In real-world research and scientific Q&A, queries are often multi-claim or multi-hop, bundling several factual sub-questions into a single request, so a single aggregate score hides which parts of the answer are actually supported.
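For reference, here is a minimal sketch of what the current RAGAS-style evaluation looks like, assuming the ragas 0.1-style `evaluate` API, the `datasets` library, and a configured LLM/embedding backend; the sample record is a placeholder, not data from the pipeline:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One hand-written placeholder record; a real run would use the pipeline's
# questions, retrieved contexts, generated answers, and references.
samples = {
    "question": ["When and where was Newton born?"],
    "answer": ["Isaac Newton was born on 4 January 1643 in Woolsthorpe, England."],
    "contexts": [[
        "Isaac Newton was born on 4 January 1643 in Woolsthorpe-by-Colsterworth, England."
    ]],
    "ground_truth": ["Newton was born on 4 January 1643 in Woolsthorpe, England."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # one aggregate score per metric, no per-claim breakdown
```

The returned scores are aggregates over the whole answer, which is exactly the gap the claim-based ideas below try to close.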
Brainstorming Points
1. Claim Decomposition
"When and where was Newton born?" ⟶ ["When was Newton born?", "Where was Newton born?"]
2. Claim-Level Evaluation
3. Automatic Claim Extraction from Answers
4. Metrics
5. Dataset Construction
6. Complex Questions
7. Automation
8. Visualization/Reporting
9. Comparison with RAGAS
Questions for the Community
This issue is intended for brainstorming and community input on advancing RAG evaluation beyond existing RAGAS metrics, with a focus on claim-based and multi-hop QA evaluation.