Beyond RAGAS: Claim-Based Evaluation Strategies for RAG Pipelines #41
Replies: 1 comment
Whoa, this is actually a really sharp breakdown. I've been neck-deep building evaluation logic too, mostly because I got tired of vague metrics that say "looks ok?" but fail silently on logic collapse. Anyway, I won't derail with my own stuff, but I've been compiling a giant problem map of RAG failures and weird edge cases (pointer drift, incomplete logic trails, fragment mismatches, etc.).
Context
The current RAG pipeline uses RAGAS for automatic evaluation of retrieval-augmented responses. However, RAGAS focuses mainly on metrics such as faithfulness, answer relevance, and context precision, and does not explicitly model the complexity of multi-claim questions or the structured decomposition of answers.
In real-world research and scientific Q&A, queries are often multi-claim or multi-hop, bundling several factual sub-questions into a single request, so a single aggregate score hides which parts of the answer are actually supported.
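For reference, here is a minimal sketch of what the current RAGAS-style evaluation looks like, assuming the ragas 0.1-style `evaluate` API, the `datasets` library, and a configured LLM/embedding backend; the sample record is a placeholder, not data from the pipeline:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One hand-written placeholder record; a real run would use the pipeline's
# questions, retrieved contexts, generated answers, and references.
samples = {
    "question": ["When and where was Newton born?"],
    "answer": ["Isaac Newton was born on 4 January 1643 in Woolsthorpe, England."],
    "contexts": [[
        "Isaac Newton was born on 4 January 1643 in Woolsthorpe-by-Colsterworth, England."
    ]],
    "ground_truth": ["Newton was born on 4 January 1643 in Woolsthorpe, England."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # one aggregate score per metric, no per-claim breakdown
```

The returned scores are aggregates over the whole answer, which is exactly the gap the claim-based ideas below try to close.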
Brainstorming Points
1. Claim Decomposition
"When and where was Newton born?" ⟶ ["When was Newton born?", "Where was Newton born?"]
2. Claim-Level Evaluation
3. Automatic Claim Extraction from Answers
4. Metrics
5. Dataset Construction
6. Complex Questions
7. Automation
8. Visualization/Reporting
9. Comparison with RAGAS
Questions for the Community
This issue is intended for brainstorming and community input on advancing RAG evaluation beyond existing RAGAS metrics, with a focus on claim-based and multi-hop QA evaluation.