Question
How do we evaluate RAG systems in a way that points to the real failure layer?
Hypothesis
End-to-end answer quality metrics arrive too late in the pipeline to show which layer failed. Retrieval coverage, citation quality, and contradiction detection each need their own measurement.
Method
Build small benchmark sets with known answer spans, distractors, stale documents, and permission-sensitive content.
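One possible shape for such a benchmark set is sketched below. Every name here (Doc, BenchmarkCase, validate_case, the stale flag, allowed_roles) is an assumption made for illustration, not an existing schema. The validator only covers cases that are meant to be answerable. Cases that deliberately test a permission refusal would need a separate expectation field.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Doc:
    """A corpus document (hypothetical schema)."""
    doc_id: str
    text: str
    stale: bool = False  # superseded by a newer document
    allowed_roles: frozenset = frozenset({"public"})


@dataclass(frozen=True)
class BenchmarkCase:
    """One question with a known answer span in a known document."""
    question: str
    answer_span: str
    gold_doc_id: str
    distractor_ids: Tuple[str, ...] = ()
    user_role: str = "public"


def validate_case(case: BenchmarkCase, corpus: Dict[str, Doc]) -> List[str]:
    """Return a list of problems; an empty list means the case is well-formed."""
    gold = corpus.get(case.gold_doc_id)
    if gold is None:
        return ["gold document missing from corpus"]
    problems = []
    if case.answer_span not in gold.text:
        problems.append("answer span not found in gold document")
    if gold.stale:
        problems.append("gold document is marked stale")
    if case.user_role not in gold.allowed_roles:
        problems.append("user role cannot read gold document")
    for doc_id in case.distractor_ids:
        if doc_id not in corpus:
            problems.append(f"distractor {doc_id} missing from corpus")
        elif case.answer_span in corpus[doc_id].text:
            problems.append(f"distractor {doc_id} leaks the answer span")
    return problems
```

Running validate_case over the whole set before each evaluation run keeps a broken case from being counted as a system failure.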
Prototype
A RAG evaluation harness that reports retrieval hit rate, answer faithfulness, citation usefulness, and latency budget.
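A minimal sketch of such a harness follows, under stated assumptions: the system under test is a callable that takes a question and returns a dict with retrieved_ids, cited_ids, and answer, and each case is a dict with question, gold_doc_id, and answer_span. These shapes and the function name evaluate are hypothetical. The citation and answer checks are crude proxies (gold document cited, answer span present in the answer text). Real faithfulness scoring would need a stronger check, such as an entailment model or human review, which is not shown here.

```python
import time
from collections import Counter
from typing import Callable, Dict, List


def evaluate(cases: List[dict], system: Callable[[str], dict],
             latency_budget_s: float = 1.0) -> Dict[str, object]:
    """Score each layer separately and record the first layer that failed."""
    if not cases:
        raise ValueError("need at least one benchmark case")
    retrieved = cited = answered = over_budget = 0
    first_failure = Counter()
    for case in cases:
        start = time.perf_counter()
        out = system(case["question"])
        elapsed = time.perf_counter() - start
        if elapsed > latency_budget_s:
            over_budget += 1

        # Each layer is checked independently of the others.
        retrieved_ok = case["gold_doc_id"] in out["retrieved_ids"]
        cited_ok = case["gold_doc_id"] in out["cited_ids"]
        answer_ok = case["answer_span"].lower() in out["answer"].lower()
        retrieved += retrieved_ok
        cited += cited_ok
        answered += answer_ok

        # Attribute the case to the earliest layer that failed.
        if not retrieved_ok:
            first_failure["retrieval"] += 1
        elif not cited_ok:
            first_failure["citation"] += 1
        elif not answer_ok:
            first_failure["answer"] += 1
        else:
            first_failure["ok"] += 1

    n = len(cases)
    return {
        "retrieval_hit_rate": retrieved / n,
        "citation_hit_rate": cited / n,
        "answer_span_match_rate": answered / n,
        "over_budget_rate": over_budget / n,
        "first_failure": dict(first_failure),
    }
```

Reporting the first failing layer alongside the per-layer rates is a design choice: a low answer score with a high retrieval hit rate points at generation or citation, while a low retrieval hit rate makes the downstream numbers hard to interpret.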
Notes
The hardest cases are often boring: stale docs, missing permissions, and partial overlap between question and source language.
Results / Open Questions
Open question: how can compact human review workflows improve eval sets without turning into a full labeling project?
References
Placeholder for RAG eval, information retrieval, and citation-grounding papers.