RAG Evaluation Notes

Question

How do we evaluate RAG systems in a way that points to the real failure layer?

End-to-end answer quality metrics arrive too late in the pipeline to show which layer failed. Retrieval coverage, citation quality, and contradiction detection each need to be measured separately.

Build small benchmark sets with known answer spans, distractors, stale documents, and permission-sensitive content.
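One way to make such a benchmark set concrete is a small record per case. The sketch below is only an illustration: the class name, the field names, and the example IDs are assumptions, not an existing schema. It keeps the known answer spans next to the documents that should be retrieved and the three kinds of "trap" documents, and checks that a case does not contradict itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark question (hypothetical schema, all field names are assumptions)."""
    question: str
    answer_spans: tuple[str, ...]                 # exact text expected in a correct answer
    gold_doc_ids: frozenset[str]                  # documents that contain the answer
    distractor_doc_ids: frozenset[str] = frozenset()   # similar-looking but irrelevant
    stale_doc_ids: frozenset[str] = frozenset()        # outdated versions of the answer
    restricted_doc_ids: frozenset[str] = frozenset()   # must not be shown to this user

    def validate(self) -> list[str]:
        """Return a list of consistency problems; an empty list means the case is usable."""
        problems = []
        if not self.answer_spans:
            problems.append("no answer spans")
        if not self.gold_doc_ids:
            problems.append("no gold documents")
        traps = self.distractor_doc_ids | self.stale_doc_ids | self.restricted_doc_ids
        overlap = self.gold_doc_ids & traps
        if overlap:
            problems.append(f"gold documents also marked as traps: {sorted(overlap)}")
        return problems


# Made-up example case; the question and document IDs are placeholders.
case = BenchmarkCase(
    question="What is the current password rotation interval?",
    answer_spans=("90 days",),
    gold_doc_ids=frozenset({"policy-2024"}),
    distractor_doc_ids=frozenset({"faq-vpn"}),
    stale_doc_ids=frozenset({"policy-2021"}),
    restricted_doc_ids=frozenset({"hr-private-7"}),
)
print(case.validate())  # prints [] because the example case is consistent
```

Keeping stale and restricted documents as separate fields, rather than one generic "negative" list, lets a later report say which kind of trap the system fell into.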

The goal is a RAG evaluation harness that reports retrieval hit rate, answer faithfulness, citation usefulness, and latency against a budget.
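A minimal sketch of what such a report could compute, assuming each benchmark case has already been run and its outputs recorded. Everything here is hypothetical: the CaseResult fields, the function names, and the default k and budget values. Two of the metrics are deliberately crude stand-ins: span matching only checks that a known answer span appears in the answer (real faithfulness scoring would need an entailment check or human review), and citation usefulness is approximated as the share of cited documents that are gold documents.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    """Outputs of the system under test for one case (hypothetical schema)."""
    gold_doc_ids: set[str]
    answer_spans: list[str]
    retrieved_doc_ids: list[str]   # ranked, best first
    cited_doc_ids: list[str]
    answer: str
    latency_ms: float


def retrieval_hit(r: CaseResult, k: int) -> bool:
    """True if any gold document appears in the top-k retrieved documents."""
    return any(doc_id in r.gold_doc_ids for doc_id in r.retrieved_doc_ids[:k])


def span_match(r: CaseResult) -> bool:
    """Crude faithfulness stand-in: the answer contains a known answer span."""
    text = r.answer.lower()
    return any(span.lower() in text for span in r.answer_spans)


def citation_precision(r: CaseResult) -> float:
    """Share of cited documents that are gold; 0.0 when nothing is cited."""
    if not r.cited_doc_ids:
        return 0.0
    return sum(d in r.gold_doc_ids for d in r.cited_doc_ids) / len(r.cited_doc_ids)


def summarize(results: list[CaseResult], k: int = 5,
              latency_budget_ms: float = 2000.0) -> dict[str, float]:
    """Aggregate per-layer metrics so a failure can be traced to one layer."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "retrieval_hit_rate": mean(float(retrieval_hit(r, k)) for r in results),
        "span_match_rate": mean(float(span_match(r)) for r in results),
        "citation_precision": mean(citation_precision(r) for r in results),
        "within_latency_budget": mean(
            float(r.latency_ms <= latency_budget_ms) for r in results
        ),
    }
```

Reporting the four numbers side by side is the point: a low span match rate with a high retrieval hit rate points at generation, while both being low points at retrieval.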

The hardest cases are often boring: stale docs, missing permissions, and partial overlap between question and source language.

Open question: how to design compact human review workflows that improve eval sets without becoming a labeling project.

Placeholder for RAG eval, information retrieval, and citation-grounding papers.