[LAB FILE]

RAG Evaluation Notes

Measuring context before measuring answers.

Question

How do we evaluate RAG systems in a way that points to the real failure layer?

Hypothesis

Answer-quality metrics are measured too late in the pipeline to show which layer failed. Retrieval coverage, citation quality, and contradiction detection each need their own measurement.

Method

Build small benchmark sets with known answer spans, distractors, stale documents, and permission-sensitive content.
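One possible shape for a single benchmark case is sketched below. Every name in it (Doc, BenchmarkCase, validate_case, expect_refusal, the role strings) and all of the sample documents are assumptions made up for illustration, not a schema taken from these notes. The idea is only that each case records its answer span, distractors, stale documents, and the asker's permissions explicitly, so a malformed case is caught before it skews a metric.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Doc:
    doc_id: str
    text: str
    last_updated: date
    allowed_roles: frozenset = frozenset({"public"})  # hypothetical role names


@dataclass
class BenchmarkCase:
    question: str
    asker_role: str
    gold_doc_id: str
    answer_span: tuple  # (start, end) character offsets into the gold doc text
    distractor_ids: list = field(default_factory=list)  # topically close, wrong answer
    stale_ids: list = field(default_factory=list)  # superseded versions of the gold doc
    expect_refusal: bool = False  # True when the asker must NOT see the gold doc


def validate_case(case, corpus):
    """Return a list of problems with a case; an empty list means it is usable."""
    gold = corpus.get(case.gold_doc_id)
    if gold is None:
        return [f"gold doc {case.gold_doc_id!r} is not in the corpus"]
    problems = []
    start, end = case.answer_span
    if not (0 <= start < end <= len(gold.text)):
        problems.append("answer span falls outside the gold doc text")
    can_read = case.asker_role in gold.allowed_roles
    if can_read == case.expect_refusal:
        problems.append("expect_refusal does not match the asker's permissions")
    for doc_id in case.distractor_ids + case.stale_ids:
        if doc_id not in corpus:
            problems.append(f"referenced doc {doc_id!r} is not in the corpus")
    for doc_id in case.stale_ids:
        doc = corpus.get(doc_id)
        if doc is not None and doc.last_updated >= gold.last_updated:
            problems.append(f"stale doc {doc_id!r} is not older than the gold doc")
    return problems


# Made-up sample data: one current policy, one stale version, one distractor.
corpus = {
    "policy-2024": Doc("policy-2024", "Remote staff may expense up to 50 USD per month.", date(2024, 3, 1)),
    "policy-2021": Doc("policy-2021", "Remote staff may expense up to 30 USD per month.", date(2021, 1, 15)),
    "travel": Doc("travel", "Travel meals are capped at 50 USD per day.", date(2024, 2, 1)),
}
gold_text = corpus["policy-2024"].text
span_start = gold_text.index("50 USD")
case = BenchmarkCase(
    question="What is the monthly expense limit for remote staff?",
    asker_role="public",
    gold_doc_id="policy-2024",
    answer_span=(span_start, span_start + len("50 USD")),
    distractor_ids=["travel"],
    stale_ids=["policy-2021"],
)
problems = validate_case(case, corpus)
```

Keeping stale and distractor documents as separate lists means a wrong retrieval can later be labeled by kind (stale version versus topical near-miss) rather than counted as a generic miss.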

Prototype

A RAG evaluation harness that reports retrieval hit rate, answer faithfulness, citation usefulness, and latency against a fixed budget.
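A minimal sketch of such a harness follows, under several assumptions that are not in these notes: the system under test is a callable (here rag_fn) that returns a dict with "answer", "retrieved" (a list of (doc_id, text) pairs), and "cited_ids"; each case is a dict with "question" and "gold_doc_id"; and the 2-second budget is a placeholder. The faithfulness number is a crude token-support proxy (the share of answer tokens that appear in the retrieved text), standing in for whatever entailment or judge-based check the real harness would use.

```python
import re
import time
from statistics import mean


def _tokens(text):
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def evaluate(cases, rag_fn, k=5, latency_budget_s=2.0):
    """Run every case through rag_fn and average four per-layer scores."""
    if not cases:
        raise ValueError("need at least one benchmark case")
    rows = []
    for case in cases:
        start = time.perf_counter()
        out = rag_fn(case["question"])
        elapsed = time.perf_counter() - start

        top_k = out["retrieved"][:k]
        context_tokens = _tokens(" ".join(text for _, text in top_k))
        answer_tokens = _tokens(out["answer"])
        # Crude proxy: share of answer tokens that are backed by retrieved text.
        supported = (
            len(answer_tokens & context_tokens) / len(answer_tokens)
            if answer_tokens
            else 0.0
        )
        rows.append({
            # Retrieval layer: did the gold doc reach the top k?
            "retrieval_hit": float(any(doc_id == case["gold_doc_id"] for doc_id, _ in top_k)),
            # Generation layer: is the answer supported by what was retrieved?
            "faithfulness": supported,
            # Citation layer: does a citation point at the gold doc?
            "citation_useful": float(case["gold_doc_id"] in out["cited_ids"]),
            # Latency: share of cases answered within the budget.
            "within_latency_budget": float(elapsed <= latency_budget_s),
        })
    return {name: mean(row[name] for row in rows) for name in rows[0]}
```

Reporting the four numbers side by side is the point of the design: a low retrieval_hit with high faithfulness suggests the generator is faithfully summarizing the wrong documents, while the reverse points at generation.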

Notes

The hardest cases are often boring: stale docs, missing permissions, and partial overlap between question and source language.
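The partial-overlap case can be made measurable with a simple lexical score. The sketch below is an assumption about how one might flag such cases, not a method from these notes: lexical_overlap is a hypothetical helper that returns the fraction of question tokens also present in the source text, so low-scoring cases can be tagged as vocabulary-mismatch cases in the benchmark set.

```python
import re


def lexical_overlap(question, source):
    """Fraction of distinct question tokens that also occur in the source text."""
    q_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    s_tokens = set(re.findall(r"[a-z0-9]+", source.lower()))
    if not q_tokens:
        return 0.0
    return len(q_tokens & s_tokens) / len(q_tokens)


# Made-up example: the question says "PTO", the source spells it out.
source = "Employees accrue 15 days of paid time off per year."
mismatch = lexical_overlap("How much PTO do I get?", source)
match = lexical_overlap("paid time off days", source)
```

In this made-up pair the first question shares no tokens with the source even though the source answers it, which is the kind of case a keyword-heavy retriever would be expected to miss.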

Results / Open Questions

Open question: how to design compact human-review workflows that improve the eval sets without turning into a full labeling project.

References

Placeholder for RAG eval, information retrieval, and citation-grounding papers.