RAG Evaluation

Thesis

If retrieval fails, answer scoring mostly measures how gracefully the model guessed.

Notes

A useful RAG eval should score retrieval coverage, source relevance, citation accuracy, answer faithfulness, and latency separately. When an answer is wrong, the eval should be able to say which stage of the pipeline failed.
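
One way to picture stage-level blame is a small sketch like the one below. Everything in it is an assumption made for illustration: the EvalRecord fields, the metric definitions (coverage as recall over gold documents, relevance as precision over retrieved documents), the order of the blame checks, and the 2000 ms latency budget are all hypothetical, and answer faithfulness is taken as a precomputed judgment rather than computed here.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class EvalRecord:
    """One evaluated query. All field names are hypothetical."""
    gold_doc_ids: Set[str]        # documents known to contain the answer
    retrieved_doc_ids: List[str]  # what the retriever returned, in rank order
    cited_doc_ids: List[str]      # documents the generated answer cites
    answer_supported: bool        # faithfulness judgment, assumed precomputed
    latency_ms: float             # end-to-end latency for this query


def retrieval_coverage(r: EvalRecord) -> float:
    """Fraction of gold documents that were retrieved (recall)."""
    if not r.gold_doc_ids:
        return 1.0
    return len(r.gold_doc_ids & set(r.retrieved_doc_ids)) / len(r.gold_doc_ids)


def source_relevance(r: EvalRecord) -> float:
    """Fraction of retrieved documents that are gold (precision)."""
    if not r.retrieved_doc_ids:
        return 0.0
    hits = sum(d in r.gold_doc_ids for d in r.retrieved_doc_ids)
    return hits / len(r.retrieved_doc_ids)


def citation_accuracy(r: EvalRecord) -> float:
    """Fraction of citations that point at a retrieved gold document."""
    if not r.cited_doc_ids:
        return 0.0
    retrieved = set(r.retrieved_doc_ids)
    ok = sum(d in retrieved and d in r.gold_doc_ids for d in r.cited_doc_ids)
    return ok / len(r.cited_doc_ids)


def assign_blame(r: EvalRecord, latency_budget_ms: float = 2000.0) -> str:
    """Name the earliest stage that failed, checking upstream stages first."""
    if retrieval_coverage(r) < 1.0:
        return "retrieval"
    if not r.answer_supported:
        return "generation"
    if citation_accuracy(r) < 1.0:
        return "citation"
    if r.latency_ms > latency_budget_ms:
        return "latency"
    return "ok"
```

Checking retrieval first reflects the thesis: if the gold documents never reached the model, downstream scores say little about generation quality. Counting assign_blame labels across a test set then gives a per-stage failure breakdown instead of a single number.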

Working Claim

Evaluation is a debugging interface, not just a leaderboard.