Thesis
If retrieval fails, answer scoring mostly measures how gracefully the model guessed.
Notes
A useful RAG eval should separate retrieval coverage, source relevance, citation accuracy, answer faithfulness, and latency. The pipeline needs blame assignment: when an answer is wrong, the eval should show which stage failed.
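One way to picture the separation described above is a per-query record that is scored stage by stage and then checked in pipeline order. This is a minimal sketch, not a reference implementation. The names `EvalRecord`, `score`, and `blame`, the gold-document labels, the boolean faithfulness verdict, and the 2000 ms latency budget are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRecord:
    """One evaluated query. Every field name is a hypothetical choice."""
    gold_doc_ids: set[str]         # documents a correct answer depends on
    retrieved_doc_ids: list[str]   # what the retriever returned
    cited_doc_ids: list[str]       # documents the answer cites
    answer_supported: bool         # faithfulness verdict from some grader
    latency_ms: float              # end-to-end latency for the query


def _fraction_in(items: list[str], allowed: set[str], empty: float) -> float:
    """Share of `items` found in `allowed`; returns `empty` when there are no items."""
    if not items:
        return empty
    return sum(item in allowed for item in items) / len(items)


def score(r: EvalRecord) -> dict[str, float]:
    """Score each stage separately so one number cannot hide another."""
    retrieved = set(r.retrieved_doc_ids)
    return {
        # Recall: how much of the needed evidence was retrieved.
        "retrieval_coverage": _fraction_in(sorted(r.gold_doc_ids), retrieved, 1.0),
        # Precision: how much of the retrieved set was actually needed.
        "source_relevance": _fraction_in(r.retrieved_doc_ids, r.gold_doc_ids, 0.0),
        # Share of citations pointing at needed documents (0.0 if nothing is cited).
        "citation_accuracy": _fraction_in(r.cited_doc_ids, r.gold_doc_ids, 0.0),
        "answer_faithfulness": 1.0 if r.answer_supported else 0.0,
        "latency_ms": r.latency_ms,
    }


def blame(r: EvalRecord, latency_budget_ms: float = 2000.0) -> Optional[str]:
    """Return the earliest failing stage, or None if every check passes.

    Assumed order: retrieval, then citation, then generation, then latency.
    """
    s = score(r)
    if s["retrieval_coverage"] < 1.0:
        return "retrieval"
    if s["citation_accuracy"] < 1.0:
        return "citation"
    if s["answer_faithfulness"] < 1.0:
        return "generation"
    if s["latency_ms"] > latency_budget_ms:
        return "latency"
    return None
```

Checking retrieval first reflects the thesis: if the needed evidence never arrived, downstream scores say little about the generator. Source relevance is reported but not used for blame here, since extra retrieved documents do not by themselves make an answer wrong.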
Working Claim
Evaluation is a debugging interface, not just a leaderboard.