Ch 15. What to Evaluate¶
What you'll learn
- AI systems without evaluation are unreliable — why you need numbers, not hunches
- Three layers of evaluation — Retrieval · Generation · End-to-End
- Offline (pre-deployment) vs. Online (post-deployment) — the same metric means different things at each stage
- A metric catalog — which signal matters when
- Common traps: single-metric obsession, "bigger models are always better," and test set leakage
Prerequisites
Part 2 (Ch 4–Ch 8) and Part 3 (Ch 9–Ch 14). You've assembled a prototype RAG or tool-calling system.
1. Concept — The "it works" illusion¶
You ship a chatbot. A user asks "what's the refund policy?" and gets a plausible answer. You try three more times. All plausible. Let's deploy.
That's the trap. LLM outputs are probabilistic and sensitive to input. Manual testing on a handful of examples hides the real questions:
- Does hallucination only hit certain question types?
- Is the retriever pulling the wrong doc, and the LLM is just papering over it?
- Are dates and numbers subtly wrong?
- Did you tweak the prompt once and break something else without noticing?
Evaluation is the process of turning these unknowns into repeatable numbers. It's your test suite. It's your A/B experiment. Without it, your system stays in "got lucky" mode forever.
2. Why it matters — Three concrete reasons¶
① Catching regressions. Every time you change a prompt line, swap model versions, or tweak a retriever parameter, you need to know if you broke something that used to work. Manual testing: hard to do 10 cases. An eval script: 100 cases in 30 seconds.
② Making decisions. "Should we use Haiku instead of Sonnet?" "Does adding a reranker help?" Without numbers, it's all gut feel. With Recall@5 jumping from 0.62 to 0.81, the choice is clear.
③ Building trust. In enterprise settings, legal, security, and operations teams will ask: "How often does this get it wrong?" You need to say Faithfulness: 0.93. "It works pretty well" doesn't close the conversation.
3. Where evaluation fits — Three layers¶
A RAG system has at least two blocks: search and generation. The user's experience is the sum of both. So you measure at three points in time.
3-1. Retrieval layer — Did we pull the right documents?¶
This happens before the LLM. Question → top-k docs. That's it.
| Metric | Meaning | When to use |
|---|---|---|
| Recall@k | Fraction of questions where the gold doc is in top-k | Measure search coverage |
| Precision@k | Fraction of top-k results that are actually relevant | When top-k is too broad |
| MRR (Mean Reciprocal Rank) | Average rank position of the right answer | When position matters — is it #1 or buried? |
| nDCG | Ranking score weighted by relevance | When relevance is a spectrum (highly, somewhat) |
3-2. Generation layer — Did we build the right answer from those docs?¶
Fix the retrieval results (assume they're correct). Now measure only: does the LLM output match the documents?
| Metric | Meaning | When to use |
|---|---|---|
| Faithfulness | Fraction of answer supported by the provided docs | Catch hallucination (the whole point of RAG) |
| Correctness | Semantic match to the ground truth | When there's a clear right answer (QA) |
| Coherence | Logical flow and readability | For summaries and long-form responses |
| Groundedness | Strict version of Faithfulness — sentence-level accuracy | Medical, legal, high-stakes domains |
3-3. End-to-End layer — Did the user actually get what they needed?¶
Run the whole pipeline once. Score only the final output.
| Metric | Meaning | When to use |
|---|---|---|
| Helpfulness | Usefulness (human score or LLM judge) | Most common E2E metric |
| Task Success | Did the user's goal actually get accomplished | Agents and multi-step workflows |
| Safety | Absence of harmful, discriminatory, or refusal failures | Operational necessity |
| User Feedback | Thumbs up/down, re-queries, abandonment | Online signals |
Why three layers?
If only E2E is low, you don't know where to fix. But if Retrieval scores 0.85, Generation scores 0.80, and E2E is 0.40, you've just learned: fix the prompt. If Retrieval is 0.30 but the others are fine, add a reranker. Layered metrics tell you where to dig.
4. Minimal example — Recall@k in 30 lines¶
First, the intuition: evaluation isn't complex. Let's run a retrieval eval in a single script.
- Gold set — question + correct doc ID pairs. Start with 3; grow to 30–100 in Ch 16.
- recall@k — "is the right doc anywhere in the top-k?" Binary. For ranked position, use MRR instead.
That's all. One 3-line eval function. You just converted "does my retriever work?" into a number that changes when you modify embeddings, chunking, or retrieval parameters.
5. In practice — Offline vs. Online¶
Evaluation has two axes. Do only one and you'll find a blind spot.
5-1. Offline — Before deployment, fixed data¶
When: Every time you change a prompt, model, or parameter. Ideally in CI/CD, blocking bad merges.
Data: A versioned gold set (Chapter 16 in detail). Don't change it mid-experiment or comparisons break.
Upsides: reproducible, fast, cheap. Once written, you run it hundreds of times.
Blindspots: doesn't measure real-world cases outside your test set.
5-2. Online — After deployment, real users¶
When: Canary (5% traffic) · A/B tests · continuous monitoring.
Data: User logs and feedback. Sample from the live stream.
| Signal | Meaning |
|---|---|
| 👍/👎 | Intuitive but low response rates (5–15%) |
| Re-query rate | User asks a follow-up immediately — lower is better |
| Session length | Number of turns to solve a task (for agents) |
| Abandonment | User leaves mid-response |
| Manual review samples | Daily: sample 100 cases, humans rate on a 3-point scale |
Upsides: real usage · catches long-term trends.
Blindspots: noisy, hard to isolate cause, feedback lag.
5-3. Close the loop: offline and online are partners¶
Failed cases found online get added to your offline gold set. This is the Ch 16 operating loop.
6. Common failure modes¶
6-1. Obsession with a single metric¶
"Look, our Faithfulness is 0.91!" But Recall@5 is 0.30. Meaning: the LLM is faithfully answering the wrong document. Every layer needs at least one metric.
6-2. "Bigger model = better"¶
Sonnet sometimes scores lower Faithfulness than Haiku. Larger models are more confident at inference, filling gaps with plausible-sounding reasoning instead of admitting uncertainty. You'll swap Sonnet for Haiku and get better results. Evaluation reveals this. Guessing doesn't.
6-3. Test set leakage¶
You tweak the prompt bit by bit. Over time, it's not the model that's overfit — you are, to your eval set. Fix: hold out a separate set (30–50 unseen examples) and score against that at the end.
6-4. Deploy on offline scores alone¶
Your offline eval hit 0.92. Users ask different questions. Deploy to canary (5% traffic) and monitor for at least 3 days. If offline and online metrics diverge, that's a signal.
7. Operational checklist¶
- Picked at least one metric per layer (Retrieval · Generation · E2E)
- Gold set is versioned (git/DVC) and labeling guidelines are documented
- Eval script runs in CI and blocks merges on threshold violations
- Online dashboard exists (Grafana, Langfuse) before canary launch
- Failure case loop is live: 👎 or low judge scores → weekly review → gold set refresh
- LLM judge API cost is in budget (it's not cheap if you score thousands daily)
- Hold-out set exists (defend against eval set overfitting)
- New features include "how we evaluate this" in the PR description
8. Exercises and next steps¶
Review¶
- For each of the three layers (Retrieval, Generation, E2E), pick one metric and explain in one sentence why you chose it.
- Your E2E score is low, but Retrieval and Generation scores are fine. Where's the bug?
- Offline scores are stable, but online metrics dropped. Give three hypotheses.
- Why is a single metric (e.g., Faithfulness alone) dangerous? Use a concrete example.
Hands-on¶
- Take the QA prompt from Ch 5. Pick 10 questions, hand-label gold answers, then score
correctnessusing Claude Haiku as judge. Modify the prompt once and log the score change.
Further reading¶
- Stanford CS329A Lec 17 — Agentic Evaluations · Long-Horizon Tasks. See project
_research/stanford-cs329a.md - Stanford CME 295 Lec 8 — Evaluation & LLM-as-a-Judge. See project
_research/stanford-cme295.md - Ragas official metric catalog (faithfulness · answer_relevancy · context_precision)
Next → Ch 16. Building Eval Sets — How to actually construct and maintain 30–100 gold examples in practice