Ch 17. LLM-as-a-Judge¶
What you'll learn
- Judge design — pairwise comparison vs. rubric scoring
- Four biases — position · length · self-preference · verbosity
- Human calibration — validating Judge reliability through agreement rates
- Building a Judge in Claude and measuring bias with A/B flips
- The mindset: never treat a Judge's verdict as ground truth
Prerequisites
Ch 15–Ch 16. You have a gold set, and you need a way to score how similar an LLM response is to it.
1. Concept — Who scores the answer?¶
You've got correct answers in your gold set. But LLM responses don't spit out exact character matches. The meaning is right, but the wording differs. How do you score that?
- Exact match — nearly useless. "March 2024" vs "2024/03" fails
- BLEU/ROUGE — n-gram overlap. Depends on phrasing luck, not meaning
- Embedding similarity — catches the topic, misses factuality
- Humans — accurate but slow and expensive
LLM-as-a-Judge automates the fourth option. You let a different LLM either pick the better of two answers or assign a 1–5 score.
Judge(question, answer_A, answer_B, rubric) → 'A' or 'B' or 'tie'
Judge(question, answer, rubric) → score: 1..5
The key: this isn't manufacturing fake ground truth. It's approximating human judgment. That's why validating the Judge's reliability is non-negotiable.
2. Why you need it¶
- Scale: 100 human evaluations take hours. 100 Judge calls take minutes and cost $0.50.
- Consistency: humans drift with mood and fatigue. Judge outputs the same result for the same input.
- Fast iteration: tweak a prompt line and get the score instantly.
But there are limits. Judges have stronger biases in certain areas than humans do, and they favor their own model family (§5). So "Judge + periodic human sampling" is the standard pattern.
3. Where it's used — two modes¶
3.1 Pairwise comparison¶
"Which of A or B is better?" — single decision, high consistency. Best for model and prompt A/B tests.
3.2 Rubric scoring¶
"Rate this answer 1–5 on accuracy · completeness · brevity." Absolute grading. Good for regression tests and time-series tracking.
| Mode | Pros | Cons | When |
|---|---|---|---|
| Pairwise | Aligns with human judgment · simple | O(N²) comparison cost | model · prompt comparison |
| Rubric | O(N) · absolute scores · axis-wise analysis | Judge criteria drift | regression · operational monitoring |
In practice you use both. Rubric is your main metric, pairwise is the tiebreaker on important calls.
4. Minimal example — Claude as a pairwise judge¶
- Judge function signature — takes question, reference doc, two answers, returns a JSON verdict.
- Haiku is enough — judging is a comparative task, so a lightweight model works fine. If agreement drops below ~0.75, upgrade to Sonnet.
5. Hands-on — measuring and correcting bias¶
5.1 Position bias — A/B flip test¶
The most critical check. Run the same A·B pair in reverse order, then measure agreement.
Below 0.85? Emphasize "content only, position-blind" in the rubric, or upgrade the Judge model.
5.2 Length bias — correlation check¶
Measure the correlation between response length and Judge score. If r ≥ 0.3, the Judge thinks "longer is better."
| length_correlation.py | |
|---|---|
Fix: add "brevity 1–5" as an explicit axis in the rubric and weight it in the total.
5.3 Human calibration — agreement rate¶
Run a weekly sample (10–20 cases) where humans independently grade, then compute alignment with the Judge.
| human_agreement.py | |
|---|---|
- ≥ 0.80: Judge scores are safe for offline metrics
- 0.60–0.80: caution. high-stakes decisions need humans
- < 0.60: redesign the Judge. clarify the rubric or swap the model
5.4 Self-preference defense¶
Claude Judges favor Claude answers. Mitigations:
- Use a different model family as Judge (e.g., if responses are from Claude, Judge with GPT)
- Or run dual judges and escalate disagreements to humans
6. Common pitfalls¶
6.1 Treating the Judge as ground truth¶
Mistake: "Judge score went up, so quality improved" in the report. Without human calibration, the Judge's bias becomes your metric.
Rule: report Judge scores alongside agreement rate. If agreement drops, distrust the Judge result too.
6.2 Vague rubrics¶
"Is the response good? 1–5" breeds bias. Break it into concrete axes:
- Bad: "response quality 1–5"
- Good: "accuracy 1–5, completeness 1–5, brevity 1–5" — separate axes, then average
6.3 Same-model Judge¶
Using claude-opus to judge claude-opus responses triggers self-preference. If both are Claude in production, use a two-model ensemble to offset it.
6.4 Judge cost explosion¶
Eval set of 100 × 20 parameter sweeps = 2000 Judge calls. Sonnet pricing = tens of dollars. Start with Haiku. If Haiku agreement ≥ 0.75, stick with it. Below that, upgrade a sample to Sonnet.
7. Operational checklist¶
- Rubric is split into concrete axes (accuracy, completeness, brevity, etc.)
- A/B flip consistency ≥ 0.85 is checked before deployment
- Weekly human sample (≥10 cases) tracks Judge agreement rate
- Judge model is a different family from the evaluated model
- Rubric includes brevity score to mitigate length bias
- Cost budget for Judge calls is tracked (model · call count · tokens)
- Alert on agreement drop (Slack, dashboard)
- Pairwise results are time-series logged so you can spot drift
8. Exercises & next chapter¶
Review questions¶
- Explain pairwise vs. rubric mode in two scenarios: model A/B selection and operational monitoring.
- Write the formula to measure position bias and explain why 0.85 is the target.
- Your Judge's agreement rate came in at 0.55. List three possible causes and how you'd fix each.
- In two paragraphs, explain why "ship based on Judge score alone" is risky.
Hands-on¶
- Take 10 QA pairs from Ch 16's gold set. Generate responses with two prompt versions (baseline · CoT). Run pairwise Judge comparison and measure A/B flip consistency.
- Run the same responses through both Claude and GPT Judges. Quantify self-preference.
Sources¶
- Stanford CME 295 Lecture 8 — LLM-as-a-Judge, bias, calibration. See
_research/stanford-cme295.md - MT-Bench / Chatbot Arena (Zheng et al.) — pairwise judge standard reference
- Anthropic Building Effective Agents — eval loop design. See
_research/anthropic-building-effective-agents.md
Next → Ch 18. Reasoning Quality Improve the answer itself with CoT, self-consistency, best-of-N, and verifier patterns.