Ch 19. Failure Analysis and Debugging¶
What you'll learn
- Separate failures into a 5-layer taxonomy — Prompt · Retrieval · Data · Generation · Tool/Flow
- Trace-based debugging that actually reproduces failures — LangSmith · Langfuse
- A repeatable debug loop: failure → classify → fix → prevent regression
- How to handle multi-layer failures that cross boundaries
- Part 4 wrap-up — a 14-week operations routine
Prerequisites
Ch 15–Ch 18. You've got numbers, a Judge, and reasoning strategies. Now you'll systematize what to fix when something breaks.
1. Concept — symptoms vs. causes¶
"The model hallucinated" is a symptom, not a cause. Where did it actually go wrong?
- Prompt wasn't clear enough?
- Search returned the wrong documents?
- The source document itself has a typo?
- Document was fine, but the LLM distorted it?
- Tool returned the right data, but we misinterpreted it?
If you just swap in Sonnet without diagnosing the layer, you burn money without fixing anything. Naming the layer unlocks the actual fix.
This chapter's 5-layer taxonomy is a shared language you'll use throughout the book.
2. Why you need this¶
① Cost of fixes varies wildly. Rewriting a prompt takes 30 minutes. Adding a reranker takes a week. Cleaning documents takes a month. Without naming the layer, prioritization breaks down.
② Switching models is an expensive shortcut. Sure, Opus might improve quality by 20%. But if the real problem is retrieval, you've just tripled cost while the actual issue stays broken. Without layer-aware debugging, you'll repeat this mistake forever.
③ Regression prevention. Once you've fixed a case, put it in your gold set (Ch 16) so the next refactor doesn't break it again. A debug loop closes the circle — it's how you build durably.
3. The 5-layer taxonomy — how to classify¶
| # | Layer | Typical symptom | Typical fix | Ch reference |
|---|---|---|---|---|
| 1 | Prompt | Unclear instructions · missing few-shots · format drift | Redesign prompt · add examples | Ch 5 |
| 2 | Retrieval | Right answer not in top-k results | Add reranker · try hybrid search · switch embeddings | Ch 12 · 13 |
| 3 | Data | Chunk boundaries cut the answer · doc has a typo · missing info | Reset chunking · clean documents | Ch 11 |
| 4 | Generation | Docs are there, but answer is mangled · hallucination | Lower temperature · add CoT · add verifier | Ch 5 · 18 |
| 5 | Tool / Flow | Wrong arguments to a tool · state transition breaks | Improve tool description · fix ACI · state machine | Ch 8 · Part 5 |
Diagnostic checklist¶
- Was the right document in the top-k? → If no, it's Retrieval (2). If yes, continue.
- Does the document actually have the answer? → If no, it's Data (3).
- Document is there, answer is still wrong? → That's Generation (4). Bolster with CoT or a verifier.
- Did the reasoning chain call the wrong tool? → That's Tool/Flow (5).
- Passed all checks and still broken? → Loop back to Prompt (1).
4. Minimal example — classify one failure¶
Let's classify a real failure into one of the five layers.
- First fork — Retrieval is fastest to rule out. A set intersection does it.
- Second fork — Generation uses semantic similarity. Below 0.6 = "docs were there, answer still wrong."
This one function buckets 100 failures into 5 categories in 5 minutes. That's the seed for your weekly review.
5. Hands-on — the trace-based debug loop¶
5-1. Trace collection — LangSmith / Langfuse¶
In production, capture input · retrieval results · the full prompt · output · latency · cost into a single trace. Without this, debugging is impossible.
| trace_setup.py | |
|---|---|
One decorator line and every call is automatically recorded. Filter for traces marked as 👎 and move to the next step.
5-2. Weekly review routine — 20 case sample¶
Monday: Pick 20 cases: last week's 👎 + low judge scores (balanced across layers)
Tuesday: Run classify() for first pass → human confirmation → finalize
Wednesday: Pick the top 1–2 root causes per layer as fix candidates
Thursday: Apply fixes · re-run evals · check for regression
Friday: Move fixed cases into gold set · add to next week's regression tests
Run this for just 4 weeks and your failure distribution shifts. "Huh, retrieval problems are gone now — we're left with generation issues" → you know where to focus next.
5-3. Multi-layer failures — when it spans multiple layers¶
Often one failure touches 2–3 layers. Example: "Search missed two critical documents + prompt doesn't trigger CoT → generation gets confused and hallucinates."
The strategy:
- Isolate causes one at a time. Test whether changing just the prompt to add CoT fixes it. Then test whether search alone would fix it. Don't move both dials at once.
- Fix cheapest layer first. Prompt (30 min) → parameters (2 hours) → retrieval (days) → model swap (last resort).
- One axis at a time. Change two things simultaneously and you can't tell which one worked.
5-4. Prevent regression — add to gold set¶
The key: once you've fixed a case, dump it into regression.jsonl so your next CI run validates it automatically. "Same mistake twice" becomes impossible.
6. Common pitfalls¶
6-1. Swap models to hide symptoms¶
Opus gets you 70→78% accuracy, and if the real problem is retrieval, the failure mode stays the same. You just paid 3x to paper over the problem. Don't upgrade models without layer diagnosis.
6-2. Classify nothing, fix everything¶
"50 failures, I'll just rewrite the prompt" is buckshot. If 40 of those 50 are retrieval failures, prompt work is 0% effective.
6-3. No traces saved¶
"I can't reproduce it" = no traces. No debugging → wild guesses → hope for quality. Pick one — LangSmith, Langfuse, or your own logs — and deploy it mandatory.
6-4. Fix once and move on¶
If a fixed case isn't in the eval set, you'll break it again 3 months later on an unrelated change. Regression set inclusion is the final step that closes the loop.
6-5. Fix multiple layers at once¶
Change prompt + retrieval + model simultaneously → quality goes up, but you have no idea what worked. Run A/B splits, one axis at a time.
7. Production checklist¶
- All requests save traces (LangSmith/Langfuse/custom)
-
classify()runs automatically for first-pass categorization - Weekly 20-case review is on the team calendar
- Fixed failures land in regression.jsonl and run in CI
- Monthly dashboard shows failure distribution trend across the 5 layers
- Multi-layer fixes reflect "cheapest layer first" in PR descriptions
- Model upgrade decisions are preceded by layer analysis
- Failure root causes get brief write-ups (Slack/Notion) for org learning
8. Exercises¶
Check your understanding¶
- For each of the 5 layers, rewrite one diagnostic question in your own words.
- Give an example where model swaps help — and an example where they don't.
- Explain the "cheapest layer first" principle in terms of cost × impact.
- What problem repeats if you fix a failure but don't add it to
regression.jsonl?
Hands-on¶
- Collect 20 failures from your prototype → use
classify()for first pass → organize in a table. Find the top root cause and apply the smallest 30-minute fix you can → re-evaluate. - Install LangSmith or Langfuse on the mini_rag from Ch 11. Find one trace in the UI and walk through it step-by-step.
Sources¶
- LangSmith / Langfuse official docs — trace setup guides
- Stanford CS329A homework — failure classification tasks · agent debugging routines. See
_research/stanford-cs329a.md.
9. Part 4 wrap-up — evaluation · reasoning · debugging cycle¶
Here's Part 4 in one table:
| Ch | Topic | Key output |
|---|---|---|
| 15 | What to measure | 3 layers (Retrieval·Generation·E2E) + Offline/Online |
| 16 | Build an eval set | 30–100 gold cases · coverage matrix · 20% held out |
| 17 | LLM-as-a-Judge | Pairwise/Rubric + consistency ≥0.85 · human agreement ≥0.8 |
| 18 | Reasoning quality | Self-Consistency · Best-of-N + verifier (test-time compute) |
| 19 | Failure analysis | 5-layer taxonomy · debug loop · regression inclusion |
You're ready when¶
- 30+ gold test cases are version-controlled and auto-evaluated in CI
- Judge design + A/B consistency is measured
- You've experimented with Self-Consistency or Best-of-N on your own task at least once
- Traces are saved and failures classify automatically
- At least one case sits in
regression.jsonl
What's next — Part 5. Agents & LangGraph¶
With evaluation infrastructure in place, you're ready for complex workflows. No more single LLM calls — now you'll build multi-step reasoning · tool use · state management. Part 5 unfolds entirely on that foundation.
Next → Ch 20. What Is an Agent
Now that you can measure and fix single turns, let's build systems that chain multiple LLM calls, use tools, and reason over time.