Ch 18. Lifting Reasoning Quality¶
What you'll learn
- Four ways to improve reasoning quality by changing your strategy — same model, different approach
- CoT variants · Self-Consistency · Best-of-N · Verifiers — each fits a different problem
- Test-time compute — trade money for quality at inference time
- Building Self-Consistency on math problems and Best-of-N + pytest on code
- The three bottlenecks that hit as you grow N: cost, latency, and verifier quality
Prerequisites
Ch 15–Ch 17. You've got evaluation metrics and a Judge. Now you'll focus on lifting the quality of answers themselves instead of just measuring them.
1. Concept — don't ask once, ask better multiple times¶
Ask an LLM the same question five times and you'll get five different answers. That randomness isn't a bug—it's your biggest asset. This chapter is about turning it into a feature.
Four strategies at a glance:
| Technique | Idea | Key advantage |
|---|---|---|
| ① Single | One direct answer | Fast and cheap |
| ② CoT | "Think step by step" → reveal reasoning trace | Just need more tokens |
| ③ Self-Consistency | N samples → majority vote | Strong on math and multiple choice |
| ④ Best-of-N + Verifier | N candidates → verifier picks the best | Unbeatable on code and checkable tasks |
All four rest on the same principle: test-time compute — freeze the model weights and spend more inference time to buy quality.
2. Why you need this¶
① When you can't swap the model. You need 70% → 85% accuracy but don't have budget for Claude Opus. Inference strategy can bridge that gap.
② When the domain doesn't forgive mistakes. Math, code, money calculations fail catastrophically once. N+5 majority vote or verification raises your floor.
③ When you already have a Judge setup. The Judge from Ch 17 becomes a verifier. You're reusing what works.
Trade-off: cost and latency both multiply by N. We'll address that in §6.
3. Where each technique shines — domain by domain¶
3-1. CoT variants¶
- Math and logic: Just saying "think step by step" lifts accuracy by 20–40 percentage points (Wei 2022)
- Complex classification: Writing out your reasoning catches shallow pattern-matching errors
- Limits: Wrong reasoning can sound plausible. (Ch 19 covers failure modes.)
3-2. Self-Consistency¶
- Discrete answers only: numbers, categories, multiple choice
- How it works: Sample many CoT paths → extract final answer from each → take the majority vote
- Not for: summaries or freeform writing (you can't vote on continuous answers)
3-3. Best-of-N + Verifier¶
- Verifiable tasks: code (run it), SQL (execute on DB), math (substitute back)
- How it works: Generate N candidates → verifier scores or passes each → pick the highest
- Verifier types:
- Deterministic: pytest, SQL execution, regex
- LLM-as-verifier: Reuse your Judge (rubric-based)
- Reward model: Learned scorer (research-grade)
3-4. Tree of Thoughts (concept only)¶
Branch search as a tree. More complex to build, higher cost. This book mentions it for context; see Yao et al. 2023 if you're curious.
4. Minimal example — Self-Consistency on a math problem¶
- Single sampler — high temperature for diversity. CoT system prompt coaxes the model to reason aloud.
- Majority vote — return the most common answer and its agreement rate. 100% match = very confident. 40% = caution.
Use the confidence score: trust only 0.8+, escalate lower ones to a human or different model.
5. Production tutorial — Best-of-N + pytest verifier on code¶
Adding a deterministic verifier (pytest) to code generation transforms your results.
5-1. Flow¶
Q: "Write a function add(a,b) that returns the sum"
↓
N=5 generations (temperature=0.8)
↓
Test each candidate in a sandbox
↓
Return first passing candidate (or use Judge for best)
5-2. Code¶
- Verifier — pytest gives a deterministic pass/fail. No LLM guessing involved.
- Best-of-N loop — return on first pass (early exit). If all fail, return None.
5-3. Expected results¶
- N=1: ~60% accuracy
- N=5 + verifier: 90%+
Cost scales to ~2–3× on average (early exit saves us from hitting the full 5×).
6. Gotchas¶
6-1. Tokens and cost multiply by N¶
Self-Consistency with N=5 = 5× cost · 5× latency (if you don't parallelize). Measure the quality gain on your eval set first. Does the improvement justify the spend?
6-2. Verifier becomes the ceiling¶
Best-of-N can never exceed your verifier's quality. If your LLM Judge agrees only 70% of the time, there's a hard cap. Calibration from Ch 17 is essential.
6-3. Not enough diversity¶
If temperature is 0.3 and you run N=20, you'll get 20 near-identical answers. Pointless. Self-Consistency needs temperature ≥ 0.7.
6-4. Self-Consistency on continuous outputs¶
Majority vote doesn't work for "write a summary." For continuous outputs, use Judge-based Best-of-N (pick the best, don't vote).
6-5. Early stopping trap¶
Returning on first pass keeps latency low and cost proportional, but quality is identical. For maximum quality, generate all N and pick the best with a Judge. Choose based on your goal.
7. Operations checklist¶
- Specified which task types get Self-Consistency vs Best-of-N (math, code, classification only)
- Decided N by measuring on your eval set (don't defaultN=5 by habit)
- Confirmed verifier is deterministic (pytest, regex) or LLM-based with tracked agreement rate
- Verified token and latency costs ×N fit your SLO
- Parallelized if possible (N=5 parallel = 1× latency; N=5 serial = 5× latency)
- Logging Self-Consistency confidence (agreement %) for auto-escalation of low cases
- Measured early-stopping impact on quality
- Periodically compared smaller model + inference strategy vs larger model alone
8. Exercises & next¶
Check your understanding¶
- Why is Self-Consistency incompatible with summaries? What's the Best-of-N + Judge alternative and why does it work?
- Name two extra risks when your verifier is an LLM Judge (see Ch 17).
- Write a formula comparing N=1 (Single) vs N=5 (Self-Consistency) on cost, latency, and quality.
- When does upgrading the model beat increasing test-time compute?
Hands-on¶
- Compare Haiku alone vs Haiku + Self-Consistency (N=5) on 10 GSM8K math problems. Record accuracy for each.
- Implement Best-of-N + pytest on 10 LeetCode Easy problems. Record accuracy at N=1, N=3, N=5.
Sources¶
- Wei et al. 2022 — Chain-of-Thought Prompting
- Wang et al. 2022 — Self-Consistency
- Cobbe et al. 2021 — Verifier and reward models
- Stanford CS329A Lec 2–3 — Test-time compute, Archon, "Let's Verify Step by Step". See project
_research/stanford-cs329a.md
Next → Ch 19. Failure Analysis and Debugging — now that you have numbers, it's time to figure out why they went wrong and separate causes by layer