Ch 34. Small Models, Distillation, and DPO
What you'll learn
- Where small models fit — when to swap a large model for a small one
- Distillation — the pattern where a teacher trains a student
- DPO vs RLHF — preference alignment without a reward model
- SFT → DPO production pipeline
- The traps in synthetic data (bias amplification, hallucination cloning)
- Eight common pitfalls (DPO without SFT · weak distillation validation · Constitutional AI myths · missing out-of-distribution eval · capability ceiling · inherited hallucinations · wrong β · near-identical preference pairs)
- Graduating Part 7 and moving toward the capstone
Prerequisites
You've walked through the decision tree in Ch 32 and seen LoRA at work in Ch 33. This chapter gives the big picture and what comes next.
1. Concept: where small models belong
If you can copy the response quality of a large model (Opus, GPT-4) into a small model (Haiku, 8B), you cut inference costs by roughly 30× (Ch 30).
Two paths:
| Approach | What you train | When to use |
|---|---|---|
| Domain SFT (Ch 33) | Small model, directly, on domain data | When you have labeled data |
| Distillation | Small model learns from answers a large model generates | When labeling is expensive or scarce |
Distillation swaps human annotation cost for teacher inference — a large model writes the answers, and the student learns from them.
Five steps:
- Unlabeled queries — from production logs or synthetic (thousands to tens of thousands)
- Teacher inference — large model generates answers
- Filter — validate quality (judge LLM + rules). This step decides the outcome.
- Student SFT (LoRA) — exactly as in Ch 33
- Deploy — run the small model at 30× lower cost
"In distillation, data curation decides the result more than the learning algorithm."
2. Why it matters: when distillation beats SFT
| Situation | Direct SFT | Distillation |
|---|---|---|
| You have correct labels | ◎ | △ (unnecessary) |
| The right answer is ambiguous (long outputs) | △ (hard to label) | ◎ |
| Annotator cost > teacher API cost | △ | ◎ |
| Need to match domain tone and style | ◎ | ◎ |
| Need factual accuracy in the domain | △ | △ (RAG is the answer) |
Cost of annotation vs. cost of teacher inference is the clearest dividing line. Five thousand samples labeled by people might run $5K; the same five thousand generated by Opus cost roughly $150.
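The arithmetic behind that dividing line can be sketched as follows. The dollar totals are the ones quoted above; the pass rates are the 30–70% filter range quoted in §4, and the function names are only illustrative. Filtering discards some teacher output, so the fairer comparison is cost per kept sample.

```python
# Dollar figures come from the paragraph above; the pass rates are the
# 30-70% range this chapter quotes for the distillation filter.
N_SAMPLES = 5_000
HUMAN_TOTAL_USD = 5_000
TEACHER_TOTAL_USD = 150


def per_sample(total_usd: float, n: int) -> float:
    """Average cost of one labeled or generated sample."""
    return total_usd / n


def per_kept_sample(total_usd: float, n: int, pass_rate: float) -> float:
    """Cost of one sample that survives filtering at the given pass rate."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return total_usd / (n * pass_rate)


human = per_sample(HUMAN_TOTAL_USD, N_SAMPLES)
teacher = per_sample(TEACHER_TOTAL_USD, N_SAMPLES)
print(f"human: ${human:.2f}/sample, teacher: ${teacher:.2f}/sample")

for rate in (0.3, 0.7):
    kept = per_kept_sample(TEACHER_TOTAL_USD, N_SAMPLES, rate)
    print(f"pass rate {rate:.0%}: ${kept:.3f} per kept teacher sample")
```

Even at the pessimistic 30% pass rate, a kept teacher sample costs about a tenth of a human label in this example. The sketch ignores judge-LLM calls and engineering time, which you would add for a real budget.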
3. Where it's used: three alignment methods compared
The final alignment stage of an LLM (Stage 3 from Ch 31) is something you can tune yourself.
| Method | Data | Algorithm | Cost | Can we do it? |
|---|---|---|---|---|
| SFT | (q, a) | next-token loss | $$ | ◎ (Ch 33) |
| DPO | (q, ✓, ✗) | preference loss, no reward model | $$$ | ◎ |
| RLHF | (q, ✓, ✗) | reward model → PPO (RL loop) | $$$$ | △ (big labs only) |
DPO in one line
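DPO rolls reward model training and PPO into a single loss (as given in Rafailov et al., 2023):

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here x is the prompt, σ is the logistic sigmoid, and β sets how strongly the policy is held to the reference.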
The intuition: "make the preferred answer (y_w) more likely than the rejected one (y_l)." You need two models (the one being trained, π_θ, and a frozen reference, π_ref), but no separate reward model.
In practice, about five thousand preference pairs are usually enough for DPO. RLHF on the same data would still require training a reward model and running PPO, so we usually stop at DPO.
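To make the loss concrete, here is a minimal per-pair sketch of the DPO objective from Rafailov et al. (2023) in plain Python. It assumes you already have sequence log-probabilities (summed token log-probs) for the chosen and rejected answers under both the trained policy and the frozen reference; the function and argument names are illustrative, not a library API.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (prompt, chosen, rejected) triple."""
    # How much more (or less) likely each answer became relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen answer: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"at initialization: {start:.4f}, after improvement: {improved:.4f}")
```

The loss depends only on how the policy's preference margin moved relative to the reference, which is why no separate reward model is needed. A real trainer averages this over a batch and backpropagates through the policy log-probs only.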
Constitutional AI: cutting human labels
Anthropic's Constitutional AI replaces human preference labels with the model's own self-criticism:
- Model generates an answer
- "Does this answer follow our constitution (policy)?" — the model judges itself
- If not, the model revises
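- (Original, revised) pairs go to DPO as (rejected, chosen). In the original paper the revisions feed an SFT stage and AI-labeled preferences drive an RL stage; sending the pairs to DPO is the lighter-weight variant used here.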
Pro: fewer human labels. Con: model bias flows into constitutional judgment — you still need human review.
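The critique-and-revise loop above can be sketched as a small function that turns one prompt into a DPO pair. The three callables (`generate`, `violates`, `revise`) are hypothetical stand-ins for model calls, and the `needs_human_review` flag is an assumption added to reflect the human-review caveat; none of this is Anthropic's implementation.

```python
from typing import Callable, Optional


def build_cai_pair(prompt: str,
                   generate: Callable[[str], str],
                   violates: Callable[[str, str], bool],
                   revise: Callable[[str, str], str]) -> Optional[dict]:
    """Return a (chosen, rejected) record, or None if there is nothing to learn."""
    original = generate(prompt)
    if not violates(prompt, original):
        return None  # answer already follows the policy: no pair produced
    revised = revise(prompt, original)
    # Discard revisions that are unchanged or still violate the policy.
    if revised == original or violates(prompt, revised):
        return None
    return {
        "prompt": prompt,
        "chosen": revised,      # the self-corrected answer is preferred
        "rejected": original,   # the first draft is dispreferred
        "needs_human_review": True,  # the auto-judge itself must be spot-checked
    }


# Stub callables so the sketch runs without a model.
pair = build_cai_pair(
    "Cancel my subscription.",
    generate=lambda p: "No. Figure it out yourself.",
    violates=lambda p, a: "yourself" in a,
    revise=lambda p, a: "I can help with that. Go to Billing and choose Cancel plan.",
)
print(pair)
```

Because the same model family writes, judges, and revises, a sampled subset of these pairs should still be read by a person before training.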
4. Minimal example: DPO in 30 lines
- The key columns: `prompt` · `chosen` · `rejected`.
- DPO runs on top of an SFT model. Don't DPO a base model directly.
- `learning_rate` is much smaller than for SFT (5e-6).
- `beta` controls KL penalty strength (0.1 is standard).
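As a sketch of the data side only, the snippet below checks that each record carries the three columns named above and serializes the clean ones to JSONL. It does not reproduce the training call; the helper names are illustrative, and the `DPO_HPARAMS` dict simply restates the values from the notes rather than any trainer's configuration object.

```python
import json

REQUIRED = ("prompt", "chosen", "rejected")

# Values quoted in the notes above; pass them to whichever trainer you use.
DPO_HPARAMS = {"beta": 0.1, "learning_rate": 5e-6}


def validate_record(rec: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for key in REQUIRED:
        value = rec.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")
    if not problems and rec["chosen"].strip() == rec["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


def to_jsonl(records: list[dict]) -> str:
    """Serialize valid records, one JSON object per line, keeping only the key columns."""
    clean = [r for r in records if not validate_record(r)]
    return "\n".join(
        json.dumps({k: r[k] for k in REQUIRED}, ensure_ascii=False) for r in clean
    )


records = [
    {"prompt": "Can I get a refund?",
     "chosen": "Yes. Refunds are available within 30 days; here is how to request one.",
     "rejected": "Refunds are not my problem."},
    {"prompt": "Reset my password", "chosen": "Sure.", "rejected": "Sure."},
    {"prompt": "Hello", "chosen": "Hi there!"},
]
print(to_jsonl(records))
```

Only the first record survives: the second has identical answers and the third has no `rejected` column.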
Generating distillation data
- The filter matters. LLM judge (Ch 17) + rules (length, forbidden words, format). Filter pass rate is usually 30–70%.
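A minimal sketch of that filter, assuming teacher answers are already collected as `{"query", "answer"}` dicts. The thresholds, the forbidden-phrase list, the terminal-punctuation format rule, and the `judge` callable (standing in for an LLM judge that returns a 1–5 score) are all hypothetical choices for illustration.

```python
from typing import Callable

# Hypothetical rule settings; tune them for your domain.
MIN_CHARS, MAX_CHARS = 20, 4000
FORBIDDEN = ("as an ai language model",)


def passes_rules(answer: str) -> bool:
    """Cheap deterministic checks: length, forbidden phrases, basic format."""
    text = answer.strip()
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False
    if any(phrase in text.lower() for phrase in FORBIDDEN):
        return False
    return text[-1] in ".!?"  # format rule: answer must not end mid-sentence


def filter_teacher_outputs(samples: list[dict],
                           judge: Callable[[str, str], int],
                           min_score: int = 4) -> tuple[list[dict], float]:
    """Keep samples that pass the rules and the judge; also report the pass rate."""
    kept = []
    for s in samples:
        # Rules run first so the (expensive) judge is only called on survivors.
        if passes_rules(s["answer"]) and judge(s["query"], s["answer"]) >= min_score:
            kept.append(s)
    pass_rate = len(kept) / len(samples) if samples else 0.0
    return kept, pass_rate


samples = [
    {"query": "How do I reset my password?",
     "answer": "Open Settings, choose Security, then select Reset password."},
    {"query": "How do I reset my password?", "answer": "OK."},
    {"query": "How do I reset my password?",
     "answer": "As an AI language model, I cannot know your settings for certain."},
    {"query": "How do I reset my password?",
     "answer": "Passwords are stored in the cloud and cannot be changed by anyone."},
]
stub_judge = lambda q, a: 5 if "Settings" in a else 2  # stand-in for an LLM judge
kept, pass_rate = filter_teacher_outputs(samples, stub_judge)
print(len(kept), f"{pass_rate:.0%}")
```

Tracking `pass_rate` over time is useful: a sudden jump or drop usually means the teacher prompt, the judge, or the query mix has changed.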
5. Hands-on: SFT → DPO production pipeline
Production logs → SFT data (q,a) → SFT (LoRA · Ch 33)
↓
Base SFT model
↓
Humans create (good, bad) pairs, 1K–5K
↓
DPO (LoRA · above §4) → Aligned model
↓
Domain eval + safety regression (Ch 28)
↓
Deploy (Ch 26 · adapter swap)
One quarter (3 months) is one cycle. SFT takes a month, collecting DPO pairs takes a month, DPO training and validation takes a month.
Evaluation: out-of-distribution is the real signal
Everything works on data you've seen. The real test is:
- Held-out domain: similar domain you didn't train on — measures generalization
- Adversarial: jailbreaks, injections, traps (Ch 28)
- Safety regression: alignment didn't break (Ch 28 guardrail pass rate)
- Regression set: cases the old version handled well (regression.jsonl · Ch 19)
The most common accident after DPO: the refusal tone improved but general answers got shorter. You won't catch this without out-of-distribution eval.
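One way to catch that accident automatically is a release gate that compares the candidate against the baseline on every eval set and on mean answer length. The sketch below assumes scores on a 0–100 scale and a token-count average per run; the set names, thresholds, and dict layout are hypothetical.

```python
def eval_gate(baseline: dict, candidate: dict,
              max_drop: float = 1.0, max_length_shrink: float = 0.2) -> list[str]:
    """Return the reasons to block a release; an empty list means the gate passes."""
    failures = []
    for name, base_score in baseline["scores"].items():
        cand_score = candidate["scores"].get(name)
        if cand_score is None:
            failures.append(f"{name}: missing from candidate run")
        elif base_score - cand_score > max_drop:
            failures.append(f"{name}: dropped {base_score - cand_score:.1f} points")
    # Shorter general answers are the typical silent regression after DPO.
    base_len = baseline["mean_answer_tokens"]
    cand_len = candidate["mean_answer_tokens"]
    if base_len > 0 and (base_len - cand_len) / base_len > max_length_shrink:
        shrink = (base_len - cand_len) / base_len
        failures.append(f"mean answer length shrank {shrink:.0%}")
    return failures


baseline = {
    "scores": {"held_out": 82.0, "adversarial": 90.0, "safety": 97.0, "regression": 95.0},
    "mean_answer_tokens": 210,
}
candidate = {
    "scores": {"held_out": 81.5, "adversarial": 91.0, "safety": 97.0, "regression": 95.0},
    "mean_answer_tokens": 140,
}
failures = eval_gate(baseline, candidate)
print(failures)
```

In this example every score is within tolerance, yet the gate still blocks the release because answers got a third shorter.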
Cost model
| Step | Cost (example) |
|---|---|
| Label 5K SFT samples | $2K–5K |
| Train SFT (LoRA/QLoRA) | $50 |
| Create 1K DPO pairs | $1K–3K |
| Train DPO | $50 |
| Eval infrastructure (Ch 16) | $200/month |
| Total (one quarter, incl. 3 months of eval) | ≈ $3.7K–8.7K |
The operational savings (Opus → Haiku is 30× cheaper) must recoup this cost for a positive ROI. Calculate over a year.
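A back-of-the-envelope version of that calculation is sketched below. The per-step costs are the table's example figures and the 30× ratio is the one quoted in this chapter; the monthly large-model spend of $5,000 and the assumption of four cycles per year are hypothetical inputs you would replace with your own.

```python
def quarterly_cost(sft_labels: float, dpo_pairs: float,
                   sft_train: float = 50, dpo_train: float = 50,
                   eval_per_month: float = 200, months: int = 3) -> float:
    """Sum the rows of the cost table for one quarterly cycle."""
    return sft_labels + sft_train + dpo_pairs + dpo_train + eval_per_month * months


def monthly_saving(large_model_spend: float, cost_ratio: float = 30) -> float:
    """Saving from serving the same traffic on a model that is cost_ratio times cheaper."""
    return large_model_spend * (1 - 1 / cost_ratio)


def payback_months(cycle_cost: float, large_model_spend: float) -> float:
    """Months of savings needed to recoup one cycle."""
    return cycle_cost / monthly_saving(large_model_spend)


def year_one_net(cycle_cost: float, large_model_spend: float, cycles: int = 4) -> float:
    """Twelve months of savings minus the cost of the cycles run that year."""
    return 12 * monthly_saving(large_model_spend) - cycles * cycle_cost


low = quarterly_cost(sft_labels=2_000, dpo_pairs=1_000)
high = quarterly_cost(sft_labels=5_000, dpo_pairs=3_000)
print(f"one cycle: ${low:,.0f} to ${high:,.0f}")

spend = 5_000  # hypothetical current monthly spend on the large model
print(f"payback at the high estimate: {payback_months(high, spend):.1f} months")
print(f"year-one net at the high estimate: ${year_one_net(high, spend):,.0f}")
```

Summing the table rows gives roughly $3.7K to $8.7K per cycle, consistent with the table's total. The sketch assumes all traffic moves to the small model; if hard queries are still routed to the large model, scale `spend` down to the share that actually moves.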
6. Common failure points
- DPO without SFT first. If you DPO a base model directly, training tends not to converge on useful behavior. SFT teaches "how to format a response"; DPO then teaches "which response is better."
- Weak distillation validation. You learn exactly what the teacher generated, bugs and all — teacher bias and hallucinations clone into the student. The filter (judge + rules) is everything.
- Constitutional AI illusion. It sounds like "we can align without humans," but in reality you still need human review — to validate the auto-judge itself.
- Missing out-of-distribution eval. If you only test on your training distribution, you'll miss generalization failures. Hold-out, adversarial, and regression evals are essential.
- Expecting small models to solve everything. A task that demands complex reasoning won't suddenly work on Haiku just because you trained it. A capability ceiling can't be trained away.
- Inheriting teacher hallucinations. Even Opus makes factual errors sometimes. Train on them unfiltered, and your small model will reliably repeat them. Add fact-checking to the filter.
- Wrong DPO beta. Too high (β > 0.5) and you're stuck near the reference; too low (β < 0.05) and the policy drifts away from the reference, which destabilizes training. Start with β = 0.1.
- Chosen/rejected are only surface differences. Pairs with almost no semantic gap won't teach anything. They need clear differences in tone, accuracy, or completeness.
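For the last pitfall, a cheap first screen is to flag pairs whose chosen and rejected texts are nearly the same string. The sketch below uses the standard library's `difflib.SequenceMatcher`; the 0.9 threshold is an arbitrary starting point, and character-level similarity is only a proxy, since two differently worded answers can still mean the same thing. Flagged pairs need a judge or a human to decide.

```python
from difflib import SequenceMatcher


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def flag_weak_pairs(pairs: list[dict], threshold: float = 0.9) -> list[int]:
    """Indices of pairs whose chosen and rejected answers are nearly identical."""
    return [
        i for i, p in enumerate(pairs)
        if surface_similarity(p["chosen"], p["rejected"]) >= threshold
    ]


pairs = [
    {"prompt": "Share my account details.",
     "chosen": "I can't help with that.",
     "rejected": "I can't help with that!"},
    {"prompt": "Share my account details.",
     "chosen": "I can't share account details, but I can walk you through resetting your password.",
     "rejected": "No."},
]
weak = flag_weak_pairs(pairs)
print(weak)
```

The first pair differs by one punctuation mark and is flagged; the second differs in tone and completeness and passes.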
7. Operations checklist
- When distilling, monitor filter (judge + rules) pass rate on teacher answers
- DPO only on top of an SFT model
- Chosen/rejected pairs in DPO data are meaningfully different
- Start with β = 0.1, learning rate 5e-6 to 1e-5
- Out-of-distribution eval on three axes (hold-out · adversarial · regression)
- Automate safety regression checks (Ch 28 guardrail pass rates)
- Document small model capability limits (route hard queries to large model)
- Sustain the quarterly cycle (SFT → DPO → eval → deploy)
- Calculate year-one ROI (Ch 32 §5)
- Log adapter metadata + training data hash + base model version
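The last checklist item can be as simple as one JSON record per trained adapter. This is a sketch using only the standard library; the field names, the adapter name, and the base model string are hypothetical, and a real setup would write the record next to the adapter weights or into an experiment tracker.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large training sets don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def adapter_record(adapter_name: str, base_model: str,
                   train_path: str, hparams: dict) -> dict:
    """Everything needed to answer 'what exactly was this adapter trained on?'"""
    return {
        "adapter": adapter_name,
        "base_model": base_model,
        "train_data_sha256": sha256_of_file(train_path),
        "hparams": hparams,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def save_record(record: dict, out_path: str) -> None:
    """Write the record as pretty-printed JSON."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, sort_keys=True)
```

Usage might look like `save_record(adapter_record("support-dpo-q3", "example-8b-base", "dpo_pairs.jsonl", {"beta": 0.1, "learning_rate": 5e-6}), "adapter_meta.json")`. If the data hash changes between two runs, you know the comparison is not like for like.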
8. Exercises and Part 7 wrap-up
- You have 5K queries and a labeler budget of $500. Design a distillation process to create SFT data, including your filtering policy.
- Your SFT model's refusals are too cold. You want to warm them up with DPO. Design five (chosen, rejected) pairs to achieve it.
- Your 8B SFT+DPO model regresses by 3 points on out-of-distribution eval vs. baseline. Diagnose and list three remedies.
- Identify one task in your domain that "only large models can solve" and explain why a small model can't learn it, from a capability perspective.
Part 7 wrap-up: five graduation artifacts from Models & Finetuning
| # | Artifact | From chapter |
|---|---|---|
| ① | Model card review template + context limits, tokenizers, licenses | Ch 31 |
| ② | Finetuning decision document (did you pass all four gates?) | Ch 32 |
| ③ | LoRA PoC notebook + effect measurement results | Ch 33 |
| ④ | SFT → DPO operations pipeline design | Ch 34 |
| ⑤ | Small model routing strategy (bridges Ch 30 and Ch 34) | Ch 30 + 34 |
Bridge to capstone: Self-Improving Assistant
Every piece from Parts 1–7 comes together at the capstone.
| Part | Piece | Role in capstone |
|---|---|---|
| 1 | LLM fundamentals | What's possible |
| 2 | APIs, prompts, tools | How to call models |
| 3 | RAG | Domain facts |
| 4 | Eval and debugging | Self-measurement |
| 5 | Agents, LangGraph, memory | Loop and state |
| 6 | Guardrails, monitoring, cost | Operations |
| 7 | Models and finetuning | Self-improvement (failed cases become training data) |
The Self-Improving Assistant stitches these seven pieces into one system and runs a closed loop: user feedback → automatic addition to the eval set → quarterly SFT/DPO → deploy.
Next → Capstone: Self-Improving Assistant
Sources
- Rafailov et al. (2023) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
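- Ouyang et al. (2022) Training Language Models to Follow Instructions with Human Feedback (InstructGPT; RLHF original)
- Bai et al. (2022) Constitutional AI: Harmlessness from AI Feedback (Anthropic)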
- Hinton et al. (2015) Distilling the Knowledge in a Neural Network
- Hugging Face TRL — DPOTrainer docs
- Stanford CME 295 Lec 5