Ch 34. Small Models, Distillation, and DPO

What you'll learn

Where small models fit — when to swap a large model for a small one
Distillation — the pattern where a teacher trains a student
DPO vs RLHF — preference alignment without a reward model
SFT → DPO production pipeline
The traps in synthetic data (bias amplification, hallucination cloning)
Eight common pitfalls (DPO without SFT · weak distillation validation · Constitutional AI myths · missing out-of-distribution eval · capability ceiling · inherited hallucinations · wrong DPO beta · surface-only preference pairs)
Graduating Part 7 and moving toward capstone

Prerequisites

You've walked through the decision tree in Ch 32 and seen LoRA at work in Ch 33. This chapter covers the big picture and what comes next.

1. Concept: where small models belong

If a small model preserves the large model's quality on the target task, it can reduce per-call cost. Calculate the ratio from current pricing and observed token usage (Ch 30).

Two paths:

Approach	What you train	When to use
Domain SFT (Ch 33)	Small model, directly, on domain data	When you have labeled data
Distillation	Small model learns from answers a large model generates	When labeling is expensive or scarce

Distillation swaps human annotation cost for teacher inference — a large model writes the answers, and the student learns from them.

Distillation pipeline

Five steps:

Unlabeled queries — from production logs or synthetic (thousands to tens of thousands)
Teacher inference — large model generates answers
Filter — validate quality (judge LLM + rules). This step decides the outcome.
Student SFT (LoRA) — exactly as in Ch 33
Deploy — remeasure quality, latency, and actual cost on the small model

"In distillation, data curation decides the result more than the learning algorithm."

2. Why it matters: when distillation beats SFT

Situation	Direct SFT	Distillation
You have correct labels	◎	△ (unnecessary)
The right answer is ambiguous (long outputs)	△ (hard to label)	◎
Annotator cost > teacher API cost	△	◎
Need to match domain tone and style	◎	◎
Need factual accuracy in the domain	△	△ (RAG is the answer)

The clearest dividing line is annotation cost versus teacher inference cost. Five thousand samples labeled by people might run about $5K; the same five thousand generated by Opus cost roughly $150, depending on current pricing and answer length.
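The comparison above can be sketched as a small calculation. The per-label price, token counts, and per-million-token prices below are placeholders chosen only so that the totals reproduce the $5K and $150 figures; they are not actual pricing, so substitute your own measured values.

```python
def annotation_cost(n_samples: int, usd_per_label: float) -> float:
    """Total cost of having people label n_samples examples."""
    return n_samples * usd_per_label


def teacher_cost(n_samples: int, in_tokens: int, out_tokens: int,
                 usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Total teacher-inference cost, given average tokens per sample."""
    per_sample = (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1_000_000
    return n_samples * per_sample


# Placeholder inputs (assumptions, not real prices): replace with your own numbers.
human = annotation_cost(5_000, usd_per_label=1.00)
teacher = teacher_cost(5_000, in_tokens=500, out_tokens=300,
                       usd_per_m_in=15.0, usd_per_m_out=75.0)
print(f"human: ${human:,.0f}  teacher: ${teacher:,.0f}  ratio: {human / teacher:.0f}x")
```

The ratio only favors distillation if the filtered teacher answers are good enough to train on, so the filter pass rate belongs in the same spreadsheet.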

3. Where it's used: three alignment methods compared

The final alignment stage of an LLM (Stage 3 from Ch 31) is something you can tune yourself.

Three alignment methods

Method	Data	Algorithm	Cost	Can we do it?
SFT	(q, a)	next-token loss	$$	◎ (Ch 33)
DPO	(q, ✓, ✗)	preference loss, no reward model	$$$	◎
RLHF	(q, ✓, ✗)	reward model → PPO (RL loop)	$$$$	△ (big labs only)

DPO in one line

DPO rolls reward model training and PPO into a single loss:

\[ \mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \]

The intuition: "make the preferred answer (y_w) more likely than the rejected one (y_l)." You need two models (the one being trained, π_θ, and a frozen reference, π_ref), but no separate reward model.
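To make the formula concrete, here is a minimal per-example sketch in plain Python. It assumes you already have the summed log-probability of each full response under the trained model and under the frozen reference; the function and argument names are illustrative and do not come from any library.

```python
import math


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Per-example DPO loss from summed sequence log-probabilities."""
    # Margin = beta * (log-ratio of the chosen answer - log-ratio of the rejected one).
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen answer's likelihood and lowering the rejected one's reduces the loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A training library averages this quantity over a batch and backpropagates through the policy log-probabilities only; the reference terms are constants.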

In practice, a few thousand preference pairs (1K–5K) are usually enough for DPO. RLHF on the same data would still require training a reward model and running PPO, so most teams stop at DPO.

Constitutional AI: cutting human labels

Anthropic's Constitutional AI replaces human preference labels with the model's own self-criticism:

Model generates an answer
"Does this answer follow our constitution (policy)?" — the model judges itself
If not, the model revises
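(Original, revised) pairs go to DPO as (rejected, chosen). The original paper instead trains a preference model on AI feedback and runs RL; feeding the pairs to DPO is the lighter variant used here.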

Pro: fewer human labels. Con: model bias flows into constitutional judgment — you still need human review.
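The four steps above can be sketched as a pair-building loop. The three callables (generate, violates, revise) are hypothetical stand-ins for model calls that you would supply yourself; nothing here is Anthropic's implementation, only one way to turn self-revisions into DPO rows.

```python
from typing import Callable, Dict, List


def build_cai_pairs(
    prompts: List[str],
    generate: Callable[[str], str],
    violates: Callable[[str, str], bool],
    revise: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Turn self-critique revisions into prompt/chosen/rejected rows."""
    pairs = []
    for prompt in prompts:
        original = generate(prompt)
        if not violates(prompt, original):
            continue  # answer already follows the policy: no pair to learn from
        revised = revise(prompt, original)
        if revised.strip() == original.strip():
            continue  # revision changed nothing, so there is no preference signal
        pairs.append({"prompt": prompt, "chosen": revised, "rejected": original})
    return pairs
```

Because the same model writes, judges, and revises, a sample of the resulting pairs should still go to a human reviewer before training.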

4. Minimal example: DPO in 30 lines

dpo_train.py
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Data: preference pairs                                                  # (1)
pairs = [
    {"prompt": "Can I get a refund?",
     "chosen": "Yes, within 30 days. Please have your receipt ready.",
     "rejected": "No refunds."},
    # ...
]
ds = Dataset.from_list(pairs)

base_id = "meta-llama/Llama-3.1-8B-Instruct"                            # (2)
tok = AutoTokenizer.from_pretrained(base_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # some tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="bfloat16", device_map="auto")
# With a LoRA-wrapped model and no ref_model, TRL uses the adapter-disabled weights as the reference.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                                         target_modules=["q_proj", "v_proj"]))

trainer = DPOTrainer(
    model=model, train_dataset=ds,
    tokenizer=tok,  # newer TRL releases name this argument processing_class
    args=DPOConfig(                                                      # (3)
        output_dir="dpo_out", per_device_train_batch_size=2,
        gradient_accumulation_steps=8, num_train_epochs=1,
        learning_rate=5e-6, beta=0.1, bf16=True,
    ),
)
trainer.train()
trainer.save_model("dpo_out/adapter")  # saves the LoRA adapter weights

(1) The key columns: prompt · chosen · rejected.
(2) DPO runs on top of an SFT model. Don't DPO a base model directly.
(3) learning_rate is much smaller than for SFT (5e-6). beta controls KL penalty strength (0.1 is standard).

Generating distillation data

distill_collect.py
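from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# load_queries, pass_filter, format_chat, and save_jsonl are project helpers
# that this chapter does not define; supply your own.
queries = load_queries("logs.jsonl")[:5000]
out = []
for q in queries:
    # One teacher call per query; the first content block holds the answer text.
    a = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{"role": "user", "content": q}],
    ).content[0].text
    if pass_filter(q, a):                                               # (1)
        out.append({"text": format_chat(q, a)})
save_jsonl("distill_train.jsonl", out)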

(1) The filter matters: an LLM judge (Ch 17) plus rules (length, forbidden words, format). Filter pass rate is usually 30–70%.
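One possible shape for pass_filter, as a sketch rather than the chapter's actual helper: the length bounds, the forbidden phrases, the truncation heuristic, and the 0.7 judge threshold are all assumptions to tune on your own data. The judge is passed in as an optional callable returning a score between 0 and 1, so the rule checks can run (and be tested) without any API call.

```python
from typing import Callable, Optional

FORBIDDEN = ("as an ai language model", "i cannot help with that")  # example phrases only
MIN_CHARS, MAX_CHARS = 20, 4000  # placeholder bounds


def pass_filter(q: str, a: str,
                judge: Optional[Callable[[str, str], float]] = None,
                min_score: float = 0.7) -> bool:
    """Return True if a teacher answer is clean enough to train on."""
    text = a.strip()
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False  # length rule
    lowered = text.lower()
    if any(phrase in lowered for phrase in FORBIDDEN):
        return False  # forbidden-word rule
    if text[-1] not in ".!?)\"'":
        return False  # format rule: answer was probably cut off at max_tokens
    if judge is not None and judge(q, text) < min_score:
        return False  # LLM-judge rule (the judge is supplied by the caller)
    return True
```

Running the cheap rules before the judge keeps judge calls, and therefore cost, proportional to the answers that survive the rules.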

5. Hands-on: SFT → DPO production pipeline

Production logs → SFT data (q,a) → SFT (LoRA · Ch 33)
                                         ↓
                            Base SFT model
                                         ↓
                  Humans create (good, bad) pairs, 1K–5K
                                         ↓
                    DPO (LoRA · above §4) → Aligned model
                                         ↓
               Domain eval + safety regression (Ch 28)
                                         ↓
                     Deploy (Ch 26 · adapter swap)

One quarter (3 months) is one cycle. SFT takes a month, collecting DPO pairs takes a month, DPO training and validation takes a month.

Evaluation: out-of-distribution is the real signal

Everything works on data you've seen. The real test is:

Held-out domain: similar domain you didn't train on — measures generalization
Adversarial: jailbreaks, injections, traps (Ch 28)
Safety regression: alignment didn't break (Ch 28 guardrail pass rate)
Regression set: cases the old version handled well (regression.jsonl · Ch 19)

The most common accident after DPO: the refusal tone improved but general answers got shorter. You won't catch this without out-of-distribution eval.
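A cheap way to catch the "answers got shorter" accident is to compare answer lengths before and after DPO on the same held-out prompts. The sketch below counts whitespace-separated words as a rough proxy for tokens, and the 20% threshold is an arbitrary assumption; it complements quality scoring rather than replacing it.

```python
from statistics import mean
from typing import List


def length_regression(before: List[str], after: List[str],
                      max_drop: float = 0.2) -> dict:
    """Compare mean answer length (in whitespace-separated words) on the same prompts."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty answer lists for the same prompts")
    mean_before = mean(len(a.split()) for a in before)
    mean_after = mean(len(a.split()) for a in after)
    drop = 0.0 if mean_before == 0 else (mean_before - mean_after) / mean_before
    return {"mean_before": mean_before, "mean_after": mean_after,
            "relative_drop": drop, "regressed": drop > max_drop}
```

Run it separately on the held-out domain set and the regression set, since a drop confined to one of them points to a different cause.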

Cost model

Step	Cost (example)
Label 5K SFT samples	$2K–5K
Train SFT (LoRA/QLoRA)	$50
Create 1K DPO pairs	$1K–3K
Train DPO	$50
Eval infrastructure (Ch 16)	$200/month
Total (one quarter, incl. 3 months of eval infrastructure)	$3.7K–8.7K

The measured operational savings from the small model must recoup this cost for a positive ROI. Recalculate over a year with current prices and observed tokens.
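The ROI check can be sketched as two small functions. The cost inputs are the low and high ends of the table rows above, with eval infrastructure counted for three months; the $1,500 monthly savings figure is purely hypothetical and should come from your own measurements.

```python
def quarter_cost(sft_labels: float, sft_train: float, dpo_pairs: float,
                 dpo_train: float, eval_per_month: float, months: int = 3) -> float:
    """One-cycle cost: one-off items plus eval infrastructure for the cycle."""
    return sft_labels + sft_train + dpo_pairs + dpo_train + eval_per_month * months


def breakeven_months(cycle_cost: float, monthly_savings: float) -> float:
    """Months of measured savings needed to pay back one cycle."""
    if monthly_savings <= 0:
        return float("inf")  # the small model never pays for itself
    return cycle_cost / monthly_savings


low = quarter_cost(2_000, 50, 1_000, 50, 200)    # low end of each table row
high = quarter_cost(5_000, 50, 3_000, 50, 200)   # high end of each table row
# monthly_savings below is a made-up figure for illustration only.
print(low, high, breakeven_months(high, monthly_savings=1_500))
```

If the break-even point is longer than the retraining cycle itself, each quarterly refresh adds cost faster than the small model recovers it.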

6. Common failure points

DPO without SFT first. If you DPO a base model directly, training won't land. SFT teaches "how to format a response"; DPO then teaches "which response is better."
Weak distillation validation. You learn exactly what the teacher generated, bugs and all — teacher bias and hallucinations clone into the student. The filter (judge + rules) is everything.
Constitutional AI illusion. It sounds like "we can align without humans," but in reality you still need human review — to validate the auto-judge itself.
Missing out-of-distribution eval. If you only test on your training distribution, you'll miss generalization failures. Hold-out, adversarial, and regression evals are essential.
Expecting small models to solve everything. A task that demands complex reasoning won't suddenly work on Haiku just because you trained it. A capability ceiling can't be trained away.
Inheriting teacher hallucinations. Even Opus makes factual errors sometimes. Learn them directly, and your small model will reliably repeat them. Add fact-checking to the filter.
Wrong DPO beta. Too high (β > 0.5) and you're stuck near the reference; too low (β < 0.05) and you abandon the reference → training destabilizes. Start with β = 0.1.
Chosen/rejected pairs that differ only on the surface. Pairs with almost no semantic gap won't teach anything. They need clear differences in tone, accuracy, or completeness.
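The last pitfall lends itself to an automated pre-check. The sketch below flags preference pairs that are missing a column or whose chosen and rejected texts are nearly identical at the character level. Character similarity (via the standard library's difflib) is only a rough proxy for semantic gap, and the 0.9 threshold is an assumption; flagged pairs are candidates for human review, not automatic deletion.

```python
from difflib import SequenceMatcher
from typing import Dict, List

REQUIRED = ("prompt", "chosen", "rejected")


def weak_pairs(pairs: List[Dict[str, str]], max_similarity: float = 0.9) -> List[int]:
    """Return indices of pairs that are malformed or nearly identical."""
    flagged = []
    for i, pair in enumerate(pairs):
        if any(not pair.get(key, "").strip() for key in REQUIRED):
            flagged.append(i)  # missing or empty column
            continue
        ratio = SequenceMatcher(None, pair["chosen"], pair["rejected"]).ratio()
        if ratio >= max_similarity:
            flagged.append(i)  # character-level near-duplicate
    return flagged
```

Two answers can also differ heavily in wording while meaning the same thing, which this check will miss; an LLM judge or a reviewer has to cover that case.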

7. Operations checklist

8. Exercises and Part 7 wrap-up

You have 5K queries and a labeler budget of $500. Design a distillation process to create SFT data, including your filtering policy.
Your SFT model's refusals are too cold. You want to warm them up with DPO. Design five (chosen, rejected) pairs to achieve it.
Your 8B SFT+DPO model regresses by 3 points on out-of-distribution eval vs. baseline. Diagnose and list three remedies.
Identify one task in your domain that "only large models can solve" and explain why a small model can't learn it, from a capability perspective.

Part 7 wrap-up: five graduation artifacts from Models & Finetuning

#	Artifact	From chapter
①	Model card review template + context limits, tokenizers, licenses	Ch 31
②	Finetuning decision document (did you pass all four gates?)	Ch 32
③	LoRA PoC notebook + effect measurement results	Ch 33
④	SFT → DPO operations pipeline design	Ch 34
⑤	Small model routing strategy (bridges Ch 30 and Ch 34)	Ch 30 + 34

Bridge to capstone: Self-Improving Assistant

Every piece from Parts 1–7 comes together at the capstone.

Part	Piece	Role in capstone
1	LLM fundamentals	What's possible
2	APIs, prompts, tools	How to call models
3	RAG	Domain facts
4	Eval and debugging	Self-measurement
5	Agents, LangGraph, memory	Loop and state
6	Guardrails, monitoring, cost	Operations
7	Models and finetuning	Self-improvement (failed cases become training data)

The Self-Improving Assistant stitches these seven pieces into one system and runs a closed loop: user feedback → auto-join to eval set → quarterly SFT/DPO → deploy.

Next → Capstone: Self-Improving Assistant

Sources

Rafailov et al. (2023) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Ouyang et al. (2022) Training Language Models to Follow Instructions with Human Feedback (InstructGPT, RLHF)
Bai et al. (2022) Constitutional AI: Harmlessness from AI Feedback (Anthropic)
Hinton et al. (2015) Distilling the Knowledge in a Neural Network
Hugging Face TRL — DPOTrainer docs
Stanford CME 295 Lec 5