Distillation Mini¶
What you'll learn
- Distillation — the pattern where a teacher model trains a student model
- Synthetic labels from a teacher (Qwen 2.5-1.5B) → student (Qwen 2.5-0.5B or the book's 30M) SFT
- A mini version of what SmolLM2 / Gemma 2 actually did
- Why filtering matters — preventing teacher hallucinations and biases from copying to the student
Prerequisites
1. Concept — "A Larger Model Teaches a Smaller One"¶
Distillation (Hinton et al., 2015) originally meant soft-target distillation — the student learns to mimic the teacher's logit distribution. In modern LLMs, the standard is hard distillation — the student SFT-trains on text outputs generated by the teacher.
| Approach | What the student learns | Used in |
|---|---|---|
| Soft distillation | Teacher's vocab probability distribution (logits) | DistilBERT |
| Hard distillation | Teacher's text output only | Modern LLM standard |
This book uses hard distillation — simple, and the code is nearly identical to LoRA SFT.
2. Why Distillation Works¶
| Human labeling | Distillation | |
|---|---|---|
| Cost | $5/pair | $0.001/pair (Haiku) |
| Speed | 10K pairs/week | 10K pairs/hour |
| Consistency | varies across labelers | consistent teacher |
| Capability ceiling | human ability | teacher's ability |
The core idea: the teacher replaces expensive human labelers. But the teacher's hallucinations and biases flow directly into the student — filtering is essential.
Real-world examples¶
- Gemma 2-2B — distilled from the larger Gemma 2-9B/27B
- SmolLM2 — effectively distilled via Cosmopedia (Mixtral-synthesized data)
- Phi-3.5-mini — trained heavily on GPT-4 synthetic data
- Llama 3 small — distilled from Llama 3-405B
Modern SLM training data is almost entirely distillation-based.
3. Book's Mini Scenario¶
| Role | Model | Size |
|---|---|---|
| Teacher | Qwen 2.5-1.5B-Instruct | 1.5B |
| Student | Qwen 2.5-0.5B-Instruct | 0.5B |
| Domain | Korean fairytales |
After training: Student reaches 90–95% of Teacher's fairytale quality at 3× smaller and 3× faster.
4. Generating Teacher Labels¶
Same code pattern as Ch 5's synthetic data — but Teacher is a local Qwen 1.5B, not an API.
5K pairs ≈ 1–2 hours on T4. API cost: $0.
5. Filtering¶
Optionally: run LLM judge scoring (Ch 17) and keep only scores ≥ 3.
6. Student SFT — Same Code as Ch 24¶
# Identical to Ch 24 LoRA SFT code
# Only the data file changes: distill_train_filtered.jsonl
base = "Qwen/Qwen2.5-0.5B-Instruct" # Student
# ... LoRA + Trainer + train ...
After training:
- Student PPL = 90–95% of Teacher's
- 3× faster, 1/3 the memory
7. Common Failure Points¶
- Skipping the filter — Teacher hallucinations copy into the student. A 50%+ pass rate is a safe target.
- Too large a teacher-student size gap — 100B → 0.5B is too much. A 3–10× ratio works well.
- Distillation chains — Distilling from an already-distilled model → model collapse.
- Domain shift in the teacher — Verify the teacher's Korean training coverage for the target domain.
- Not enough instruction diversity — Same prompt repeated = similar outputs from the teacher.
- Expecting the student to beat the teacher — The student mimics the teacher. It can't exceed the teacher's capability.
8. Ops Checklist¶
- Teacher/student size ratio: 3–10×
- Verify teacher's domain capability
- Generate 5K–50K pairs
- Filtering pass rate ≥ 50%
- Human review of 100 samples
- Evaluate student PPL + domain probes
- 3-axis teacher vs student comparison (accuracy, speed, memory)
- License chain check (Teacher API Terms of Service)
9. Exercises¶
- Run Qwen 2.5-1.5B → 0.5B distillation with 1,000 pairs. What's the pass rate and PPL?
- Compare filtering thresholds at 30% / 50% / 80%.
- How much does quality improve if Teacher = GPT-4 API?
- Compare distillation Student vs LoRA SFT (Ch 24) on the same domain.
- (Think about it) If the Teacher is an OpenAI model, can the Student be released under Apache 2.0?
References¶
- Hinton et al. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531
- Sanh et al. (2019). DistilBERT. arXiv:1910.01108
- Gemma Team (2024). Gemma 2 Technical Report.
- HuggingFace SmolLM2 / Cosmopedia blog