Distillation Mini

What you'll learn

Distillation — the pattern where a teacher model trains a student model
Synthetic labels from a teacher (Qwen 2.5-1.5B), then SFT on a student (Qwen 2.5-0.5B or the book's 30M model)
A mini version of what SmolLM2 / Gemma 2 actually did
Why filtering matters — preventing teacher hallucinations and biases from copying to the student

Prerequisites

Ch 24 LoRA, Ch 5 Synthetic Data, Ch 7 Data Quality.

Distillation — Teacher → Filter → Student

1. Concept: "A Larger Model Teaches a Smaller One"

Distillation (Hinton et al., 2015) originally meant soft-target distillation: the student learns to match the teacher's temperature-softened output distribution over the vocabulary. For modern LLMs, the common approach is hard distillation: the student is fine-tuned (SFT) on text generated by the teacher.

Approach	What the student learns	Used in
Soft distillation	Teacher's vocab probability distribution (logits)	DistilBERT
Hard distillation	Teacher's text output only	Modern LLM standard

This book uses hard distillation — simple, and the code is nearly identical to LoRA SFT.
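
To make the difference between the two rows concrete, here is a minimal sketch of both losses on a toy 3-token vocabulary. It uses only the standard library. The function names, the temperature of 2.0, and the logit values are illustrative assumptions; the T² scaling follows the suggestion in Hinton et al. (2015).

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def hard_distill_loss(student_logits, teacher_token_id):
    """Cross-entropy against the single token the teacher actually emitted."""
    q = softmax(student_logits)
    return -math.log(q[teacher_token_id])

teacher_logits = [2.0, 1.0, 0.1]   # made-up values for illustration
student_logits = [1.5, 1.2, 0.3]
print(soft_distill_loss(teacher_logits, student_logits))
print(hard_distill_loss(student_logits, teacher_token_id=0))
```

The soft loss needs the teacher's full logit vector at every position, while the hard loss needs only the token ids of the teacher's text. That is why hard distillation can reuse an ordinary SFT pipeline.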

2. Why Distillation Works

Aspect	Human labeling	Distillation
Cost	$5/pair	$0.001/pair (API teacher, e.g. Haiku)
Speed	10K pairs/week	10K pairs/hour
Consistency	varies across labelers	consistent teacher
Capability ceiling	human ability	teacher's ability

The core idea: the teacher replaces expensive human labelers. But the teacher's hallucinations and biases flow directly into the student — filtering is essential.
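
As a rough worked example, the table's per-pair figures can be scaled to the 5K-pair dataset used later in this chapter. The per-pair prices and throughputs come from the table; treating "10K pairs/week" as round-the-clock wall-clock time is an assumption made only to get a comparable ratio.

```python
PAIRS = 5_000                                # dataset size used later in this chapter

HUMAN_COST_PER_PAIR = 5.0                    # USD, from the table
TEACHER_COST_PER_PAIR = 0.001                # USD, from the table (API teacher)
HUMAN_PAIRS_PER_HOUR = 10_000 / (7 * 24)     # 10K pairs/week, assumed wall-clock
TEACHER_PAIRS_PER_HOUR = 10_000              # 10K pairs/hour

human_cost = PAIRS * HUMAN_COST_PER_PAIR
teacher_cost = PAIRS * TEACHER_COST_PER_PAIR
speedup = TEACHER_PAIRS_PER_HOUR / HUMAN_PAIRS_PER_HOUR

print(f"human labeling:   ${human_cost:,.0f}")
print(f"teacher labeling: ${teacher_cost:,.2f}")
print(f"throughput ratio: {speedup:.0f}x")
```

Under these numbers, 5K pairs cost about $25,000 from human labelers versus about $5 from an API teacher, and the teacher is roughly 168 times faster in wall-clock terms.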

Real-world examples

Gemma 2-2B: distilled from a larger Gemma 2 teacher (logit-level distillation, per the Gemma 2 technical report)
SmolLM2: effectively distilled via Cosmopedia (Mixtral-synthesized data)
Phi-3.5-mini: trained heavily on LLM-generated synthetic data
Llama 3.2 1B/3B: pruned from Llama 3.1 8B and distilled using Llama 3.1 8B and 70B logits

Modern SLM training pipelines lean heavily on distillation and teacher-generated synthetic data.

3. Book's Mini Scenario

Role	Model	Size
Teacher	Qwen 2.5-1.5B-Instruct	1.5B
Student	Qwen 2.5-0.5B-Instruct	0.5B
Domain	Korean fairytales

Target after training: the Student reaches roughly 90–95% of the Teacher's fairytale quality at one third the size and about 3× the speed.

4. Generating Teacher Labels

Same code pattern as Ch 5's synthetic data — but Teacher is a local Qwen 1.5B, not an API.

distill_collect.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json, random

teacher = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(teacher)
# The T4 has no native bfloat16 support; torch.float16 is the usual choice there.
T = AutoModelForCausalLM.from_pretrained(teacher, torch_dtype=torch.bfloat16, device_map="auto")

CHARACTERS = ["Rabbit Toto","Bear Dudu","Grandmother","Cat Mimi"]
KEYWORDS = ["carrot","rain","moon","friend","mom","flower"]

samples = []
for i in range(5000):
    char = random.choice(CHARACTERS)
    kws = random.sample(KEYWORDS, 2)
    prompt = f"Write one Korean fairytale for children ages 3–5. Character: {char}. Keywords: {kws[0]}, {kws[1]}. 200–400 characters."
    msgs = [{"role":"user","content":prompt}]
    # return_dict=True yields input_ids + attention_mask; move both to the model's device
    inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(T.device)
    out = T.generate(**inputs, max_new_tokens=500, temperature=0.8, top_p=0.9, do_sample=True)
    # Decode only the newly generated tokens, not the prompt
    text = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    samples.append({"instruction": prompt, "output": text})
    if i % 100 == 0: print(f"  {i}/5000")

# Save the raw pairs so filtering can run as a separate step (file name is an assumption)
with open("distill_raw.jsonl", "w", encoding="utf-8") as f:
    for s in samples: f.write(json.dumps(s, ensure_ascii=False) + "\n")

5K pairs take roughly 1–2 hours on a T4, depending on output length and batching. API cost: $0.
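
Before generating, it helps to know how many distinct prompts the template can produce. The sketch below enumerates them with the same CHARACTERS and KEYWORDS lists and shows that the 5,000 requests draw on a small pool, so most of the variety has to come from sampling (temperature and top_p). The variable names are illustrative.

```python
from itertools import permutations

CHARACTERS = ["Rabbit Toto", "Bear Dudu", "Grandmother", "Cat Mimi"]
KEYWORDS = ["carrot", "rain", "moon", "friend", "mom", "flower"]

# random.sample(KEYWORDS, 2) yields an ordered pair of distinct keywords,
# so every possible prompt is one (character, ordered keyword pair) combination.
distinct_prompts = {
    (char, kw1, kw2)
    for char in CHARACTERS
    for kw1, kw2 in permutations(KEYWORDS, 2)
}

n_requests = 5000
print(len(distinct_prompts))                # 4 characters * 6 * 5 ordered pairs = 120
print(n_requests / len(distinct_prompts))   # average number of repeats per prompt
```

With only 120 distinct prompts, each one is requested about 42 times on average. Adding more characters, keywords, or template variations is one way to address the instruction-diversity failure mode listed in section 7.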

5. Filtering

distill_filter.py
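def passes(text, instruction):
    # Length window: drop truncated or rambling stories
    if len(text) < 150 or len(text) > 600: return False
    # Meta words that break the fairytale frame (for Korean output, add the Korean equivalents)
    if any(w in text for w in ["AI","GPT","model","robot","artificial intelligence"]): return False
    # Crude repetition check: one word used too often
    if text.count("like") > 5: return False
    # Prompt echo: the teacher repeated the start of the instruction
    if instruction[:20] in text: return False
    # At least half of the characters must be Hangul syllables
    ko = sum(1 for c in text if "가" <= c <= "힣")
    if ko / max(len(text), 1) < 0.5: return False
    return True

# `samples` is the list of pairs built in distill_collect.py
filtered = [s for s in samples if passes(s["output"], s["instruction"])]
print(f"Pass rate: {len(filtered)/len(samples):.0%}")    # typically 50–80%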
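
When the pass rate is lower than expected, it is useful to know which rule is rejecting samples. The following sketch mirrors the rules of passes() but returns a reason label instead of a boolean. The function name, the reason labels, and the toy samples are all made up for illustration.

```python
from collections import Counter

def reject_reason(text, instruction):
    """Return the name of the first rule that rejects the sample, or None if it passes."""
    if len(text) < 150:
        return "too_short"
    if len(text) > 600:
        return "too_long"
    if any(w in text for w in ["AI", "GPT", "model", "robot", "artificial intelligence"]):
        return "meta_word"
    if text.count("like") > 5:
        return "repetitive"
    if instruction[:20] in text:
        return "prompt_echo"
    hangul = sum(1 for c in text if "가" <= c <= "힣")
    if hangul / max(len(text), 1) < 0.5:
        return "low_korean_ratio"
    return None

# Toy samples (not real teacher output).
instruction = "Write one Korean fairytale for children ages 3-5."
toy_samples = [
    {"instruction": instruction, "output": "가" * 200},            # passes
    {"instruction": instruction, "output": "가" * 50},             # too short
    {"instruction": instruction, "output": "a" * 200},             # mostly non-Korean
    {"instruction": instruction, "output": "가" * 200 + " GPT"},   # mentions a model name
]

reasons = Counter(
    reject_reason(s["output"], s["instruction"]) or "pass" for s in toy_samples
)
print(reasons.most_common())
```

If one reason dominates on real data, that usually points at the prompt or the teacher rather than the filter, for example a teacher that answers in English would show up as low_korean_ratio.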

Optionally: run LLM judge scoring (Ch 17) and keep only scores ≥ 3.
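
The judge step can be sketched as a second pass over the rule-filtered pairs. Everything here is an assumption for illustration: the 1 to 5 score scale, the `judge_score` field, and in particular `toy_judge`, which is a stand-in that scores by length so the example runs without a model. In practice the judge would be the LLM-based scorer from Ch 17.

```python
def keep_high_scores(samples, judge, min_score=3):
    """Keep samples whose judge score is at least min_score, and record the score."""
    kept = []
    for s in samples:
        score = judge(s["instruction"], s["output"])
        if score >= min_score:
            kept.append({**s, "judge_score": score})
    return kept

def toy_judge(instruction, output):
    """Hypothetical stand-in for an LLM judge: longer stories score higher (1 to 5)."""
    return min(5, 1 + len(output) // 100)

toy_samples = [
    {"instruction": "p1", "output": "가" * 120},   # toy score 2
    {"instruction": "p2", "output": "가" * 250},   # toy score 3
    {"instruction": "p3", "output": "가" * 450},   # toy score 5
]
kept = keep_high_scores(toy_samples, toy_judge)
print([s["judge_score"] for s in kept])
```

Passing the judge in as a function keeps the filtering logic independent of how the scores are produced.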

6. Student SFT: Same Code as Ch 24

# Identical to Ch 24 LoRA SFT code
# Only the data file changes: distill_train_filtered.jsonl
base = "Qwen/Qwen2.5-0.5B-Instruct"   # Student
# ... LoRA + Trainer + train ...
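
The only new step is turning the filtered pairs into the training file. The sketch below writes distill_train_filtered.jsonl in a chat-style "messages" layout. That layout is an assumption about what the Ch 24 code expects, so adjust the field names if the SFT script reads a different schema. The single sample pair is made up.

```python
import json

def to_chat_record(sample):
    """Map one {"instruction", "output"} pair to a chat-style SFT record."""
    return {
        "messages": [
            {"role": "user", "content": sample["instruction"]},
            {"role": "assistant", "content": sample["output"]},
        ]
    }

def write_jsonl(records, path):
    """Write one JSON object per line, keeping Korean text human-readable."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Made-up example pair standing in for the real filtered list.
filtered = [
    {"instruction": "토끼 토토 동화를 써 주세요.", "output": "옛날 옛적에 토끼 토토가 살았어요."}
]
write_jsonl([to_chat_record(s) for s in filtered], "distill_train_filtered.jsonl")
```

Using ensure_ascii=False keeps Hangul as plain text instead of \u escapes, which makes the human review of samples much easier.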

After training:

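Student quality ≈ 90–95% of the Teacher's on the fairytale domain. Perplexity is lower-is-better, so expect the Student's PPL to be slightly higher than the Teacher's, not 90–95% of it.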
3× faster, 1/3 the memory
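
For the perplexity comparison, a minimal sketch of the computation is shown below. Perplexity is the exponential of the mean negative log-probability per token. The log-probability values are made up. The comparison assumes both models are scored on the same held-out text with the same tokenizer, since per-token perplexities are not comparable otherwise.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean log-probability) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up per-token log-probabilities on the same held-out text.
teacher_logprobs = [-1.0, -1.2, -0.8, -1.0]
student_logprobs = [-1.1, -1.3, -0.9, -1.1]

teacher_ppl = perplexity(teacher_logprobs)
student_ppl = perplexity(student_logprobs)
print(f"teacher PPL {teacher_ppl:.2f}, student PPL {student_ppl:.2f}")
print(f"student/teacher PPL ratio: {student_ppl / teacher_ppl:.2f}")
```

A ratio slightly above 1 means the student is slightly worse than the teacher, which is the expected outcome. A ratio below 1 on the teacher's own domain is worth double-checking for evaluation leakage.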

7. Common Failure Points

Skipping the filter: Teacher hallucinations copy into the student. A 50%+ pass rate is a safe target.
Too large a teacher-student size gap: 100B → 0.5B is too much. A 3–10× ratio works well.
Distillation chains: Distilling from an already-distilled model compounds errors and risks model collapse.
Domain shift in the teacher: Verify the teacher's Korean training coverage for the target domain.
Not enough instruction diversity: The same prompt repeated yields similar outputs from the teacher.
Expecting the student to beat the teacher: The student mimics the teacher, so on the distilled task it generally cannot exceed the teacher's capability.

8. Ops Checklist

Teacher/student size ratio: 3–10×
Verify teacher's domain capability
Generate 5K–50K pairs
Filtering pass rate ≥ 50%
Human review of 100 samples
Evaluate student PPL + domain probes
3-axis teacher vs student comparison (accuracy, speed, memory)
License chain check (Teacher API Terms of Service)
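
The numeric items of the checklist can be verified automatically before a run. The sketch below covers only the three quantitative checks (size ratio, pair count, pass rate). The function name and the 0.65 pass rate in the usage lines are made up for illustration, and the remaining items (human review, license check) still need a person.

```python
def preflight(teacher_params, student_params, n_pairs, pass_rate):
    """Return the checklist items that are not met (an empty list means all good)."""
    problems = []
    ratio = teacher_params / student_params
    if not 3 <= ratio <= 10:
        problems.append(f"size ratio {ratio:.1f}x is outside 3-10x")
    if not 5_000 <= n_pairs <= 50_000:
        problems.append(f"{n_pairs} pairs is outside 5K-50K")
    if pass_rate < 0.5:
        problems.append(f"pass rate {pass_rate:.0%} is below 50%")
    return problems

# The chapter's scenario: Qwen 2.5-1.5B teacher, 0.5B student, 5K pairs.
print(preflight(1.5e9, 0.5e9, n_pairs=5_000, pass_rate=0.65))
# The oversized-teacher case from the failure list: 100B teacher, 0.5B student.
print(preflight(100e9, 0.5e9, n_pairs=5_000, pass_rate=0.65))
```

The first call returns an empty list because 1.5B / 0.5B is exactly a 3× ratio. The second reports the 200× size gap.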

9. Exercises

Run Qwen 2.5-1.5B → 0.5B distillation with 1,000 pairs. What's the pass rate and PPL?
Compare filtering thresholds at 30% / 50% / 80%.
How much does quality improve if Teacher = GPT-4 API?
Compare distillation Student vs LoRA SFT (Ch 24) on the same domain.
(Think about it) If the Teacher is an OpenAI model, can the Student be released under Apache 2.0?

References

Hinton et al. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531
Sanh et al. (2019). DistilBERT. arXiv:1910.01108
Gemma Team (2024). Gemma 2 Technical Report.
HuggingFace SmolLM2 / Cosmopedia blog