Distillation Mini

What you'll learn

  • Distillation — the pattern where a teacher model trains a student model
  • Synthetic labels from a teacher (Qwen 2.5-1.5B) → student (Qwen 2.5-0.5B or the book's 30M) SFT
  • A mini version of what SmolLM2 / Gemma 2 actually did
  • Why filtering matters — preventing teacher hallucinations and biases from copying to the student

Figure: Distillation pipeline (Teacher → Filter → Student)

1. Concept — "A Larger Model Teaches a Smaller One"

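Distillation (Hinton et al., 2015) originally meant soft-target distillation: the student learns to match the teacher's temperature-softened probability distribution over the vocabulary, which is computed from the teacher's logits. In modern LLM practice the more common approach is hard distillation (also called sequence-level distillation): the student is fine-tuned with ordinary SFT on text generated by the teacher, so no access to the teacher's logits is needed.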

Approach | What the student learns | Used in
Soft distillation | Teacher's vocab probability distribution (logits) | DistilBERT
Hard distillation | Teacher's text output only | Modern LLM standard

This book uses hard distillation — simple, and the code is nearly identical to LoRA SFT.
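
To make the difference between the two rows concrete, here is a minimal sketch of both losses at a single token position, in plain Python. The 4-token vocabulary, the logit values, and the function names are made up for illustration. The soft loss is the KL divergence between temperature-softened distributions; the T² scaling factor from Hinton et al. is omitted for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft distillation: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hard_loss(teacher_token_id, student_logits):
    """Hard distillation: cross-entropy on the one token the teacher emitted."""
    q = softmax(student_logits)
    return -math.log(q[teacher_token_id])

# Toy 4-token vocabulary; all numbers are illustrative only.
teacher_logits = [4.0, 2.0, 0.5, -1.0]
student_logits = [3.0, 2.5, 0.0, -0.5]
teacher_token = 0   # the token the teacher actually sampled at this position

print(f"soft loss: {soft_loss(teacher_logits, student_logits):.4f}")
print(f"hard loss: {hard_loss(teacher_token, student_logits):.4f}")
```

The hard loss only needs the teacher's token id, which is why hard distillation works even when the teacher is reachable only as text (for example through an API).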


2. Why Distillation Works

Aspect | Human labeling | Distillation
Cost | $5/pair | $0.001/pair (Haiku)
Speed | 10K pairs/week | 10K pairs/hour
Consistency | varies across labelers | consistent teacher
Capability ceiling | human ability | teacher's ability

The core idea: the teacher replaces expensive human labelers. But the teacher's hallucinations and biases flow directly into the student — filtering is essential.
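
As a rough worked example, the per-pair figures from the table can be scaled to the 5,000 pairs used later in this chapter. This is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes the weekly human rate is spread evenly over calendar time.

```python
def labeling_budget(n_pairs, cost_per_pair, pairs_per_hour):
    """Return (total cost in USD, wall-clock hours) for labeling n_pairs."""
    return n_pairs * cost_per_pair, n_pairs / pairs_per_hour

HOURS_PER_WEEK = 7 * 24   # assumption: "per week" means calendar time

N = 5_000
human = labeling_budget(N, cost_per_pair=5.0, pairs_per_hour=10_000 / HOURS_PER_WEEK)
teacher = labeling_budget(N, cost_per_pair=0.001, pairs_per_hour=10_000)

print(f"human:   ${human[0]:,.0f} over {human[1]:.0f} hours")
print(f"teacher: ${teacher[0]:,.2f} over {teacher[1]:.1f} hours")
```

With the table's numbers this comes to about $25,000 and half a week for human labeling versus about $5 and half an hour for an API teacher.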

Real-world examples

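  • Gemma 2 (2B and 9B): trained with knowledge distillation from a larger teacher, per the Gemma 2 Technical Report (logit-level soft distillation rather than the hard variant used in this chapter)
  • SmolLM2: effectively distilled via Cosmopedia (Mixtral-synthesized data)
  • Phi-3.5-mini: trained heavily on LLM-generated synthetic data (reportedly from GPT-4)
  • Llama 3.2 1B/3B: pruned from Llama 3.1 8B and trained with logits from Llama 3.1 8B and 70B as targets

Modern SLM training relies heavily on teacher-generated data.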


3. Book's Mini Scenario

Role | Model | Size
Teacher | Qwen 2.5-1.5B-Instruct | 1.5B
Student | Qwen 2.5-0.5B-Instruct | 0.5B

Domain: Korean fairytales

Target after training: the Student reaches roughly 90–95% of the Teacher's fairytale quality while being about 3× smaller and about 3× faster.
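
The "3× smaller" figure follows directly from the parameter counts. A minimal sketch of the weight-memory arithmetic, assuming the nominal sizes from the table (1.5B and 0.5B) and 2 bytes per parameter for bfloat16; activations, KV cache, and optimizer state are not counted, and the helper name is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

def weight_memory_gb(n_params, dtype="bfloat16"):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

teacher_gb = weight_memory_gb(1.5e9)   # nominal Teacher size
student_gb = weight_memory_gb(0.5e9)   # nominal Student size

print(f"Teacher weights: {teacher_gb:.1f} GB")
print(f"Student weights: {student_gb:.1f} GB")
print(f"Ratio: {teacher_gb / student_gb:.1f}x")
```

The real checkpoints differ slightly from the nominal sizes, so treat the result as an order-of-magnitude estimate.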


4. Generating Teacher Labels

Same code pattern as Ch 5's synthetic data — but Teacher is a local Qwen 1.5B, not an API.

distill_collect.py
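from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json, random

random.seed(0)  # reproducible prompt sampling

teacher = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(teacher)
# A T4 has no native bfloat16 support; torch.float16 may run faster there.
T = AutoModelForCausalLM.from_pretrained(teacher, torch_dtype=torch.bfloat16, device_map="auto")

CHARACTERS = ["Rabbit Toto","Bear Dudu","Grandmother","Cat Mimi"]
KEYWORDS = ["carrot","rain","moon","friend","mom","flower"]

N = 5000
samples = []
for i in range(N):
    char = random.choice(CHARACTERS)
    kws = random.sample(KEYWORDS, 2)   # two distinct keywords
    prompt = f"Write one Korean fairytale for children ages 3–5. Character: {char}. Keywords: {kws[0]}, {kws[1]}. 200–400 characters."
    msgs = [{"role":"user","content":prompt}]
    # return_dict=True gives input_ids and attention_mask; move both to the model's device
    inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(T.device)
    out = T.generate(**inputs, max_new_tokens=500, temperature=0.8, top_p=0.9, do_sample=True)
    # decode only the newly generated tokens, not the prompt
    text = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    samples.append({"instruction": prompt, "output": text})
    if i % 100 == 0: print(f"  {i}/{N}")

# Save the raw pairs (the file name is an assumption, not from the book)
with open("distill_raw.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")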

5K pairs ≈ 1–2 hours on T4. API cost: $0.
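
One thing worth checking before generating thousands of samples is how many distinct prompts the template can actually produce. The sketch below enumerates them from the same CHARACTERS and KEYWORDS lists; the helper name is hypothetical. Ordered keyword pairs are used because random.sample returns the two keywords in a random order.

```python
from itertools import permutations

CHARACTERS = ["Rabbit Toto", "Bear Dudu", "Grandmother", "Cat Mimi"]
KEYWORDS = ["carrot", "rain", "moon", "friend", "mom", "flower"]

def all_prompts(characters, keywords):
    """Enumerate every prompt string the template can produce."""
    return [
        f"Write one Korean fairytale for children ages 3–5. Character: {c}. "
        f"Keywords: {k1}, {k2}. 200–400 characters."
        for c in characters
        for k1, k2 in permutations(keywords, 2)   # ordered pairs of distinct keywords
    ]

prompts = all_prompts(CHARACTERS, KEYWORDS)
n_unique = len(set(prompts))

print(f"distinct prompts: {n_unique}")
print(f"average repeats across 5,000 samples: {5000 / n_unique:.1f}")
```

With 4 characters and 6 keywords there are only 120 distinct prompts, so each is reused about 40 times in a 5,000-sample run. Sampling with temperature 0.8 still varies the stories, but extending the two lists is a cheap way to raise instruction diversity.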


5. Filtering

distill_filter.py
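import json

# Assistant-style self-references that should never appear in a fairytale.
# The Korean entries and "처럼" are assumptions added because the outputs are Korean.
BANNED = ["AI", "GPT", "model", "robot", "artificial intelligence", "인공지능", "모델", "로봇"]

def passes(text, instruction):
    if len(text) < 150 or len(text) > 600: return False            # length in characters
    if any(w in text for w in BANNED): return False
    if text.count("like") + text.count("처럼") > 5: return False    # overused simile marker
    if instruction[:20] in text: return False                       # teacher echoed the prompt
    ko = sum(1 for c in text if "가" <= c <= "힣")                   # Hangul syllable count
    if ko / max(len(text), 1) < 0.5: return False                   # must be mostly Korean
    return True

# `samples` is the list of {"instruction", "output"} dicts built by distill_collect.py
filtered = [s for s in samples if passes(s["output"], s["instruction"])]
print(f"Pass rate: {len(filtered)/max(len(samples), 1):.0%}")    # 50–80%

# Write the training file that the Student SFT step reads
with open("distill_train_filtered.jsonl", "w", encoding="utf-8") as f:
    for s in filtered:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")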

Optionally: run LLM judge scoring (Ch 17) and keep only scores ≥ 3.
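
When tuning thresholds, it helps to know which rule rejects the most samples rather than only the overall pass rate. The following sketch restates the rules of passes() so that each rejection carries a reason; the reason labels, the function names, and the toy samples are made up for illustration.

```python
from collections import Counter

BANNED = ["AI", "GPT", "model", "robot", "artificial intelligence"]

def reject_reason(text, instruction):
    """Return the name of the first failed rule, or None if the text passes."""
    if len(text) < 150: return "too_short"
    if len(text) > 600: return "too_long"
    if any(w in text for w in BANNED): return "banned_word"
    if text.count("like") > 5: return "repetitive"
    if instruction[:20] in text: return "prompt_echo"
    ko = sum(1 for c in text if "가" <= c <= "힣")
    if ko / max(len(text), 1) < 0.5: return "not_korean"
    return None

def tally(samples):
    """Count samples per outcome ("pass" or the rejection reason)."""
    return Counter(reject_reason(s["output"], s["instruction"]) or "pass" for s in samples)

# Toy samples: repeated characters stand in for real stories.
instr = "Write one Korean fairytale for children ages 3–5."
demo = [
    {"instruction": instr, "output": "가" * 200},            # fine
    {"instruction": instr, "output": "가" * 50},             # too short
    {"instruction": instr, "output": "a" * 200},             # not Korean
    {"instruction": instr, "output": "가" * 200 + " GPT"},   # banned word
]

counts = tally(demo)
for reason, n in counts.most_common():
    print(f"{reason:12s} {n}")
```

If one reason dominates, for example length, it is usually better to adjust the prompt or max_new_tokens than to loosen the filter.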


6. Student SFT — Same Code as Ch 24

# Identical to Ch 24 LoRA SFT code
# Only the data file changes: distill_train_filtered.jsonl
base = "Qwen/Qwen2.5-0.5B-Instruct"   # Student
# ... LoRA + Trainer + train ...
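
Many SFT pipelines expect chat-style records rather than instruction/output pairs. Whether the Ch 24 code needs this layout is an assumption; if it does, a conversion along these lines would bridge the two formats. The function names and the "messages" key are illustrative, not taken from the book.

```python
import json

def to_chat_record(pair):
    """Map an {"instruction", "output"} pair to a chat-style SFT record."""
    return {
        "messages": [
            {"role": "user", "content": pair["instruction"]},
            {"role": "assistant", "content": pair["output"]},
        ]
    }

def convert_jsonl_lines(lines):
    """Convert JSONL lines of pairs into JSONL lines of chat records."""
    return [
        json.dumps(to_chat_record(json.loads(line)), ensure_ascii=False)
        for line in lines
        if line.strip()   # skip blank lines
    ]

# Example with one in-memory line instead of a real file
raw = ['{"instruction": "Write one Korean fairytale.", "output": "옛날 옛적에 토끼 토토가 살았어요."}']
converted = convert_jsonl_lines(raw)
print(converted[0])
```

ensure_ascii=False keeps the Korean text readable in the output file instead of escaping it to \u sequences.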

After training:

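  • Student quality at roughly 90–95% of the Teacher's on the fairytale domain. Perplexity is lower-is-better, so expect the Student's PPL to be slightly higher than the Teacher's, not 90–95% of it
  • About 3× faster, about 1/3 the memory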
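
Perplexity is the exponential of the mean per-token negative log-likelihood, so a lower value is better and a Student that trails the Teacher shows a higher PPL. A minimal sketch of the comparison, using made-up per-token log-probabilities; in practice these would come from scoring the same held-out stories with both models.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_gap(student_ppl, teacher_ppl):
    """How much higher the Student's PPL is, as a fraction of the Teacher's."""
    return (student_ppl - teacher_ppl) / teacher_ppl

# Illustrative log-probabilities for the same 4 tokens under each model
teacher_lp = [-1.0, -1.2, -0.8, -1.0]
student_lp = [-1.1, -1.3, -0.9, -1.1]

t_ppl = perplexity(teacher_lp)
s_ppl = perplexity(student_lp)
print(f"Teacher PPL: {t_ppl:.3f}")
print(f"Student PPL: {s_ppl:.3f}")
print(f"Student is {relative_gap(s_ppl, t_ppl):.1%} higher")
```

Both models must be scored on the same text with the same tokenizer for the comparison to be meaningful, which holds here because Teacher and Student are from the same Qwen 2.5 family.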

7. Common Failure Points

  1. Skipping the filter — Teacher hallucinations copy into the student. A 50%+ pass rate is a safe target.
  2. Too large a teacher-student size gap — 100B → 0.5B is too much. A 3–10× ratio works well.
  3. Distillation chains: distilling from an already-distilled model compounds errors and risks model collapse.
  4. Domain shift in the teacher — Verify the teacher's Korean training coverage for the target domain.
  5. Not enough instruction diversity — Same prompt repeated = similar outputs from the teacher.
  6. Expecting the student to beat the teacher: the student mimics the teacher's outputs, so on the distilled task it should not be expected to exceed the teacher's capability.

8. Ops Checklist

  • Teacher/student size ratio: 3–10×
  • Verify teacher's domain capability
  • Generate 5K–50K pairs
  • Filtering pass rate ≥ 50%
  • Human review of 100 samples
  • Evaluate student PPL + domain probes
  • 3-axis teacher vs student comparison (accuracy, speed, memory)
  • License chain check (Teacher model license or API Terms of Service)
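
The numeric items in this checklist can be checked automatically before a run is signed off. A minimal sketch, with thresholds copied from the list above; the function name and argument names are hypothetical, and the qualitative items (domain capability, license) still need a human.

```python
def check_run(teacher_params, student_params, n_pairs, n_passed, n_human_reviewed):
    """Return the checklist items the run violates (an empty list means all pass)."""
    problems = []

    ratio = teacher_params / student_params
    if not 3 <= ratio <= 10:
        problems.append(f"size ratio {ratio:.1f}x is outside 3-10x")

    if not 5_000 <= n_pairs <= 50_000:
        problems.append(f"{n_pairs} generated pairs is outside 5K-50K")

    pass_rate = n_passed / max(n_pairs, 1)
    if pass_rate < 0.5:
        problems.append(f"filter pass rate {pass_rate:.0%} is below 50%")

    if n_human_reviewed < 100:
        problems.append(f"only {n_human_reviewed} samples were human-reviewed (need 100)")

    return problems

# The chapter's scenario: 1.5B -> 0.5B, 5,000 pairs, 64% passing, 100 reviewed
ok = check_run(1.5e9, 0.5e9, n_pairs=5_000, n_passed=3_200, n_human_reviewed=100)
print(ok or "all numeric checks pass")

# A run that violates every numeric item
bad = check_run(100e9, 0.5e9, n_pairs=1_000, n_passed=300, n_human_reviewed=0)
for p in bad:
    print("-", p)
```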

9. Exercises

  1. Run Qwen 2.5-1.5B → 0.5B distillation with 1,000 pairs. What's the pass rate and PPL?
  2. Compare filtering thresholds at 30% / 50% / 80%.
  3. How much does quality improve if Teacher = GPT-4 API?
  4. Compare distillation Student vs LoRA SFT (Ch 24) on the same domain.
  5. (Think about it) If the Teacher is an OpenAI model, can the Student be released under Apache 2.0?

References

  • Hinton et al. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531
  • Sanh et al. (2019). DistilBERT. arXiv:1910.01108
  • Gemma Team (2024). Gemma 2 Technical Report.
  • HuggingFace SmolLM2 / Cosmopedia blog