The Return of Small Models

What you'll learn

  • Why the "bigger is always better" assumption broke around 2024 — three forces: data quality, synthetic data, and distillation
  • Where SLMs (Small Language Models) stand right now — Phi-3-mini, SmolLM2, MobileLLM, Gemma 2-2B on the map
  • The target scale for what you'll build — where a 10M-parameter model sits in the wider landscape

1. Concept — what counts as "small"

Small Language Model (SLM) means "small relative to frontier LLMs"; there is no official cutoff. In 2026, the industry roughly uses 100M to 7B parameters as the range. Frontier models such as GPT-4, Claude, and Gemini Ultra have no published parameter counts, but common estimates put them at hundreds of billions to a trillion parameters, so SLMs are roughly 1/1000 to 1/100 of that scale.

The model you'll build in this book is even smaller than that range: 10M parameters. The only reason is that you need to run it end-to-end on a laptop in under four hours.

| Tier | Parameters | Examples | Distance from you |
| --- | --- | --- | --- |
| Frontier | 200B+ | GPT-4, Claude Opus, Gemini Ultra | Different planet |
| Large | 7B–70B | Llama 3 70B, Mistral 7B | Needs a big GPU |
| SLM | 100M–7B | Phi-3-mini 3.8B, SmolLM2 1.7B, Gemma 2-2B | Inference on a laptop: yes |
| Tiny | 1M–100M | TinyStories 1M–33M, SmolLM2-135M | You'll train this |
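
To make the Tiny row concrete, here is a rough back-of-the-envelope parameter count for a decoder-only transformer. It is a sketch, not this book's actual configuration: the layer count, width, and vocabulary size are hypothetical values chosen only because they land near 10M, and the formula ignores biases and normalization weights.

```python
def estimate_params(n_layers: int, d_model: int, vocab_size: int,
                    tie_embeddings: bool = True) -> int:
    """Rough decoder-only transformer parameter count (no biases, no norms)."""
    attention = 4 * d_model * d_model      # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)      # up and down projections, 4x expansion
    per_layer = attention + mlp            # = 12 * d_model^2
    embeddings = vocab_size * d_model      # token embedding table
    if not tie_embeddings:
        embeddings *= 2                    # separate output projection
    return n_layers * per_layer + embeddings


# Hypothetical configuration, picked only to illustrate the ~10M scale.
total = estimate_params(n_layers=8, d_model=256, vocab_size=16384)
print(f"{total / 1e6:.1f}M parameters")
```

At this scale the embedding table is a large share of the total, which is one reason tiny models tend to use small vocabularies.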

"Is what we're building a real LLM?" — yes. Same transformer architecture, same tokenizer, same training loop, same evaluation procedure. The only difference is a narrower range of things it can say. A 1M model trained on TinyStories writes short children's stories. It can't write code. That's what "small" means in practice.


2. Why small models came back — three forces

From 2020 to 2022 the mood in ML was scale, scale, scale. After GPT-3 (175B), everyone raced toward bigger models. Then around late 2023 something shifted: models 10× smaller started matching the same capabilities. Three forces hit at the same time.

[Figure: Three forces behind the small-model revival]

Force 1. Data quality partially replaces scale

Microsoft's Phi series (2023–2024) is the clearest example. Phi-1 (1.3B) was trained on "textbook-quality" code data, a mix of heavily filtered web code and synthetic textbooks and exercises, and beat much larger general models on HumanEval. Phi-2 (2.7B) and Phi-3-mini (3.8B) extended the same idea.

"Textbooks Are All You Need" — Gunasekar et al., 2023, the title of the Phi-1 paper

The core claim: given the same compute budget, training a smaller model for longer on carefully curated data wins. The Chinchilla (2022) "compute-optimal" tokens-per-parameter ratio was intentionally exceeded, and over-training became the SLM standard.
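
A small arithmetic sketch of what "exceeding the Chinchilla ratio" means. It assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter as a reading of Hoffmann et al. (2022); the 1B-token budget is a hypothetical number, not this book's actual training budget.

```python
# Rule-of-thumb reading of Chinchilla: about 20 training tokens per parameter.
# This constant is an approximation, not an exact figure from the paper.
CHINCHILLA_TOKENS_PER_PARAM = 20


def chinchilla_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model of n_params."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


def overtraining_factor(n_params: float, n_tokens: float) -> float:
    """How many times past the compute-optimal point a training run goes."""
    return n_tokens / chinchilla_tokens(n_params)


n_params = 10e6        # the 10M model from this chapter
token_budget = 1e9     # hypothetical budget, for illustration only

print(f"compute-optimal tokens: {chinchilla_tokens(n_params):,.0f}")
print(f"over-training factor:   {overtraining_factor(n_params, token_budget):.1f}x")
```

A factor above 1 means the model sees more tokens than the compute-optimal recipe would suggest, which costs extra training compute but yields a better model at a fixed size.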

Force 2. Synthetic data went mainstream

TinyStories (Eldan & Li, 2023) was the breakthrough. Models as small as 1M parameters, trained only on synthetic children's stories generated by GPT-3.5 and GPT-4, produced coherent narratives. It proved you could go that small and still have something that worked.

What followed:

  • Cosmopedia (HuggingFace, 2024): about 25B tokens of synthetic textbooks, blog posts, and stories
  • FineWeb-Edu (HuggingFace, 2024): 1.3T tokens filtered from web crawls by "educational value" score
  • Phi-3's synthetic data fraction (exact ratio undisclosed, but "substantial")

The era of pouring raw web dumps into a model is ending. What you feed it matters as much as model size.

Force 3. Distillation became standard practice

You use a big model to teach a small one. Gemma 2-2B (Google, 2024) was distilled from a larger Gemma 2 model. SmolLM2 (HuggingFace, 2024–2025) also leaned on synthetic and distilled data. Distillation is now something you do by default when building a small model.

This book doesn't implement distillation (the name comes up, nothing more). Instead you'll directly experience two of the three forces: synthetic data (TinyStories) + intentional over-training.
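
For orientation only, the core of "a big model teaches a small one" fits in a few lines. The sketch below shows one common formulation, a KL divergence between temperature-softened teacher and student distributions. It is not claimed to be the recipe used by Gemma 2 or SmolLM2, and the logits are made-up toy values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student's current prediction
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token logits over a 3-token vocabulary (illustrative values).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(close_student, teacher))  # small: student agrees
print(distillation_loss(far_student, teacher))    # large: student disagrees
```

In a real training loop this term is computed per token over the full vocabulary and is often mixed with the ordinary next-token cross-entropy loss.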


3. Where things stand — SLM coordinates in 2026

| Model | Parameters | Released | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| Phi-3-mini | 3.8B | 2024-04 | "Textbook data" effect, strong reasoning | Weak on non-English |
| Phi-3.5-mini | 3.8B | 2024-08 | 128K context, better multilingual | Still English-centric |
| SmolLM2-135M / 360M / 1.7B | 0.135–1.7B | 2024-11 | Laptop inference, open training recipe | More hallucination at smaller sizes |
| MobileLLM | 125M / 350M | 2024-04 (Meta) | Sub-billion architecture research (deep & thin) | Research only, not for general use |
| Gemma 2-2B | 2B | 2024-07 | Distilled from a larger sibling | License restrictions (Gemma license) |
| Llama 3.2-1B / 3B | 1B / 3B | 2024-09 | Mobile target, tool calling | 1B is weak at reasoning |

The 10M model you'll build doesn't appear anywhere in this table — it's much smaller than even SmolLM2-135M. That makes SmolLM2-135M / TinyStories-33M the comparison targets: if you feed the same data (TinyStories), do you get similar story quality? That's your sanity check.


4. Minimal example — run SmolLM2-135M

Before building anything, let's see what a model at this scale actually produces. SmolLM2-135M takes about 30 seconds to load on Colab.

hello_smollm.py
```python
# pip install -q transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # (1)
import torch

name = "HuggingFaceTB/SmolLM2-135M"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32)  # (2)

prompt = "Once upon a time"
inputs = tok(prompt, return_tensors="pt")  # input_ids plus attention_mask
out = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```
  1. Only transformers and torch needed. Works on free Colab CPU.
  2. 135M × 4 bytes ≈ 540 MB. Fits easily in free Colab RAM (12 GB).
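
The arithmetic in note 2 generalizes to any size and precision. A minimal sketch follows; `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative names, and the result covers weights only, not activations or the KV cache.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for label, n_params in [("135M", 135e6), ("360M", 360e6), ("1.7B", 1.7e9)]:
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(n_params, dtype):.2f} GB"
        for dtype in ("float32", "bfloat16", "int4")
    )
    print(f"SmolLM2-{label} -> {row}")
```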

Typical output:

Once upon a time, there was a little girl who lived in a small village. She loved to play in the fields and chase butterflies. One day, she found a small kitten under a tree...

Three things to notice:

  1. English stories are fluent. SmolLM2 was trained on English-heavy web, educational, and synthetic text, so a fairy-tale opening is well within its range.
  2. Feed it a non-English prompt and the tokens exist but the output breaks down. Training data was overwhelmingly English.
  3. Run the same prompt five times and you get five different continuations — sampling is probabilistic (covered in Part 5).
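
To see why repeated runs differ, here is a toy version of what `temperature=0.8, top_p=0.9` do to a next-token distribution. This is a simplified illustration in plain Python, not the internal implementation used by `generate`, and the four-token vocabulary and its logits are invented for the example.

```python
import math
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        mass += p
        if mass >= top_p:
            break
    return {t: p / mass for t, p in kept.items()}  # renormalize to sum to 1


def sample_next(logits, temperature, top_p, rng):
    """Temperature-scale the logits, apply top-p, then draw one token."""
    scaled = {t: v / temperature for t, v in logits.items()}
    peak = max(scaled.values())
    exps = {t: math.exp(v - peak) for t, v in scaled.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


# Invented logits for the token after "Once upon a time,".
toy_logits = {" there": 2.0, " a": 1.5, " in": 0.5, " zebra": -3.0}
rng = random.Random(0)
print([sample_next(toy_logits, 0.8, 0.9, rng) for _ in range(5)])
```

Unlikely tokens fall outside the nucleus and are never drawn, while the remaining candidates are chosen at random in proportion to their probability.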

5. Hands-on tutorial — compare three sizes

Feed the same prompt to SmolLM2 at 135M, 360M, and 1.7B to see where "makes sense" begins. A Colab T4 (16 GB) holds the 1.7B model comfortably: 4 bytes × 1.7B ≈ 6.8 GB in float32, or about 3.4 GB in the bfloat16 used here.

size_compare.py
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # use the T4 when present
prompt = "The reason small language models came back in 2024 is"
sizes = ["135M", "360M", "1.7B"]
for s in sizes:
    name = f"HuggingFaceTB/SmolLM2-{s}"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to(device)  # (1)
    inputs = tok(prompt, return_tensors="pt").to(device)  # keep inputs on the model's device
    out = model.generate(
        **inputs, max_new_tokens=80,
        do_sample=False,  # (2)
        repetition_penalty=1.1,
    )
    print(f"\n=== {s} ===")
    print(tok.decode(out[0], skip_special_tokens=True))
    del model  # (3)
    if device == "cuda":
        torch.cuda.empty_cache()
```
  1. bfloat16 halves memory relative to float32. The T4 has no native bf16 support, so fp16 is usually the faster choice there; bf16 still runs, only more slowly.
  2. Greedy decoding: deterministic, for a fair comparison.
  3. Free memory before loading the next model.

What to expect:

  • 135M — "the reason small language models came back is the reason small language models..." — repetition loops are common.
  • 360M — one plausible-sounding sentence, then topic drift.
  • 1.7B — "...because of better data curation, distillation, and a focus on quality over quantity." — conceptually correct answers appear much more often.

Observation 1: Capability increases smoothly with size, but there's a threshold below which sentences stop being coherent. In informal comparisons like this one, that threshold for general English text sits at roughly 300M–500M; treat it as a rule of thumb, not a measured boundary.

Observation 2: Your 10M model won't come close to these outputs. But restrict the domain to TinyStories children's stories and 10M is enough to write a coherent page-long story. That's what Eldan & Li demonstrated.
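
If you want a number to go with the eyeball comparison, one crude option is to measure how often a generation repeats its own word trigrams. This is a hypothetical helper, not a metric used in this book, and the two strings are hand-written stand-ins for model output.

```python
def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Hand-written examples standing in for a looping and a non-looping output.
looping = ("the reason small models came back is "
           "the reason small models came back is the reason")
varied = ("better data curation and distillation let small models "
          "match older large ones")

print(f"looping: {repeated_ngram_fraction(looping):.2f}")
print(f"varied:  {repeated_ngram_fraction(varied):.2f}")
```

A value near 0 means little repetition; values approaching 1 indicate the model is stuck in a loop. It says nothing about whether the text is correct.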


6. Common pitfalls

1. "Small models work like a general LLM." They don't. SLMs have a narrow domain and hallucinate frequently. You can't drop one into a chatbot backend as-is — they're typically limited to classification, extraction, or short controlled generation.

2. Context windows are tiny. SmolLM2-135M has a 2K token context (Phi-3.5 is an exception at 128K). You can't stuff long documents into these models for RAG.

3. Non-English is weak. Training data is overwhelmingly English. If you want good Korean (or another language), you need to continue training on your own data. The capstone shows you one path to do that.

4. "X GB of RAM fits an XB model" is an oversimplification. The KV cache adds memory on top of model weights at inference time. A 1.7B model at fp16 is 3.4 GB, but an 8K context adds on the order of 1 GB of KV cache on top, with the exact figure depending on the layer count and the number of key-value heads. (Ch 11 works through the exact math.)
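
A rough sketch of the KV cache estimate from pitfall 4. The layer count, number of key-value heads, and head dimension are assumed values for a 1.7B-class model and should be checked against the real model config before relying on the result.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    # Leading factor 2: one key tensor plus one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed architecture for a 1.7B-class model (verify against its config).
cache = kv_cache_bytes(n_layers=24, n_kv_heads=32, head_dim=64, seq_len=8192)
print(f"KV cache at 8K context, fp16: {cache / 2**30:.2f} GiB")
```

Under these assumptions the cache comes to about 1.5 GiB, the same order of magnitude as the weights themselves. Models that share key-value heads across query heads (grouped-query attention) shrink this number proportionally.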


7. Production checklist

| Device | Inference (approx.) | Training from scratch |
| --- | --- | --- |
| Mobile (4 GB) | 135M – 1B (int4) | Nearly impossible |
| Laptop CPU/M2 | 1B – 7B (int4/int8) | 1M – 30M (this book's range) |
| Colab T4 (16 GB) | 1B – 13B (int4) | 30M – 200M |
| Single A100 (80 GB) | 70B (int4) | 7B |

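Training uses 6–12× more memory than inference, because gradients, Adam optimizer state, and activations all sit on top of the weights. Together with the four-hour time budget, that puts the real ceiling for training on a laptop at around 30M parameters. That's why this book uses 10M as the baseline.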
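
The fixed part of that training overhead is easy to estimate. The sketch below assumes plain float32 training with Adam and counts only weights, gradients, and the two Adam moment buffers; activations come on top and depend on batch size and sequence length, so treat the result as a lower bound.

```python
def training_state_bytes(n_params: float, bytes_per_value: int = 4) -> float:
    """Weights + gradients + Adam moments, excluding activations."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adam_moments = 2 * n_params * bytes_per_value  # first and second moment
    return weights + gradients + adam_moments


for label, n_params in [("10M", 10e6), ("30M", 30e6), ("200M", 200e6)]:
    state = training_state_bytes(n_params)
    fp16_inference = n_params * 2  # weights only, 2 bytes per parameter
    print(f"{label}: {state / 1e9:.2f} GB training state, "
          f"{state / fp16_inference:.0f}x the fp16 inference weights")
```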


8. Exercises

  1. Feed SmolLM2-135M, 360M, and 1.7B the same non-English prompt (try "Once upon a time in a small village" in any language you know). Observe at which size the output breaks down, and summarize your finding in one sentence.
  2. Phi-3-mini (3.8B) and SmolLM2-1.7B sit in the same low-billions size class, yet their strengths and weaknesses differ. Based on their papers or blog posts, write a paragraph on how their training data compositions differ.
  3. Add your own laptop to the tier table in §1. Based on its RAM and CPU/GPU, which row fits?
  4. (Think about it) If distillation improved enough that a 1M model could reach 80% of GPT-4's capability, which chapters in this book would lose relevance — and which would become more important?

Sources

  • Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759
  • Gunasekar et al. (2023). Textbooks Are All You Need. (Phi-1) arXiv:2306.11644
  • Abdin et al. (2024). Phi-3 Technical Report. arXiv:2404.14219
  • Liu et al. (2024). MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. (Meta) arXiv:2402.14905
  • Hoffmann et al. (2022). Training Compute-Optimal Large Language Models. (Chinchilla) arXiv:2203.15556
  • HuggingFace SmolLM2 blog (2024–2025) · Cosmopedia / FineWeb-Edu dataset cards
  • Gemma Team. (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118