The Return of Small Models
What you'll learn
- Why the "bigger is always better" assumption broke around 2024 — three forces: data quality, synthetic data, and distillation
- Where SLMs (Small Language Models) stand right now — Phi-3-mini, SmolLM2, MobileLLM, Gemma 2-2B on the map
- The target scale for what you'll build — where a 10M-parameter model sits in the wider landscape
1. Concept — what counts as "small"
Small Language Model (SLM) means "small relative to frontier LLMs"; there is no official cutoff. As of 2026, the industry loosely uses 100M to 7B parameters as the range. Frontier models such as GPT-4, Claude, and Gemini Ultra do not publish their parameter counts, but they are widely estimated at hundreds of billions to around a trillion, which puts a typical SLM at roughly 1/1000 to 1/100 of that scale.
The model you'll build in this book is even smaller than that range: 10M parameters. The only reason is that you need to run it end-to-end on a laptop in under four hours.
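To make "10M parameters" concrete, here is a rough back-of-the-envelope count for a decoder-only transformer. It is a sketch under stated assumptions (4x MLP expansion, tied embeddings, biases and layer norms ignored), and the layer count, width, and vocabulary size are hypothetical values picked only to land near 10M, not the configuration this book uses.

```python
def approx_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough decoder-only transformer parameter count.

    Assumes a 4x MLP expansion and tied input/output embeddings;
    ignores biases, layer norms, and positional parameters.
    """
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 8 * d_model * d_model         # two matrices of shape d x 4d
    embeddings = vocab_size * d_model   # shared with the output layer
    return n_layers * (attention + mlp) + embeddings


# Hypothetical shape, chosen only to land near 10M parameters.
n = approx_params(n_layers=6, d_model=320, vocab_size=8000)
print(f"{n:,} parameters")
```

Notice how much of the budget the embedding table takes at this scale: about a quarter of the total here, which is why tiny models usually pair with small vocabularies.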
| Tier | Parameters | Examples | Distance from you |
|---|---|---|---|
| Frontier | 200B+ | GPT-4, Claude Opus, Gemini Ultra | Different planet |
| Large | 7B–70B | Llama 3 70B, Mistral 7B | Needs a big GPU |
| SLM | 100M–7B | Phi-3-mini 3.8B, SmolLM2 1.7B, Gemma 2-2B | Inference on a laptop: yes |
| Tiny | 1M–100M | TinyStories 1M–33M (SmolLM2-135M sits just above) | You'll train this |
"Is what we're building a real LLM?" — yes. Same transformer architecture, same tokenizer, same training loop, same evaluation procedure. The only difference is a narrower range of things it can say. A 1M model trained on TinyStories writes short children's stories. It can't write code. That's what "small" means in practice.
2. Why small models came back — three forces
From 2020 to 2022 the mood in ML was scale, scale, scale. After GPT-3 (175B), everyone raced toward bigger models. Then around late 2023 something shifted: models 10× smaller started matching the same capabilities. Three forces hit at the same time.
Force 1. Data quality partially replaces scale
Microsoft's Phi series (2023–2024) is the clearest example. Phi-1 (1.3B) was trained on "textbook-quality" code data, a heavily filtered subset of web code plus synthetic textbooks and exercises generated with GPT-3.5, and beat much larger general models on HumanEval. Phi-2 (2.7B) and Phi-3-mini (3.8B) extended the same idea.
"Textbooks Are All You Need" — Gunasekar et al., 2023, the title of the Phi-1 paper
The core claim: for a fixed compute budget, a smaller model trained longer on carefully curated data can beat a larger model trained on raw web text. That means deliberately training past the Chinchilla (2022) "compute-optimal" ratio of roughly 20 tokens per parameter. Over-training became the SLM standard.
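The over-training idea is easy to quantify. The sketch below uses the common rule-of-thumb reading of Chinchilla (about 20 training tokens per parameter) and the 3.3T-token figure reported in the Phi-3 technical report. The 10M-parameter run with 1B tokens is a hypothetical example, not this book's training budget.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb derived from Hoffmann et al. (2022)


def compute_optimal_tokens(n_params: float) -> float:
    """Token count the rule of thumb calls compute-optimal for this model size."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


def overtraining_factor(n_params: float, n_tokens: float) -> float:
    """How many times past the compute-optimal token count a run goes."""
    return n_tokens / compute_optimal_tokens(n_params)


# Phi-3-mini: 3.8B parameters, 3.3T tokens (as reported in the Phi-3 report).
print(f"Phi-3-mini: {overtraining_factor(3.8e9, 3.3e12):.1f}x compute-optimal")

# Hypothetical tiny run: 10M parameters, 1B tokens.
print(f"10M model : {overtraining_factor(10e6, 1e9):.1f}x compute-optimal")
```

A factor well above 1 means the run spends extra training compute to get a model that is cheaper to serve, which is the trade SLMs make on purpose.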
Force 2. Synthetic data went mainstream
TinyStories (Eldan & Li, 2023) was the breakthrough. A 1M-parameter model trained only on synthetic children's stories generated by GPT-3.5 and GPT-4 produced coherent narratives. It proved you could go that small and still have something that worked.
What followed:
- Cosmopedia (Hugging Face, 2024): about 25B tokens of synthetic textbooks, blog posts, and stories
- FineWeb-Edu (Hugging Face, 2024): 1.3T tokens filtered from web crawls by an "educational value" score
- Phi-3's synthetic data fraction (exact ratio undisclosed, but "substantial")
The era of pouring raw web dumps into a model is ending. What you feed it matters as much as model size.
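Score-based filtering of the FineWeb-Edu kind boils down to "score every document, keep the ones above a threshold". The toy sketch below shows only that shape. The `Doc` class, the documents, and their scores are all made up for illustration. In the real pipeline the score comes from a trained classifier, and the 0 to 5 scale with a cutoff of 3 reflects my reading of the dataset card, so check it before relying on it.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    edu_score: float  # hypothetical classifier output on a 0-5 scale


def filter_by_score(docs, threshold=3.0):
    """Keep only documents whose educational-value score meets the threshold."""
    return [d for d in docs if d.edu_score >= threshold]


# Made-up documents and scores, for illustration only.
docs = [
    Doc("Photosynthesis converts light into chemical energy.", 4.2),
    Doc("BUY NOW!!! limited offer click here", 0.3),
    Doc("A triangle has three sides; its angles sum to 180 degrees.", 3.6),
    Doc("lol same", 0.8),
]
kept = filter_by_score(docs)
print(len(kept), "of", len(docs), "documents kept")
```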
Force 3. Distillation became standard practice
You use a big model to teach a small one. Gemma 2-2B (Google, 2024) was distilled from a larger Gemma 2 model. SmolLM2 (Hugging Face, 2024–2025) also leaned on synthetic and distilled data. Distillation is now something you do by default when building a small model.
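As a rough illustration of what "teaching" means, the classic soft-label formulation has the student match the teacher's temperature-softened output distribution at each token position. The sketch below is that textbook objective in plain Python for a single position. It is not a description of how Gemma 2 or SmolLM2 were actually trained, and the logits are made-up numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Made-up logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
student_close = [3.5, 1.2, 0.4]
student_far = [0.5, 1.0, 4.0]
print(f"close student: {distillation_loss(student_close, teacher):.4f}")
print(f"far student  : {distillation_loss(student_far, teacher):.4f}")
```

The loss is zero only when the student reproduces the teacher's full distribution, so the student learns from the teacher's second and third guesses as well as its top choice.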
This book doesn't implement distillation (the name comes up, nothing more). Instead you'll directly experience two of the three forces: synthetic data (TinyStories) + intentional over-training.
3. Where things stand — SLM coordinates in 2026
| Model | Parameters | Released | Strengths | Weaknesses |
|---|---|---|---|---|
| Phi-3-mini | 3.8B | 2024-04 | "Textbook data" effect, strong reasoning | Weak on non-English |
| Phi-3.5-mini | 3.8B | 2024-08 | 128K context, better multilingual | Still English-centric |
| SmolLM2-135M / 360M / 1.7B | 0.135–1.7B | 2024-11 | Laptop inference, open training recipe | More hallucination at smaller sizes |
| MobileLLM | 125M / 350M | 2024-02 (paper, Meta) | Sub-billion architecture research (deep & thin) | Research only, not for general use |
| Gemma 2-2B | 2B | 2024-07 | Distilled from a larger sibling | License restrictions (Gemma license) |
| Llama 3.2-1B / 3B | 1B / 3B | 2024-09 | Mobile target, tool calling | 1B is weak at reasoning |
The 10M model you'll build doesn't appear anywhere in this table — it's much smaller than even SmolLM2-135M. That makes SmolLM2-135M / TinyStories-33M the comparison targets: if you feed the same data (TinyStories), do you get similar story quality? That's your sanity check.
4. Minimal example — run SmolLM2-135M
Before building anything, let's see what a model at this scale actually produces. SmolLM2-135M takes about 30 seconds to load on Colab.
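A minimal way to try this, assuming the base checkpoint published on the Hugging Face Hub as `HuggingFaceTB/SmolLM2-135M` and a working install of `transformers` and `torch`. The prompt, `temperature`, and `top_p` values are illustrative choices, not settings taken from this chapter.

```python
MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # base (non-instruct) checkpoint on the Hub


def weights_size_mb(n_params: int, bytes_per_param: int = 4) -> float:
    """Memory for the weights alone, in MB (10**6 bytes)."""
    return n_params * bytes_per_param / 1e6


def generate_story(prompt: str, max_new_tokens: int = 60) -> str:
    """Load SmolLM2-135M on CPU and sample one continuation of `prompt`."""
    # Imported here so the sizing helper above works without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,    # probabilistic sampling: each run differs
        temperature=0.8,   # assumed value
        top_p=0.95,        # assumed value
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(f"fp32 weights: about {weights_size_mb(135_000_000):.0f} MB")
```

In a Colab cell, `print(generate_story("Once upon a time"))` downloads the checkpoint on first use and prints one sampled continuation.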
- Only `transformers` and `torch` needed. Works on free Colab CPU.
- 135M × 4 bytes ≈ 540 MB. Fits easily in free Colab RAM (12 GB).
Typical output:
Once upon a time, there was a little girl who lived in a small village. She loved to play in the fields and chase butterflies. One day, she found a small kitten under a tree...
Three things to notice:
- English stories are fluent: SmolLM2's training mix is overwhelmingly English web and educational text, and simple narrative like this is well covered by it.
- Feed it a non-English prompt and the tokens exist but the output breaks down. Training data was overwhelmingly English.
- Run the same prompt five times and you get five different continuations — sampling is probabilistic (covered in Part 5).
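The last point is easy to see with a toy next-token distribution. This sketch uses a made-up four-word vocabulary and made-up logits. Greedy decoding always picks the highest-scoring token, while sampling draws from the softmax distribution, so repeated runs can differ.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Index of the highest-scoring token: always the same answer."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature=1.0, rng=random):
    """Draw one token index at random according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


vocab = ["kitten", "dragon", "river", "cloud"]  # toy vocabulary
logits = [2.0, 1.5, 0.5, 0.1]                   # made-up scores
rng = random.Random(0)
print("greedy :", [vocab[greedy(logits)] for _ in range(5)])
print("sampled:", [vocab[sample(logits, 0.8, rng)] for _ in range(5)])
```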
5. Hands-on tutorial — compare three sizes
Feed the same prompt to SmolLM2 at 135M, 360M, and 1.7B to see where "makes sense" begins. Colab T4 can hold 1.7B in RAM (4 bytes × 1.7B ≈ 6.8 GB).
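One way to set up the comparison, as a sketch rather than the chapter's original notebook. It assumes the three base checkpoints published under `HuggingFaceTB/` on the Hugging Face Hub. The prompt and `max_new_tokens` are illustrative choices.

```python
import gc

MODEL_IDS = [
    "HuggingFaceTB/SmolLM2-135M",
    "HuggingFaceTB/SmolLM2-360M",
    "HuggingFaceTB/SmolLM2-1.7B",
]
PROMPT = "The reason small language models came back is"  # assumed prompt


def weights_size_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


def compare_sizes(prompt: str = PROMPT, max_new_tokens: int = 40) -> dict:
    """Greedy-decode the same prompt with each model, one model at a time."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    results = {}
    for model_id in MODEL_IDS:
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.bfloat16  # 2 bytes per weight
        ).to(device)
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            output_ids = model.generate(
                **inputs, max_new_tokens=max_new_tokens, do_sample=False  # greedy
            )
        results[model_id] = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Release this model before loading the next, larger one.
        del model
        gc.collect()
        if device == "cuda":
            torch.cuda.empty_cache()
    return results
```

Calling `compare_sizes()` returns a dict mapping each model id to its continuation, so you can print the three outputs side by side.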
- `bfloat16` halves memory. T4 prefers fp16, but SmolLM2 is bf16-compatible.
- Greedy decoding: deterministic for fair comparison.
- Free memory before loading the next model.
What to expect:
- 135M — "the reason small language models came back is the reason small language models..." — repetition loops are common.
- 360M — one plausible-sounding sentence, then topic drift.
- 1.7B — "...because of better data curation, distillation, and a focus on quality over quantity." — conceptually correct answers appear much more often.
Observation 1: Capability increases smoothly with size, but there's a threshold below which sentences stop being coherent. For general English text that threshold is roughly 300M–500M.
Observation 2: Your 10M model won't come close to these outputs. But restrict the domain to TinyStories children's stories and 10M is enough to write a coherent page-long story. That's what Eldan & Li demonstrated.
6. Common pitfalls
1. "Small models work like a general LLM." They don't. SLMs have a narrow domain and hallucinate frequently. You can't drop one into a chatbot backend as-is — they're typically limited to classification, extraction, or short controlled generation.
2. Context windows are tiny. SmolLM2-135M has a 2K token context (Phi-3.5 is an exception at 128K). You can't stuff long documents into these models for RAG.
3. Non-English is weak. Training data is overwhelmingly English. If you want good Korean (or another language), you need to continue training on your own data. The capstone shows you one path to do that.
4. "X GB of RAM fits an XB model" is an oversimplification. The KV cache adds memory on top of model weights at inference time. A 1.7B model at fp16 is 3.4 GB, but an 8K context adds roughly 1 GB of KV cache on top. (Ch 11 works through the exact math.)
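The KV cache figure in pitfall 4 follows from a simple product: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses an assumed shape for a roughly 1.7B-parameter model (24 layers, 32 KV heads, head dimension 64). These numbers are illustrative and were not read from any real model config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the key/value cache for one forward context, in bytes."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Assumed shape for a ~1.7B model; fp16 values, 8K-token context.
cache = kv_cache_bytes(n_layers=24, n_kv_heads=32, head_dim=64, seq_len=8192)
weights = 1.7e9 * 2  # fp16 weights

print(f"weights : {weights / 1e9:.1f} GB")
print(f"KV cache: {cache / 1e9:.1f} GB at 8K context")
```

With these assumed values the cache comes to about 1.6 GB, the same order of magnitude as the rough figure above. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.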
7. Production checklist
| Device | Inference (approx.) | Training from scratch |
|---|---|---|
| Mobile (4 GB) | 135M – 1B (int4) | Nearly impossible |
| Laptop CPU/M2 | 1B – 7B (int4/int8) | 1M – 30M (this book's range) |
| Colab T4 (16 GB) | 1B – 13B (int4) | 30M – 200M |
| Single A100 (80 GB) | 70B (int4) | 7B |
Training uses 6–12× more memory than inference (gradients + Adam state). The real ceiling for training on a laptop is around 30M parameters. That's why this book uses 10M as the baseline.
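The multiplier can be sanity-checked with a bytes-per-parameter budget. This is a sketch assuming plain fp32 training with Adam (weights, gradients, and two moment estimates) and ignoring activations, which depend on batch size and sequence length.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def inference_weights_gb(n_params: float, precision: str) -> float:
    """Memory for the weights alone at the given precision, in GB."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def training_state_gb(n_params: float) -> float:
    """fp32 weights + fp32 gradients + two fp32 Adam moments = 16 bytes/param.

    Activations are extra and depend on batch size and sequence length.
    """
    return n_params * (4 + 4 + 8) / 1e9


for label, n_params in [("10M", 10e6), ("30M", 30e6)]:
    print(f"train {label}: {training_state_gb(n_params):.2f} GB before activations")

for label, n_params in [("7B", 7e9), ("70B", 70e9)]:
    print(f"int4 inference {label}: {inference_weights_gb(n_params, 'int4'):.1f} GB of weights")
```

Sixteen bytes per parameter against two bytes for fp16 inference is an 8× gap, which sits inside the 6–12× range once activations and framework overhead are added.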
8. Exercises
- Feed SmolLM2-135M, 360M, and 1.7B the same non-English prompt (try "Once upon a time in a small village" in any language you know). Observe at which size the output breaks down, and summarize your finding in one sentence.
- Phi-3-mini (3.8B) and SmolLM2-1.7B sit in the same size tier, yet their strengths and weaknesses differ. Based on their papers or blog posts, write a paragraph on how their training data compositions differ.
- Add your own laptop to the tier table in §1. Based on its RAM and CPU/GPU, which row fits?
- (Think about it) If distillation improved enough that a 1M model could reach 80% of GPT-4's capability, which chapters in this book would lose relevance — and which would become more important?
Sources
- Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759
- Gunasekar et al. (2023). Textbooks Are All You Need. (Phi-1) arXiv:2306.11644
- Abdin et al. (2024). Phi-3 Technical Report. arXiv:2404.14219
- Liu et al. (2024). MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. (Meta) arXiv:2402.14905
- Hoffmann et al. (2022). Training Compute-Optimal Large Language Models. (Chinchilla) arXiv:2203.15556
- Hugging Face SmolLM2 blog (2024–2025) · Cosmopedia / FineWeb-Edu dataset cards
- Gemma Team. (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118