Curriculum

32 chapters + a capstone. Ten weeks of core work, twelve weeks total. Sized to run on an M1/M2 Mac or a Colab T4.

Part 1. Why Small Models (4 chapters)

#	Title	What you'll do
1	The Return of Small Models	Phi-3, SmolLM2, MobileLLM trajectory. Why "bigger is always better" broke in 2024.
2	What Differs from the API	What API callers never see. What you gain by running the model yourself.
3	What Your Laptop Can Do	Memory, compute, and time budgets. Colab T4 vs M2 vs A100 in one table.
4	The Open-Weight SLM Landscape — Size, Dense, MoE	Why 135M/360M/1.7B/3B. What dense vs MoE actually means.

Part 2. Data & Tokenizer (3 chapters)

#	Title	What you'll do
5	TinyStories and Synthetic Data	Eldan & Li 2023, Cosmopedia, how synthetic data makes a 1M model speak.
6	Training a BPE Tokenizer	Build an 8K-vocab tokenizer with the `tokenizers` library. Korean pitfalls.
7	Quality Beats Size	Lessons from the Phi series, FineWeb-Edu, filtering, de-duplication.

Part 3. Transformer by Hand (4 chapters)

#	Title	What you'll do
8	Attention Revisited	Scaled dot-product, causal mask, `F.scaled_dot_product_attention` in one line.
9	Modern Blocks: RoPE, RMSNorm, SwiGLU, GQA	Why RMSNorm instead of LayerNorm? Why SwiGLU instead of GELU? GQA memory savings.
10	nanoGPT in 100 Lines	GPT-mini from scratch, Karpathy-style.
11	Parameter and Memory Math	"10M = how much memory?" — activation, gradient, optimizer state arithmetic.
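The "10M = how much memory?" question from chapter 11 can be sketched with a few lines of arithmetic. This is a rough estimate under assumptions that are not stated in the curriculum: everything is stored in fp32, the optimizer is AdamW with two moment buffers per parameter, and activation memory (which depends on batch size and sequence length) is left out. The function name `training_state_bytes` is hypothetical.

```python
# Rough training-memory estimate for weights, gradients, and AdamW state.
# Assumptions (not from the curriculum): fp32 everywhere, activations excluded.

BYTES_FP32 = 4


def training_state_bytes(n_params: int) -> dict:
    """Return bytes needed for weights, gradients, and AdamW moments."""
    weights = n_params * BYTES_FP32
    grads = n_params * BYTES_FP32
    # AdamW keeps two fp32 moment estimates (m and v) per parameter.
    adam_state = 2 * n_params * BYTES_FP32
    return {
        "weights": weights,
        "grads": grads,
        "adam_state": adam_state,
        "total": weights + grads + adam_state,
    }


est = training_state_bytes(10_000_000)
for name, n_bytes in est.items():
    print(f"{name:>10}: {n_bytes / 1e6:.0f} MB")
```

Under these assumptions the fixed cost is 16 bytes per parameter, so the model state itself is small; activations are usually what decides whether a run fits on a laptop.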

Part 4. Training on Your Laptop (4 chapters)

#	Title	What you'll do
12	Training Loop and AdamW	step → grad → optimizer, cosine schedule, warmup.
13	Mixed Precision and Gradient Accumulation	bf16/fp16, `autocast`, simulating large batches on small GPUs.
14	Loss Curves and Checkpoints	Diagnosing healthy vs broken curves. Resumable saves.
15	A Four-Hour Training Run	TinyStories 200M tokens → 10M model, all the way through.
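The cosine schedule with warmup listed for chapter 12 could look roughly like the sketch below. The function name `lr_at` and all default values (peak and floor learning rates, step counts) are illustrative assumptions, not settings taken from the book.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at a given step: linear warmup, then cosine decay.

    All default values are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


for s in (0, 50, 99, 100, 550, 999, 1000):
    print(s, f"{lr_at(s):.2e}")
```

In a training loop, the returned value would be written into the optimizer's parameter groups before each step.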

Part 5. Evaluation & Analysis (3 chapters)

#	Title	What you'll do
16	Beyond Perplexity	Why PPL alone isn't enough. A generation-sample review protocol.
17	Building a Tiny Benchmark	HellaSwag-tiny, domain probes, pass@k mini.
18	Peeking at Attention and Logits	Per-head attention visualization. Top-k logit tracing.
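As background for chapter 16, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below is a minimal illustration, with a hypothetical `perplexity` helper and made-up inputs; it uses the 8K vocabulary size from chapter 6 (assumed here to be 8192) to show the baseline an untrained, uniform model would score.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns uniform probability over an 8192-token vocabulary
# pays ln(8192) nats per token, so its perplexity equals the vocab size.
vocab_size = 8192  # assumption: "8K" taken as 2**13
uniform_nlls = [math.log(vocab_size)] * 5
print(round(perplexity(uniform_nlls)))
```

Because it is an average, two models with the same perplexity can still differ a lot in what they generate, which is one reason a sample review sits alongside it.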

Part 6. Inference & Deployment (3 chapters)

#	Title	What you'll do
19	Quantization Basics	int8/int4, symmetric/asymmetric, one PTQ pass.
20	llama.cpp and GGUF	HF-compatible model → GGUF, `llama-cli`, and the conversion boundary for custom architectures.
21	Wrap Up with a Small Chatbot	CLI conversation loop, system prompt, sampling parameters.
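The symmetric int8 case from chapter 19 can be illustrated in plain Python. This is a per-tensor sketch under simplifying assumptions (one scale for the whole list, range clamped to [-127, 127]); the function names and sample weights are hypothetical, and real post-training quantization usually works per channel or per block.

```python
def quantize_symmetric_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Symmetric: zero maps to zero, so only a scale is needed (no zero-point).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up sample values
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The rounding error per value is bounded by half the scale, which is why a single large outlier in a tensor hurts every other weight that shares its scale.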

Part 7. Fine-tuning in Practice (7 chapters)

Parts 1–6 give you a model built from scratch. Part 7 applies that knowledge to existing models, fitting them to your domain. It is the direct path to NER, classification, summarization, and inverse text normalization (ITN).

#	Title	What you'll do
22	Choosing and Using an Off-the-Shelf sLLM	Phi-3 / SmolLM2 / Gemma 2 / Qwen2.5 / Llama 3.2 comparison + decision tree.
23	From Scratch vs Fine-tuning	Decision tree. Laptop-feasible fine-tuning size math.
24	LoRA / QLoRA Basics	Low-rank intuition + 30-minute LoRA on Qwen2.5-0.5B. QLoRA 4-bit base.
25	Classification & NER Fine-tuning (Encoder)	Domain entity extraction with KoELECTRA/mBERT.
26	Domain Summarization & Generation (Decoder LoRA + Continued Pre-training)	Qwen2.5-0.5B-Instruct LoRA + continued pre-training.
27	Distillation Mini	Teacher (1.7B) → Student (135M) SFT. The path SmolLM2 and Gemma 2 actually took.
28	Seq2seq Mini — ITN	byT5/T5-small + synthetic pairs. One pass through encoder-decoder.

DPO and RLHF are out of scope — see the sister book AI Assistant Engineering Part 7.
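The low-rank intuition behind chapter 24 comes down to a parameter count. LoRA freezes a weight matrix of shape d_out × d_in and trains two small factors of shapes d_out × r and r × d_in instead. The sketch below compares the two counts; the helper name and the layer dimensions are hypothetical and not taken from any specific model.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out            # updating the dense matrix directly
    lora = rank * (d_in + d_out)   # factors B (d_out x r) and A (r x d_in)
    return full, lora


# Hypothetical layer size, chosen only for illustration.
full, lora = lora_param_counts(d_in=1024, d_out=1024, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

The saving grows with layer width, since the full count scales with the product of the dimensions while the LoRA count scales with their sum.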

Part 8. Production (4 chapters)

The model isn't what makes production hard. Data, evaluation, serving, and monitoring decide whether it survives. These four chapters are what "running in production" actually means.

#	Title	What you'll do
29	Data Pipeline — PII, Synthetic, IAA	PII masking, LLM-synthetic labels, inter-annotator agreement mini.
30	Regression, Out-of-Distribution, A/B	Regression sets, hold-out, adversarial, small A/B design.
31	Serving — llama.cpp server, vLLM, Latency Budget	p50/p95 budget, batching, concurrency. Laptop to single in-house GPU.
32	Monitoring, Feedback Loop, Cost	Hallucination, drift, feedback integration + GPU time / license / PII cost model.
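The p50/p95 budget from chapter 31 is computed from measured request latencies. The sketch below uses the nearest-rank definition of a percentile, which is one of several common conventions; the `percentile` helper and the latency samples are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 110, 480, 105, 98, 102, 130, 115, 101]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"p50={p50} ms  p95={p95} ms")
```

With only ten samples a single outlier determines p95, so a real budget needs far more measurements than this toy list.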

Capstone

My Own Domain SLM — finish one of two honest end-to-end tracks:

A · From scratch: data → BPE → GPTMini training and evaluation → reproducible PyTorch package → Hugging Face Hub
B · Compatible deployment: choose an HF-compatible sLLM → domain fine-tuning → evaluation → GGUF → llama.cpp → Hub + demo

The book does not pretend that a custom model automatically converts to GGUF. The deployment format depends on whether the converter and runtime support the architecture.