Curriculum

32 chapters + a capstone. Ten weeks of core work, twelve weeks total. Sized to run on an M1/M2 Mac or a Colab T4.

Part 1. Why Small Models (4 chapters)

# | Title | What you'll do
1 | The Return of Small Models | Phi-3, SmolLM2, MobileLLM trajectory. Why "bigger is always better" broke in 2024.
2 | What Differs from the API | What API callers never see. What you gain by running the model yourself.
3 | What Your Laptop Can Do | Memory, compute, and time budgets. Colab T4 vs M2 vs A100 in one table.
4 | The Open-Weight SLM Landscape: Size, Dense, MoE | Why 135M/360M/1.7B/3B. What dense vs MoE actually means.

Part 2. Data & Tokenizer (3 chapters)

# | Title | What you'll do
5 | TinyStories and Synthetic Data | Eldan & Li 2023, Cosmopedia, and how synthetic data makes a 1M-parameter model speak.
6 | Training a BPE Tokenizer | Build an 8K-vocab tokenizer with the tokenizers library. Korean pitfalls.
7 | Quality Beats Size | Lessons from the Phi series, FineWeb-Edu, filtering, de-duplication.

Part 3. Transformer by Hand (4 chapters)

# | Title | What you'll do
8 | Attention Revisited | Scaled dot-product, causal mask, F.scaled_dot_product_attention in one line.
9 | Modern Blocks: RoPE, RMSNorm, SwiGLU, GQA | Why RMSNorm instead of LayerNorm? Why SwiGLU instead of GELU? GQA memory savings.
10 | nanoGPT in 100 Lines | GPT-mini from scratch, Karpathy-style.
11 | Parameter and Memory Math | "10M = how much memory?" Activation, gradient, and optimizer-state arithmetic.
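
The "10M = how much memory?" question from chapter 11 can be sketched roughly as below. This is a back-of-the-envelope estimate, assuming fp32 weights and gradients plus AdamW's two fp32 moment buffers (16 bytes per parameter in total). Activations are left out because they depend on batch size, sequence length, and architecture. The function name and defaults are illustrative, not taken from the course.

```python
def training_state_bytes(n_params, weight_bytes=4, grad_bytes=4, optim_bytes=8):
    """Estimate memory for weights, gradients, and AdamW state.

    Assumes fp32 everywhere; AdamW keeps two moment buffers per
    parameter, hence 8 bytes. Activations are deliberately excluded.
    """
    return n_params * (weight_bytes + grad_bytes + optim_bytes)


n_params = 10_000_000  # the 10M model used in the curriculum
total = training_state_bytes(n_params)
print(f"{total / 1e6:.0f} MB before activations")  # 160 MB before activations
```

Under these assumptions the fixed training state of a 10M model is small; on a laptop, the activation term is usually what sets the batch size.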

Part 4. Training on Your Laptop (4 chapters)

# | Title | What you'll do
12 | Training Loop and AdamW | step → grad → optimizer, cosine schedule, warmup.
13 | Mixed Precision and Gradient Accumulation | bf16/fp16, autocast, simulating large batches on small GPUs.
14 | Loss Curves and Checkpoints | Diagnosing healthy vs broken curves. Resumable saves.
15 | A Four-Hour Training Run | TinyStories 200M tokens → 10M model, all the way through.
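
As a hedged illustration of the "cosine schedule, warmup" item in chapter 12, here is one common formulation: linear warmup followed by cosine decay to a floor. The function name, learning rates, and step counts are placeholder assumptions, not the values the course uses.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + (max_lr - min_lr) * cosine


for step in (0, 99, 550, 1000):
    print(step, f"{lr_at(step):.2e}")
```

The value would be written into the optimizer's parameter groups once per step; a nonzero floor (`min_lr`) is one design choice among several.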

Part 5. Evaluation & Analysis (3 chapters)

# | Title | What you'll do
16 | Beyond Perplexity | Why PPL alone isn't enough. A generation-sample review protocol.
17 | Building a Tiny Benchmark | HellaSwag-tiny, domain probes, pass@k mini.
18 | Peeking at Attention and Logits | Per-head attention visualization. Top-k logit tracing.

Part 6. Inference & Deployment (3 chapters)

# | Title | What you'll do
19 | Quantization Basics | int8/int4, symmetric/asymmetric, one PTQ pass.
20 | llama.cpp and GGUF | HF → GGUF conversion, running with llama-cli.
21 | Wrap Up with a Small Chatbot | CLI conversation loop, system prompt, sampling parameters.
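
To make the "symmetric" half of chapter 19 concrete, here is a minimal per-tensor symmetric int8 sketch in plain Python. It assumes a single scale for the whole list and a clamp to [-127, 127]; real toolchains typically quantize per channel or per block. The helper names are hypothetical.

```python
def quantize_symmetric_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

Each restored value lies within half a quantization step (`scale / 2`) of the original. An asymmetric scheme would add a zero-point so that the integer range does not have to be centered on zero.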

Part 7. Fine-tuning in Practice (7 chapters)

Parts 1–6 give you a model built from scratch. Part 7 applies that knowledge to existing models, fitting them to your domain. It is a direct path to NER, classification, summarization, and inverse text normalization (ITN).

# | Title | What you'll do
22 | Choosing and Using an Off-the-Shelf sLLM | Phi-3 / SmolLM2 / Gemma 2 / Qwen 2.5 / Llama 3.2 comparison + decision tree.
23 | From Scratch vs Fine-tuning | Decision tree. Laptop-feasible fine-tuning size math.
24 | LoRA / QLoRA Basics | Low-rank intuition + 30-minute LoRA on Qwen2.5-0.5B. QLoRA 4-bit base.
25 | Classification & NER Fine-tuning (Encoder) | Domain entity extraction with KoELECTRA/mBERT.
26 | Domain Summarization & Generation (Decoder LoRA + Continued Pre-training) | Qwen2.5-0.5B-Instruct LoRA + continued pre-training.
27 | Distillation Mini | Teacher (1.7B) → Student (135M) SFT. The path SmolLM2 and Gemma 2 actually took.
28 | Seq2seq Mini: ITN | byT5/T5-small + synthetic pairs. One pass through encoder-decoder.

DPO and RLHF are out of scope; see Part 7 of the sister book, AI Assistant Engineering.

Part 8. Production (4 chapters)

The model isn't what makes production hard. Data, evaluation, serving, and monitoring decide whether it survives. These four chapters are what "running in production" actually means.

# | Title | What you'll do
29 | Data Pipeline: PII, Synthetic, IAA | PII masking, LLM-synthetic labels, inter-annotator agreement mini.
30 | Regression, Out-of-Distribution, A/B | Regression sets, hold-out, adversarial, small A/B design.
31 | Serving: llama.cpp server, vLLM, Latency Budget | p50/p95 budget, batching, concurrency. Laptop to single in-house GPU.
32 | Monitoring, Feedback Loop, Cost | Hallucination, drift, feedback integration + GPU time / license / PII cost model.

Capstone

My Own Domain SLM — full cycle: data collection → BPE training → model training → evaluation → quantization → GGUF → publish to HuggingFace Hub → demo. Your model becomes the next person's "off-the-shelf sLLM."