What Your Laptop Can Do
What you'll learn
- Translate the memory, compute, and time budget of a laptop, a Colab T4, or a single workstation GPU into a model size and token count
- The Chinchilla law in one sentence — and why SLMs intentionally violate it
- The baseline math for this book: 10M model + 200M tokens = ~4 hours on M2 / T4
1. Concept — three axes you must pass at once
Training is feasible only if you clear all three at once.
| Axis | What determines it | Failure signal |
|---|---|---|
| Memory | Model size + optimizer state + activations | OOM (out of memory) |
| Compute (FLOPs) | Token count × model size | Training that never finishes |
| Time | A function of the two above | Blows past your deadline |
The classic trap: memory passes, time doesn't. You can load a 10B quantized model on a laptop for inference — but training it is another matter. Check all three before you start.
2. Why it matters — "just try it" doesn't work here
Once training starts, any failure costs you everything since the last checkpoint. Run 12 hours on a laptop without checkpoints, hit an OOM, and you lose 12 hours. A five-minute estimate up front pays for itself.
There is also a ratio to respect between model size and training data. A model starved of data (under-trained) underperforms. A tiny model fed far more data than it can absorb saturates: the extra tokens cost time and return little. The balance matters.
3. Where it's used — the Chinchilla law and SLMs' deliberate departure
Chinchilla (Hoffmann et al., 2022) in one line
Given a fixed compute budget, scale model size N and training tokens D together. The empirical optimum is roughly:
$$D \approx 20N$$
(1B parameters → ~20B tokens is compute-optimal.)
Why SLMs ignore this
Apply Chinchilla to the 10M model in this book: 200M tokens. That actually matches 20×. But for most deployed SLMs, inference cost matters more than training cost — so it's worth spending more on training to squeeze more capability out of each inference token. That's why SLMs deliberately over-train:
| Model | Parameters | Training tokens | Ratio | Approach |
|---|---|---|---|---|
| Chinchilla 70B | 70B | 1.4T | 20× | Compute-optimal |
| Llama 3 8B | 8B | 15T | ~1900× | Heavy over-training |
| SmolLM2 1.7B | 1.7B | 11T | ~6500× | Heavier over-training |
| This book 10M | 10M | 200M | 20× | Balanced (time-constrained) |
This book stops at 20× only because of time. If you have more hours, push to 100× — the model keeps improving.
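The ratios in the table above can be recomputed directly from the parameter and token counts. The following is a minimal sketch; the helper names `tokens_per_param` and `chinchilla_tokens` are illustrative, not part of any library, and the factor of 20 is the approximate rule of thumb quoted above.

```python
# Tokens-per-parameter ratios for the models in the table above.
CHINCHILLA_RATIO = 20  # approximate compute-optimal tokens per parameter

# (parameters, training tokens), as listed in the table
models = {
    "Chinchilla 70B": (70e9, 1.4e12),
    "Llama 3 8B": (8e9, 15e12),
    "SmolLM2 1.7B": (1.7e9, 11e12),
    "This book 10M": (10e6, 200e6),
}


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def chinchilla_tokens(n_params, ratio=CHINCHILLA_RATIO):
    """Approximate compute-optimal token count for a given model size."""
    return ratio * n_params


for name, (n, d) in models.items():
    r = tokens_per_param(n, d)
    print(f"{name:<16} {r:>6.0f}x  ({r / CHINCHILLA_RATIO:.0f}x the Chinchilla ratio)")
```

Changing `ratio` to 100 in `chinchilla_tokens` gives the token budget for the over-trained variant mentioned above.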
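4. Minimal example — memory math in 30 seconds
Memory during training (rough formula):
$$\text{memory} \approx \text{params} + \text{grads} + \text{optimizer state} + \text{activations}$$
Each term in bytes, where N is the parameter count: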
| Term | bf16/fp16 | fp32 |
|---|---|---|
| params | 2N | 4N |
| grads | 2N | 4N |
| Adam m (1st moment) | 4N | 4N |
| Adam v (2nd moment) | 4N | 4N |
| Subtotal (params + grads + Adam) | 12N; 16N with an fp32 master copy of the params | 16N |
| activations | batch × seq × hidden (function of shape) | same |
With bf16 + Adam + grads, expect roughly 14–16 bytes/param plus activations. The examples below use 16 bytes/param:
- 10M model → params/grads/optimizer ≈ 160 MB + 100–500 MB activations → under 1 GB total. M2 (16 GB) and T4 (16 GB) both comfortable.
- 125M model (GPT-2 small) → ~2 GB + 1–3 GB activations → 3–5 GB. Fine on a T4, too much for a mobile device.
- 1B model → ~16 GB + activations → 20 GB+. A T4 (16 GB) can't train it; it needs an A100.
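The weight-related part of these examples is a single multiplication. The following is a minimal sketch of one way to write it; the function name `training_memory_gb` matches the helper used in the checklist later in this chapter, but the body, the default of 16 bytes per parameter, and the `activations_gb` argument are assumptions made for illustration.

```python
def training_memory_gb(n_params, bytes_per_param=16, activations_gb=0.0):
    """Rough training footprint in GB (1 GB = 1e9 bytes).

    bytes_per_param=16 is the conservative bf16 + Adam figure used in the
    examples above. activations_gb is a separate estimate supplied by the
    caller, because it depends on batch size and sequence length rather
    than on the parameter count.
    """
    return n_params * bytes_per_param / 1e9 + activations_gb


for label, n in [("10M", 10e6), ("125M", 125e6), ("1B", 1e9)]:
    print(f"{label:>5}: {training_memory_gb(n):6.2f} GB before activations")
```

Passing `bytes_per_param=14` gives the lower end of the 14–16 bytes/param range.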
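5. Hands-on tutorial — baseline math for this book
Time = total FLOPs / device throughput
FLOPs per token for one training step (forward + backward):
$$\text{FLOPs per token} \approx 6N$$
(forward ≈ 2N, backward ≈ 4N; standard approximation from Kaplan et al., 2020)
Total FLOPs C for a training run over D tokens:
$$C \approx 6ND$$
For this book (10M params, 200M tokens):
$$C \approx 6 \times 10^{7} \times 2 \times 10^{8} = 1.2 \times 10^{16}\ \text{FLOPs}$$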
Effective throughput per device (real mixed-precision training, including memory bandwidth and data loading — typically 30–50% of the spec):
| Device | Spec (TFLOPS, fp16/bf16) | Effective (TFLOPS) | This book's training time |
|---|---|---|---|
| M2 (CPU) | ~0.5 | 0.2 | ~17 hours |
| M2 Pro (MPS, GPU cores) | ~7 | 3 | ~1.1 hours |
| Colab T4 | 65 | 20 | ~10 minutes |
| Colab A100 | 312 | 150 | ~1.3 minutes |
The gap between spec and effective throughput comes from data loading, memory bandwidth, and non-tensor operations. 30–50% is a conservative estimate.
Baseline: M2 Pro MPS or Colab T4 finishes the book's baseline run in tens of minutes to one hour. The "4 hours" in the title is a conservative estimate that includes toolchain setup, evaluation, and debugging.
Compute it yourself
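A minimal sketch of the calculation, assuming the 6 × N × D estimate and the effective-throughput figures from the table above. The name `hours_to_train` matches the helper referenced in the checklist later in this chapter; the body and the `runs` list are assumptions made for illustration.

```python
def hours_to_train(n_params, n_tokens, effective_tflops):
    """Estimated wall-clock hours: total FLOPs / effective throughput."""
    total_flops = 6 * n_params * n_tokens  # forward (2N) + backward (4N) per token
    seconds = total_flops / (effective_tflops * 1e12)
    return seconds / 3600


# (model label, params, token label, tokens, device, effective TFLOPS)
runs = [
    ("10M", 10e6, "200M", 200e6, "M2 Pro MPS", 3),
    ("10M", 10e6, "200M", 200e6, "T4", 20),
    ("30M", 30e6, "600M", 600e6, "T4", 20),
    ("125M", 125e6, "2.5B", 2.5e9, "T4", 20),
    ("125M", 125e6, "2.5B", 2.5e9, "A100", 150),
]

for n_label, n, d_label, d, device, tflops in runs:
    print(f"{n_label} · {d_label} · {device} {hours_to_train(n, d, tflops):.2f} h")
```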
Expected output:
10M · 200M · M2 Pro MPS 1.11 h
10M · 200M · T4 0.17 h
30M · 600M · T4 1.50 h
125M · 2.5B · T4 26.04 h
125M · 2.5B · A100 3.47 h
Takeaways:
- The book's baseline (10M · 200M) finishes in about an hour or less on any GPU in the table.
- Pushing to 30M · 600M (same Chinchilla ratio) still fits on a free Colab T4 in 1.5 hours, well under the 12-hour session limit.
- 125M (GPT-2 small) strains the T4: at 2.5B tokens and 20 effective TFLOPS, 6 × N × D works out to about 26 hours, more than two 12-hour free-tier sessions. You'd want an A100 or multiple T4s.
6. Common pitfalls
1. Forgetting activation memory. "params + grads + Adam ≈ 14–16N" covers only the weight-related memory. Increase batch size without checking activations and you'll OOM. Activations scale roughly as batch × seq × hidden × ~12 per layer, plus the batch × seq × vocab logits. At seq=512, hidden=256, batch=16, activations alone can reach ~1.5 GB.
2. Trusting spec FLOPs. A100 spec is 312 TFLOPS for fp16 dense matrix multiplication at peak. Real training is dominated by memory bandwidth, communication, and data loading. Effective ≈ spec × 0.4 is safe.
3. Free Colab session cuts. T4 availability isn't guaranteed. Sessions drop frequently during long runs. Without checkpoints (Ch 14), a 7-hour run lost to a disconnect is just 7 hours gone.
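4. PyTorch MPS op fallback on Apple Silicon. Some ops, particularly newer attention variants, are not implemented on MPS and can fall back to CPU (PyTorch controls this with the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable). When that happens, those ops run at roughly 1/100 of the ~7 TFLOPS GPU rate. `torch.backends.mps.is_available()` only confirms that the backend is usable; it does not reveal fallbacks. Before a long run, measure actual tokens/sec over a 100-step warmup.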
5. Confusing token count with step count. steps = D_tokens / (batch × seq_len × grad_accum). Chinchilla's law is stated in tokens D, not steps.
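The conversion in pitfall 5 can be sketched as follows. The function name `optimizer_steps` and the batch settings in the example are hypothetical, chosen only to show how gradient accumulation changes the step count for the same token budget.

```python
import math


def optimizer_steps(total_tokens, batch_size, seq_len, grad_accum=1):
    """Number of optimizer steps needed to consume total_tokens.

    Each optimizer step sees batch_size * seq_len * grad_accum tokens.
    The result is rounded up so the final partial step is counted.
    """
    tokens_per_step = batch_size * seq_len * grad_accum
    return math.ceil(total_tokens / tokens_per_step)


# Same 200M-token budget, with and without gradient accumulation.
print(optimizer_steps(200_000_000, batch_size=16, seq_len=512))
print(optimizer_steps(200_000_000, batch_size=16, seq_len=512, grad_accum=4))
```

The token budget stays at 200M in both calls; only the number of optimizer updates differs, which is why schedules and checkpoints defined in steps need to be recomputed whenever the batch settings change.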
7. Production checklist
30-second pre-training checklist:
- Memory: `training_memory_gb(N)` + activation estimate + 30% margin → fits in device RAM?
- Time: `hours_to_train(N, D, tflops)` → fits in your schedule?
- Checkpoints: saving every 30 minutes or 1,000 steps? (Ch 14)
- Colab: saved to mounted Drive in case of disconnect? Aware of the 12-hour free-tier limit?
- Data: D tokens tokenized, shuffled, and cached?
- Eval set: 1–2% held out separately? (Part 5)
If the math doesn't pass:
- Memory over budget → smaller model or smaller batch + gradient accumulation (Ch 13)
- Time over budget → reduce both model and token count or rent an A100 (Colab Pro)
- Both tight → improve data quality to need fewer tokens (Part 2 Ch 7)
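The memory and time checks can be combined into one pre-flight function that reports which axis fails. This is a sketch under stated assumptions: the name `preflight`, the 0.5 GB default activation estimate, and the returned dictionary layout are all hypothetical, while the 16 bytes/param, 6 × N × D, and 30% margin figures come from earlier in this chapter.

```python
def preflight(n_params, n_tokens, effective_tflops, device_gb, deadline_hours,
              activations_gb=0.5, bytes_per_param=16, margin=0.30):
    """Check the memory and time axes before starting a run.

    Memory: (weights, grads, optimizer + activations) with a safety margin.
    Time:   6 * N * D FLOPs divided by effective throughput.
    """
    memory_gb = (n_params * bytes_per_param / 1e9 + activations_gb) * (1 + margin)
    hours = 6 * n_params * n_tokens / (effective_tflops * 1e12) / 3600
    return {
        "memory_gb": memory_gb,
        "hours": hours,
        "memory_ok": memory_gb <= device_gb,
        "time_ok": hours <= deadline_hours,
    }


# Book baseline on a T4 (16 GB, 20 effective TFLOPS, 12-hour session).
print(preflight(10e6, 200e6, effective_tflops=20, device_gb=16, deadline_hours=12))

# 1B parameters at the 20x ratio on the same T4: both axes fail.
print(preflight(1e9, 20e9, effective_tflops=20, device_gb=16, deadline_hours=12))
```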
8. Exercises
- Write down your laptop's RAM, CPU, and GPU specs. Use `training_memory_gb` to estimate the largest model you can train. Include a 30% activation margin.
- How long would 30M model + 600M tokens (Chinchilla ratio) take on your hardware? Use `hours_to_train`. Does it fit in 12 hours?
- The 30–50% effective-vs-spec gap is a rule of thumb. Run a small training job (10M model, 1,000 steps) on your device and measure actual tokens/sec. What percentage of spec does it reach?
- (Think about it) What happens to memory and time if you intentionally over-train at 100× (10M model, 1B tokens)? Which axis hits its limit first?
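For the tokens/sec exercise, a measured rate can be converted back into effective TFLOPS with the same 6N-per-token estimate. The following is a minimal sketch; the function names are illustrative, and the 50,000 tokens/sec figure is a made-up measurement used only to show the arithmetic.

```python
def effective_tflops(n_params, tokens_per_sec):
    """Effective training throughput implied by a measured token rate."""
    return 6 * n_params * tokens_per_sec / 1e12


def percent_of_spec(n_params, tokens_per_sec, spec_tflops):
    """Measured throughput as a percentage of the device's spec TFLOPS."""
    return 100 * effective_tflops(n_params, tokens_per_sec) / spec_tflops


# Hypothetical measurement: 10M-parameter model at 50,000 tokens/sec
# on a device with a 7 TFLOPS spec.
print(f"{effective_tflops(10e6, 50_000):.1f} TFLOPS effective")
print(f"{percent_of_spec(10e6, 50_000, spec_tflops=7):.0f}% of spec")
```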
Next
Now you know your hardware ceiling. The next chapter surveys the existing open-weight SLMs — their sizes, dense vs MoE structure, and where the 10M model you'll build sits among them.
Next → Ch 4 The Open-Weight SLM Landscape
Sources
- Hoffmann et al. (2022). Training Compute-Optimal Large Language Models. (Chinchilla) arXiv:2203.15556
- Kaplan et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361 (source of the 6N approximation)
- HuggingFace SmolLM2 blog (over-training ratios)
- Llama 3 model card (8B / 15T tokens)
- PyTorch MPS backend docs (Apple Silicon)