A Four-Hour Training Run
What you'll learn
- Train a 10M-parameter model on 200M TinyStories tokens, all the way through
- The full cycle: data preprocessing → tokenizer → training → sample generation
- Diagnosing the run in progress, reviewing results after — "do the stories make sense?"
- 5 real output samples + retrospective
Prerequisites
Ch 5 TinyStories, Ch 6 BPE, Ch 10 nanoGPT, Ch 12~14. You've worked through Parts 1–3 and the first three chapters of Part 4.
1. All the pieces, assembled
Here's what we've built so far:
| Piece | From | What |
|---|---|---|
| Data | Ch 5 | TinyStories English (200M tokens) |
| Tokenizer | Ch 6 | ByteLevel BPE 8K |
| Model | Ch 10 | GPTMini (10M, dense, decoder-only) |
| Training loop | Ch 12 | AdamW + cosine schedule |
| Precision | Ch 13 | bf16 (A100) or fp16 (T4) |
| Logging + checkpoints | Ch 14 | jsonl + last.pt |
Now we run them all at once.
2. Data preprocessing: tokenize everything upfront
Tokenizing inside the training loop is slow. Pre-tokenize to a .bin file.
- TinyStories train split = roughly 2.4M stories, about 470M tokens (with 8K BPE).
- Insert EOS between stories — the model learns where stories begin and end.
- uint16 (2 bytes): enough for an 8K vocab (IDs up to 65,535), and half the size of int32 (4 bytes).
→ About 470M × 2 bytes = ~1 GB .bin file.
This book trains on the first 200M tokens only: about 20 tokens per parameter for a 10M model, the Chinchilla ratio. To overtrain, use the full set.
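The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the book's prepare_data.py: `write_token_bin` is a hypothetical helper, and the byte-level `toy_encode` only stands in for the real 8K BPE tokenizer so the example runs without any downloads.

```python
import os
import tempfile

import numpy as np


def write_token_bin(stories, encode, eos_id, path, vocab_size=8000):
    """Tokenize each story, append EOS, and write uint16 token IDs to `path`.

    `encode` is any callable mapping text -> iterable of int token IDs
    (an assumption; plug in the real tokenizer's encode step here).
    """
    assert vocab_size <= 2**16, "uint16 can only hold IDs 0..65535"
    total = 0
    with open(path, "wb") as f:
        for text in stories:
            ids = np.asarray(list(encode(text)) + [eos_id], dtype=np.int64)
            # Catch out-of-range IDs before they are silently wrapped by the uint16 cast.
            assert ids.min() >= 0 and ids.max() < vocab_size, "token ID out of range"
            ids.astype(np.uint16).tofile(f)
            total += ids.size
    return total


# Toy demo: UTF-8 bytes as "tokens", with ID 256 as a made-up EOS.
toy_encode = lambda text: list(text.encode("utf-8"))
out_path = os.path.join(tempfile.mkdtemp(), "toy.bin")
n_tokens = write_token_bin(["hi", "abc"], toy_encode, eos_id=256, path=out_path)
print(n_tokens, os.path.getsize(out_path))  # 7 tokens, 14 bytes
```

Writing story by story keeps memory use flat, and the file size is always exactly twice the token count, which makes the "470M tokens, about 1 GB" estimate easy to verify on disk.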
3. Data loader: fast and simple
- mmap — doesn't load the full 1GB into memory. Reads only what's needed. Starts fast.
- Random sampling — no epoch concept. Just cut seq_len tokens from a random position. The nanoGPT standard.
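A minimal sketch of this loader, assuming the uint16 .bin layout from the previous section. The function name `get_batch` follows the nanoGPT convention; the toy file of consecutive integers is only there to make the example self-contained, and the sketch stays in NumPy rather than reproducing the book's PyTorch code.

```python
import os
import tempfile

import numpy as np


def get_batch(data, batch_size, seq_len, rng):
    """Cut `batch_size` random windows of `seq_len` tokens; y is x shifted by one."""
    # Exclusive upper bound leaves room for the one-token shift used by the targets.
    starts = rng.integers(0, len(data) - seq_len, size=batch_size)
    x = np.stack([np.asarray(data[s : s + seq_len]) for s in starts])
    y = np.stack([np.asarray(data[s + 1 : s + 1 + seq_len]) for s in starts])
    # Cast up from uint16: embedding lookups generally expect int64 indices.
    return x.astype(np.int64), y.astype(np.int64)


# Toy stand-in for the real token file: the IDs 0..999 in order.
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
np.arange(1000, dtype=np.uint16).tofile(path)

# mode="r" maps the file read-only; only the sampled slices are actually read.
data = np.memmap(path, dtype=np.uint16, mode="r")
x, y = get_batch(data, batch_size=4, seq_len=16, rng=np.random.default_rng(0))
print(x.shape, y.shape)
```

In a PyTorch training loop the two arrays would typically be wrapped with torch.from_numpy and moved to the device; that step is left out here to keep the sketch dependency-free.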
4. Training script: this book's baseline
- TOTAL_STEPS arithmetic: each step consumes 32 × 512 = 16,384 tokens, so 200_000_000 / (32 * 512) ≈ 12,207. That's about 12K steps.
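The step count can be checked in a few lines. Only TOTAL_STEPS and the numbers come from the text; the other constant names are illustrative.

```python
TOKEN_BUDGET = 200_000_000   # tokens to train on
BATCH_SIZE = 32              # sequences per step
SEQ_LEN = 512                # tokens per sequence

tokens_per_step = BATCH_SIZE * SEQ_LEN          # 16,384
TOTAL_STEPS = TOKEN_BUDGET // tokens_per_step   # 12,207 (floor division)

print(tokens_per_step, TOTAL_STEPS)
```

Floor division drops the final partial step, so the run consumes 12,207 × 16,384 = 199,999,488 tokens, just under the budget.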
5. Actual results: Colab T4 / M2 Pro
Numbers from running this book's training (for reference):
| Environment | Time | Throughput | Final loss |
|---|---|---|---|
| Colab T4 (fp16) | 2.8 hours | 21K tok/s | 2.45 |
| Colab A100 (bf16) | 15 minutes | 230K tok/s | 2.43 |
| M2 Pro MPS (bf16) | 3.5 hours | 17K tok/s | 2.46 |
All under 4 hours — the book's promise holds. Variability comes from Colab disconnects, MPS op fallbacks, and data loader I/O.
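As a rough cross-check of the table, dividing the 200M-token budget by each measured throughput gives a lower bound on wall-clock time. `hours_for` is a hypothetical helper; the throughput figures are the ones reported above.

```python
def hours_for(tokens, tok_per_s):
    """Pure compute time in hours, ignoring checkpointing, stalls, and disconnects."""
    return tokens / tok_per_s / 3600


TOKEN_BUDGET = 200_000_000
throughput = {"Colab T4": 21_000, "Colab A100": 230_000, "M2 Pro MPS": 17_000}

lower_bounds = {name: hours_for(TOKEN_BUDGET, tps) for name, tps in throughput.items()}
for name, hours in lower_bounds.items():
    print(f"{name}: {hours:.2f} h")
```

The reported times run slightly longer than these lower bounds, which is consistent with the overheads listed above.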
Loss curve (consistent across all environments):
| Step | Loss | LR | Note |
|---|---|---|---|
| 0 | 8.99 | 0.0 | initial (ln 8000) |
| 200 | 8.95 | 6e-4 | warmup complete |
| 1000 | 4.20 | 5.7e-4 | rapid drop |
| 2000 | 3.10 | 5.3e-4 | |
| 4000 | 2.78 | 4.1e-4 | |
| 8000 | 2.55 | 1.5e-4 | |
| 12000 | 2.45 | 6e-5 | done |
Loss fell from ln(8000) ≈ 8.99 (the loss of guessing uniformly over the 8K vocab) to 2.45, a reduction of about 6.5 nats. Training worked.
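The starting point is worth verifying by hand: an untrained model that spreads probability evenly over the vocabulary has cross-entropy ln(vocab_size). A short check, with illustrative variable names:

```python
import math

vocab_size = 8000
initial_loss = math.log(vocab_size)   # cross-entropy of a uniform guess, in nats
final_loss = 2.45                     # from the loss curve above
reduction = initial_loss - final_loss

print(round(initial_loss, 2), round(reduction, 2))  # 8.99 6.54
```

If the very first logged loss is far above this value, initialization or the loss computation is the first thing to inspect.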
6. Results: 5 story samples
Sample outputs
>>> Once upon a time
Once upon a time, there was a little girl named Mia. Mia loved to play in the
park with her teddy bear. One day, she found a small flower under a tree. The
flower was pink and pretty. Mia wanted to take it home. But the flower was sad
because it would die if Mia took it away. Mia smiled and said, "I will not
take you. You can stay here."
>>> The little dog wanted
The little dog wanted to play with the cat, but the cat was scared. The dog
said, "Don't be afraid. I just want to be your friend." The cat slowly came
out from under the bed. They played together all day and became best friends.
Observations:
- Grammar: passes
- Coherence: holds for about a paragraph
- Vocabulary: matches the TinyStories distribution
- Hallucination: occasional odd claims (like the flower dying)
→ The "stories make sense" result that Eldan & Li showed for 1M models holds for our 10M model too. Their finding reproduces.
7. Common failure points
1. Vocab mismatch in .bin: if any token ID is ≥ vocab_size (8,000 here), the embedding lookup fails, with an IndexError on CPU and typically a device-side assert on CUDA. Make sure the tokenizer and the model use the same vocab_size.
2. mmap permission issues — some Colab disks don't support mmap. Fall back from np.memmap(..., mode='r') to np.fromfile.
3. seq_len > model.max_len — causes OOM or RoPE extrapolation failure. Keep them identical.
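4. Random sampling collisions: the same position can be sampled twice. The run draws 12,207 steps × 32 × 512 ≈ 200M tokens in total, about 43% of the 470M-token file (or roughly one pass's worth of the 200M-token slice), so windows overlap and an exact repeat of a start position does happen occasionally. That acts like mild data reuse, so this matters very little.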
5. No T4 disconnect protection — Colab free tier has a 12-hour limit and frequent disconnects. Always use last.pt + Drive mount.
6. RoPE buffer issue when generating from final.pt: if the buffer was registered with register_buffer(..., persistent=False), it isn't saved in the checkpoint. It is rebuilt automatically on model init, so this is normal behavior.
7. generate repeating the same word — temperature=0 or too low. Use 0.7~0.9 + top_k=50 as a starting point.
8. Loss plateauing around 2.5 — data ceiling. Go to 500M tokens or increase model size.
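Failure points 1 and 2 can be caught before training starts. The sketch below is one possible pre-flight check, not part of the book's scripts: `load_tokens` and `check_token_range` are hypothetical helpers, and catching `OSError` for an unsupported mmap is an assumption about how the failure surfaces.

```python
import os
import tempfile

import numpy as np


def load_tokens(path, dtype=np.uint16):
    """Prefer mmap; fall back to reading the whole file if mapping fails."""
    try:
        return np.memmap(path, dtype=dtype, mode="r")
    except OSError:
        # Assumed failure mode on filesystems without mmap support.
        # np.fromfile loads everything into RAM (about 1 GB for the full set).
        return np.fromfile(path, dtype=dtype)


def check_token_range(data, vocab_size):
    """Raise early if any token ID would index past the embedding table."""
    max_id = int(np.max(data))  # scans the file once
    if max_id >= vocab_size:
        raise ValueError(f"max token ID {max_id} >= vocab_size {vocab_size}")
    return max_id


# Toy file whose largest ID is the last valid one for an 8,000-entry vocab.
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
np.array([0, 5, 7999], dtype=np.uint16).tofile(path)

tokens = load_tokens(path)
max_id = check_token_range(tokens, vocab_size=8000)
print(max_id)  # 7999
```

Running the range check once after preprocessing is far cheaper than hitting an opaque indexing error hours into a run.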
8. Retrospective: what I'd do differently
Honest notes from running this book's training:
- Data — 200M tokens was sufficient. Overtraining (500M+) might have pushed loss from 2.45 → ~2.30.
- Model — 10M is appropriate for stories. 30M would improve coherence, but 4 hours → 12 hours.
- Tokenizer — 8K BPE was fine. For English-only, 4K might have been enough.
- Training — recommend bf16. fp16 + scaler meant dealing with GradScaler debugging too often.
- Checkpoints — every 1000 steps was plenty. One Colab disconnect happened; resume worked fine.
9. Post-training checklist
- final.pt saved
- Loss curve plot saved (png)
- Training metadata (config.yaml) saved for reproducibility
- Tokenizer file (tokenizer.json) stored alongside the model
- 10 generated samples saved; compare before/after training
- (Optional) WandB / TensorBoard external save
Next → Part 5 Evaluation. Now we find out how well this model actually learned — beyond just loss.
10. Exercises
- Run prepare_data.py in your environment and record the total token count.
- Run train.py briefly (TOTAL_STEPS=500). Compare your loss curve and throughput to the table above.
- After training, generate with temperature 0.5 / 0.8 / 1.2 for the same 5 prompts. How does diversity vs coherence shift?
- Have someone rate 5 stories on a 0~5 scale (grammar, coherence, fun). What's the average?
- (Think about it) What perplexity does loss 2.45 correspond to? What does exp(2.45) mean in concrete terms?
Part 4 wrap-up
| Chapter | What |
|---|---|
| Ch 12 | 5-step training loop + AdamW + cosine schedule |
| Ch 13 | bf16/fp16 mixed precision + gradient accumulation |
| Ch 14 | loss curve diagnosis + resumable checkpoints |
| Ch 15 | TinyStories 200M → 10M model, full cycle |
Where you are: your own 10M model writes children's stories. Next → Part 5 Evaluation.
References
- Eldan & Li (2023). TinyStories. arXiv:2305.07759
- Karpathy. nanoGPT: train.py structure as the standard
- HuggingFace roneneldan/TinyStories: dataset card