Training a BPE Tokenizer from Scratch
What you'll learn
- The BPE (Byte-Pair Encoding) algorithm — merge the most frequent pair, repeat
- Training an 8K-vocab tokenizer with HuggingFace `tokenizers`
- Pitfalls with non-Latin scripts: pre-tokenizer choices, efficiency measurement
- The tokenizer decision for this book's 10M model
Prerequisites
Ch 5 TinyStories training data. Ch 2 APIs vs. building yourself — the tokenization section.
1. Concept: BPE in One Page
Byte-Pair Encoding (BPE) was introduced for neural machine translation (NMT) by Sennrich et al. (2016). The whole algorithm fits in a few lines:
Start: each character = one token
Repeat:
1. Find the most frequent adjacent pair in the text
2. Merge that pair into a new token
3. Add it to the vocab
Stop: when the target vocab size is reached
Small example:
| Step | Text | Vocab |
|---|---|---|
| 0 | l o w</w> l o w e s t</w> (character level) | {l, o, w, e, s, t, </w>} |
| 1 | lo w</w> lo w e s t</w> (l o → lo) | + lo |
| 2 | low</w> low e s t</w> (lo w → low) | + low |
| 3 | low</w> low es t</w> (e s → es) | + es |
| ... | ... | ... |
Frequent substrings become single tokens. Rare words get split into smaller pieces. That's compression, and that's the essence.
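The merge loop above can be sketched in plain Python. This is a toy illustration on the "low lowest" example, not the implementation used by any library: the function name `train_toy_bpe` is made up for this sketch, and ties between equally frequent pairs are broken by first occurrence, so from the third merge onward the result may differ from the table (where every remaining pair occurs once).

```python
from collections import Counter


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` BPE merges from whitespace-separated words."""
    # Start: every word is a list of characters plus an end-of-word marker.
    words = [list(w) + ["</w>"] for w in text.split()]
    merges = []
    for _ in range(num_merges):
        # 1. Count every adjacent symbol pair across all words.
        pairs = Counter()
        for symbols in words:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Ties go to the pair that was seen first (insertion order).
        best = max(pairs, key=pairs.get)
        # 2. Merge that pair everywhere it occurs.
        merged = best[0] + best[1]
        new_words = []
        for symbols in words:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words.append(out)
        words = new_words
        # 3. Record the merge; the merged symbol is the new vocab entry.
        merges.append(best)
    return merges, words


merges, words = train_toy_bpe("low lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # [['low', '</w>'], ['low', 'e', 's', 't', '</w>']]
```

A real trainer stops when the vocab reaches its target size rather than after a fixed number of merges, and counts pairs over word frequencies instead of rescanning the text.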
2. Why It's Needed: The Word vs. Character Tradeoff
| Approach | Token count | OOV (unknown words) | Efficiency |
|---|---|---|---|
| Character-level | Very high (long sequences) | None (knows every character) | Very poor |
| Word-level | Low | Common (typos, new words = OOV) | Vocab explosion |
| BPE / WordPiece / SentencePiece | Middle ground | Almost none | Balanced |
BPE's elegance: common words get one token, rare words get subword combinations. OOV is essentially impossible (worst case: fall back to individual bytes).
It directly affects cost: in API calls, token count = cost = latency. The same sentence can be 5–15 tokens depending on the tokenizer. (See the table in Ch 2.)
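The two extremes in the table above can be seen on a toy example. The sentences here are made up for illustration; the unseen word "hat" is built only from characters that appear in the training text, which is the situation where character-level tokenization has no OOV.

```python
train_text = "the cat sat on the mat"
new_text = "the cat sat on the hat"

# Character-level: tiny vocab and no OOV here, but one token per character.
char_vocab = set(train_text)
char_tokens = list(new_text)
char_oov = [c for c in char_tokens if c not in char_vocab]

# Word-level: short sequences, but any word not seen in training is OOV.
word_vocab = set(train_text.split())
word_tokens = new_text.split()
word_oov = [w for w in word_tokens if w not in word_vocab]

print(len(char_tokens), char_oov)  # 22 []
print(len(word_tokens), word_oov)  # 6 ['hat']
```

A subword tokenizer sits between the two: it would keep "the", "cat", and "sat" as single tokens and split "hat" into pieces it already knows.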
3. Where It's Used: Three BPE Variants
| Variant | Difference | Used by |
|---|---|---|
| GPT BPE (byte-level) | Processes input as bytes. Any character, including non-Latin scripts, is treated as UTF-8 bytes, so nothing is ever OOV. | GPT-2/3/4, Llama 3, Qwen 2.5, SmolLM2 |
| WordPiece | Merging priority is likelihood-based (BPE uses frequency). | BERT, DistilBERT |
| SentencePiece (Unigram + BPE) | Spaces are part of the token (▁the). Handles multilingual text well. | T5, Llama 2, Phi-3, Gemma 2 |
This book uses the ByteLevel BPE from HuggingFace `tokenizers`: it is compatible with the GPT family, handles any language at the byte level, and has no OOV.
4. Minimal Example: Training 8K BPE in 30 Seconds
- ByteLevel: treat UTF-8 bytes as candidate token units. Non-Latin characters (e.g., "안" = 3 bytes in UTF-8) are handled transparently.
- 8K: this book's default. Smaller models benefit from a smaller vocab (lower embedding memory).
- `<|endoftext|>`: end-of-story marker. Used as the separator between stories during training.
- Iterator style: trains on large corpora without loading everything into memory.
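The four points above could be put together roughly as follows. This is a minimal sketch, not the book's listing: the function names, the batch size, `vocab_size=8000` as the reading of "8K", and the output path `tokenizer.json` are all assumptions made for illustration. The `tokenizers` calls themselves (`Tokenizer`, `models.BPE`, `pre_tokenizers.ByteLevel`, `decoders.ByteLevel`, `trainers.BpeTrainer`, `train_from_iterator`, `save`) are part of the library's public API.

```python
from typing import Iterable, Iterator, List


def batch_iterator(stories: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of stories so the full corpus never sits in memory at once."""
    batch: List[str] = []
    for story in stories:
        batch.append(story)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def train_bytelevel_bpe(stories: Iterable[str], vocab_size: int = 8000,
                        out_path: str = "tokenizer.json"):
    """Train a ByteLevel BPE tokenizer and save it (names and defaults are assumptions)."""
    # Imported here so batch_iterator stays usable without the library installed.
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tok.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>", "<|pad|>"],
        # Seed the vocab with all 256 byte symbols so no input is ever OOV.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tok.train_from_iterator(batch_iterator(stories), trainer=trainer)
    tok.save(out_path)
    return tok
```

Because `stories` can be any iterable, passing an open text file or a generator over a dataset keeps memory use flat; passing a small in-memory list works the same way for a quick test.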
Checking Tokenization Results
`check_bpe.py` loads the trained tokenizer and prints the tokens and token count for a few sample sentences.
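A minimal sketch of what such a check script could look like, assuming the tokenizer was saved to `tokenizer.json` (a hypothetical path). The helper names `format_report` and `main` are made up for this sketch; `Tokenizer.from_file` and `encode(...).tokens` are the library's own calls.

```python
def format_report(text, tokens):
    """Format one sentence as: quoted text, token list, token count."""
    return f"{text!r}\ntokens: {tokens}\ncount: {len(tokens)}"


def main(path="tokenizer.json"):
    # Imported here so format_report can be used without the library installed.
    from tokenizers import Tokenizer

    tok = Tokenizer.from_file(path)
    samples = [
        "Once upon a time",
        "A small village in the mountains",
        "Lily loved her toy car",
    ]
    for text in samples:
        print(format_report(text, tok.encode(text).tokens))
```

Calling `main()` after training prints one three-line report per sentence.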
Typical output (when trained on English TinyStories):
'Once upon a time'
tokens: ['Once', 'Ġupon', 'Ġa', 'Ġtime']
count: 4
'A small village in the mountains'
tokens: ['A', 'Ġsmall', 'Ġvillage', 'Ġin', 'Ġthe', 'Ġmountains']
count: 6
'Lily loved her toy car'
tokens: ['Lily', 'Ġloved', 'Ġher', 'Ġtoy', 'Ġcar']
count: 5
What to notice:
- Ġ (U+0120) is ByteLevel BPE's space marker: the space byte is rendered as Ġ, so a token that starts with Ġ begins with a space.
- If you feed in non-Latin text that wasn't in training data, it falls back to byte-level decomposition — token count increases significantly.
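Both observations come from the same byte-to-character table. The sketch below reimplements the mapping scheme that GPT-2 introduced for byte-level BPE (printable bytes keep their character, the rest are shifted to code points 256 and up); it is written from scratch for illustration rather than copied from any library.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 scheme)."""
    # Bytes that are already printable keep their own character.
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("¡"), ord("¬") + 1))
            + list(range(ord("®"), ord("ÿ") + 1)))
    mapping = {b: chr(b) for b in keep}
    # Everything else (space, control bytes, ...) is shifted to 256 and above.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


byte_map = bytes_to_unicode()
print(byte_map[ord(" ")])  # Ġ

raw = "안".encode("utf-8")
visible = "".join(byte_map[b] for b in raw)
print(len(raw), visible)   # 3 ìķĪ
```

A tokenizer that never saw Korean has no merges for these byte symbols, so each of the three bytes of "안" becomes its own token.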
5. In Practice: Training with Your Domain Data
For the capstone (Korean story generator), you need to train the tokenizer on Korean text to get good efficiency.
Token count comparison on the same Korean sentence:
| Tokenizer | "A small village long ago" (Korean) |
|---|---|
| ByteLevel BPE (trained on English) | ~18 tokens |
| ByteLevel BPE (trained on Korean) | 6–8 tokens |
| GPT-4 cl100k_base (multilingual) | 9 tokens |
| Qwen 2.5 BPE (multilingual) | 5 tokens |
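In this table the Korean-trained tokenizer beats cl100k_base and comes close to Qwen 2.5, both of which have far larger vocabularies. When your domain is narrow, training BPE on your own data gives competitive efficiency at a fraction of the vocab size, which is what a small model needs.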
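To produce a comparison like the table above on your own data, it helps to measure every tokenizer the same way. A minimal sketch, assuming each tokenizer can be wrapped as a function from a string to a list of tokens; the helper name `avg_tokens_per_sentence` is made up here, and the two stand-in "tokenizers" (whitespace split and character split) are only there so the example runs without any library.

```python
def avg_tokens_per_sentence(encode, sentences):
    """Average token count, where `encode` maps a string to a list of tokens."""
    return sum(len(encode(s)) for s in sentences) / len(sentences)


sentences = ["Once upon a time", "Lily loved her toy car"]

# Stand-ins for real tokenizers: word-level and character-level.
print(avg_tokens_per_sentence(str.split, sentences))  # 4.5
print(avg_tokens_per_sentence(list, sentences))       # 19.0
```

For a tokenizer trained with `tokenizers`, the wrapper would be `lambda s: tok.encode(s).tokens`; for `tiktoken`, the `encode` method of the encoding object can be passed directly.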
Byte-Level vs. Other Strategies
ByteLevel operates at the UTF-8 byte level, so there's no special handling needed for different scripts. Other approaches exist:
| Strategy | Pros | Cons |
|---|---|---|
| ByteLevel (this book) | Simple, no OOV, standard | Script structure not explicitly represented |
| Unicode normalization pre-split | Script morphology visible | Complex reconstruction, non-standard |
| Syllable + BPE | Intuitive | OOV on unusual characters, emoji |
This book sticks with ByteLevel — standard compatibility first.
6. Common Failure Modes
1. Vocab size too large. For a 10M model, vocab=32K means the embedding alone is about 8M parameters (roughly 80% of the model). 8K is the right balance here. Models over 1B can use 50K–150K.
2. Missing pre-tokenizer. Using BPE without ByteLevel means OOV on any character not seen during training. Always pair ByteLevel with BPE.
3. Forgetting special tokens. Without `<|endoftext|>`, the model can't learn sequence boundaries during training. Don't forget `<|pad|>` either.
4. Training corpus too small. Training BPE on 100 documents leaves too few frequent pairs for the merges to be meaningful. At least 10,000 documents is recommended.
5. Mismatched distributions. Training BPE on Wikipedia and then training the model on TinyStories hurts efficiency. Train the tokenizer on the same distribution the model will see.
6. Not setting the decoder. Without it, `tok.decode(ids)` returns garbled output (byte-level symbols such as Ġ instead of spaces). Always set the ByteLevel decoder explicitly.
7. Fast vs. slow tokenizer. The `tokenizers` library (Rust) is fast. Wrap it as `PreTrainedTokenizerFast` from `transformers` so it works efficiently in your training loop.
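The arithmetic behind the first failure mode can be checked directly. The hidden size is not stated in this chapter; `D_MODEL = 256` below is an assumption inferred from "32K vocab is about 8M parameters", and the sketch counts only the input embedding matrix.

```python
def embedding_params(vocab_size, d_model):
    """Parameters in the token embedding matrix (one d_model vector per token)."""
    return vocab_size * d_model


D_MODEL = 256        # assumption: inferred from 32K vocab -> about 8M parameters
TOTAL = 10_000_000   # the 10M-parameter budget from the text

for vocab in (8_000, 16_000, 32_000):
    emb = embedding_params(vocab, D_MODEL)
    print(f"vocab={vocab:>6}: {emb / 1e6:.2f}M embedding params, "
          f"{emb / TOTAL:.0%} of the budget")
```

Under this assumption an 8K vocab costs about 2M parameters (around a fifth of the budget), while 32K costs about 8.2M and leaves almost nothing for the transformer layers.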
7. Operational Checklist
Tokenizer decision gate:
- vocab_size: proportional to model size (10M = 8K, 100M = 16K, 1B = 32K–50K)
- pre-tokenizer: ByteLevel (GPT family) or SentencePiece (T5 family)
- special_tokens: `<|endoftext|>`, `<|pad|>`, and if needed `<|user|>`, `<|assistant|>`
- Training corpus matches model training corpus distribution
- Token efficiency measurement: average tokens per sentence on 100 domain examples
- `transformers` compatibility: `PreTrainedTokenizerFast(tokenizer_object=tok)`
- When uploading to HF Hub: both `tokenizer.json` and `tokenizer_config.json` (Ch 22, capstone)
8. Exercises
- Run the §4 code as-is to train an 8K-vocab BPE, then measure how many tokens it produces for one sentence that wasn't in training. Compare with the Korean-trained version from §5.
- Train with vocab_size of 4K, 8K, 16K, and 32K. Compare average tokens per document across 100 examples. Where do you see diminishing returns?
- Encode `"Price: $1,234.56"` with this book's BPE and with GPT-4's tokenizer via `tiktoken`. How do they handle numbers differently?
- Apply Unicode NFD normalization to decompose characters before BPE training. How does the token count change compared to the §5 table?
- (Think about it) Training on domain data improves token efficiency for that language/domain. What happens to token efficiency for other languages? How would you maintain both?
References
- Sennrich et al. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 (BPE origin)
- Radford et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2). (ByteLevel BPE established)
- Kudo & Richardson (2018). SentencePiece. arXiv:1808.06226
- HuggingFace `tokenizers` library docs
- Karpathy. Let's build the GPT Tokenizer (YouTube, 2024). (Hands-on BPE implementation walkthrough)