Seq2seq Mini: ITN
What you'll learn
- Seq2seq (encoder-decoder) — the third architecture, different from encoder-only (BERT) and decoder-only (GPT)
- ITN (Inverse Text Normalization) — "zero one zero" → "010", "twenty twenty-six" → "2026"
- Train an ITN model with byT5-small + synthetic pairs
- Direct domain applications: STT post-processing, translation, summarization (long input → short output)
Prerequisites
Ch 8 Attention, Ch 24 LoRA. Encoder vs decoder differences (Ch 25).
1. Concept: The Third Shape
| Architecture | Encoder | Decoder | Best for |
|---|---|---|---|
| Encoder-only (BERT) | ✓ | × | Classification / NER (Ch 25) |
| Decoder-only (GPT) | × | ✓ | Generation (Ch 24) |
| Encoder-Decoder (T5) | ✓ | ✓ | Transformation (translation, summarization, ITN) |
The core of seq2seq: the encoder reads the entire input bidirectionally, the decoder generates the output one token at a time, and cross-attention connects the two.
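To make the three attention patterns concrete, here is a minimal sketch in plain Python (no framework) of which positions each component is allowed to look at. The helper names are illustrative only and do not come from any library.

```python
def bidirectional_mask(n):
    """Encoder self-attention: every input position sees every other one."""
    return [[True] * n for _ in range(n)]


def causal_mask(m):
    """Decoder self-attention: output position i sees only positions 0..i."""
    return [[j <= i for j in range(m)] for i in range(m)]


def cross_mask(m, n):
    """Cross-attention: each of m output positions sees all n encoder positions."""
    return [[True] * n for _ in range(m)]


def show(mask):
    """Render a mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print("encoder (3 inputs):")
    print(show(bidirectional_mask(3)))
    print("decoder (3 outputs):")
    print(show(causal_mask(3)))
    print("cross (3 outputs x 5 inputs):")
    print(show(cross_mask(3, 5)))
```

The decoder mask is lower-triangular, while the encoder and cross-attention masks are fully open. That difference is what the rest of this chapter relies on.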
2. Why Seq2seq Fits Transformation Tasks
| Task aspect | decoder-only | seq2seq |
|---|---|---|
| Input length ≠ output length | possible, but awkward | natural |
| Bidirectional understanding of input | weak (causal mask) | strong |
| Translation, summarization, ITN | △ | ◎ |
ITN ("zero one zero" → "010") needs to read the full input in both directions to produce the numeric form. Seq2seq handles this naturally.
Decoder-only models can do ITN too, given an instruction-style prompt that asks for the written form of "zero one zero". But for models under 300M parameters, seq2seq usually wins.
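The practical difference shows up in how one training pair is laid out. The sketch below is an assumption for illustration: the prompt template and both function names are hypothetical, not a format prescribed by any model.

```python
# Hypothetical instruction template for the decoder-only variant.
PROMPT = "Convert spoken form to written form.\nSpoken: {spoken}\nWritten:"


def seq2seq_example(spoken, written):
    """Seq2seq keeps source and target apart: encoder input vs decoder labels."""
    return {"input": spoken, "target": written}


def decoder_only_example(spoken, written):
    """Decoder-only packs the prompt and the answer into one sequence."""
    prompt = PROMPT.format(spoken=spoken)
    return {"prompt": prompt, "text": f"{prompt} {written}"}


if __name__ == "__main__":
    print(seq2seq_example("zero one zero", "010"))
    print(decoder_only_example("zero one zero", "010")["text"])
```

With the decoder-only layout, the loss is usually computed on the answer part only, which adds bookkeeping that the seq2seq layout does not need.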
3. Model Options
| Model | Parameters | Notes | License |
|---|---|---|---|
| t5-small | 60M | English-focused | Apache 2.0 |
| t5-base | 220M | English | Apache 2.0 |
| byT5-small | 300M | byte-level (no subword vocabulary), multilingual | Apache 2.0 |
| mt5-small | 300M | 101 languages | Apache 2.0 |
Recommended for Korean ITN: byT5-small — byte-level means no OOV for Korean / numerals / Chinese characters. Strong for character-level tasks like ITN.
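A small sketch of what "byte-level" means in practice. It assumes the id layout commonly described for byT5 (ids 0, 1, 2 reserved for pad, eos, unk, then one id per raw byte); treat the offset as an assumption and verify it against the real tokenizer before relying on it.

```python
BYTE_OFFSET = 3  # assumption: ids 0..2 are reserved for pad / eos / unk
EOS_ID = 1       # assumption: end-of-sequence id


def byte_ids(text):
    """Map text to one id per UTF-8 byte, followed by an end-of-sequence id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_ids(ids):
    """Invert byte_ids, skipping the reserved ids."""
    raw = bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    for text in ["010", "이천이십육", "2026년 4월"]:
        print(text, "->", len(byte_ids(text)), "ids")
```

Any string maps to ids in this scheme, so there is no out-of-vocabulary case. The cost is length: each Hangul syllable takes three bytes in UTF-8, so Korean input expands quickly.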
4. ITN Task Definition
| Input (spoken) | Output (written) |
|---|---|
| zero one zero one two three four five six seven eight | 010 1234 5678 |
| twenty twenty-six April | 2026년 4월 |
| one hundred forty thousand won | 14만원 |
| seven percent | 7% |
| zero point five | 0.5 |
Rule-based approaches built on finite-state transducers (FSTs) exist, but for Korean ITN, context-dependent ambiguity tends to favor learned models:
- "이" → 2 (numeral) or "이" (grammatical particle) — context-dependent
- "백" → 100 (numeral) or "백" (a Korean surname)
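The ambiguity is easy to reproduce with a context-free lookup. The snippet below is a deliberately naive toy, not a real FST grammar, and the rule table is made up for illustration.

```python
# Toy lookup table; a real FST grammar would be far richer than this.
NAIVE_RULES = {"백": "100", "이": "2"}


def naive_itn(text):
    """Replace each whitespace-separated token by its numeral, ignoring context."""
    return " ".join(NAIVE_RULES.get(tok, tok) for tok in text.split())


if __name__ == "__main__":
    print(naive_itn("백 원"))      # numeral reading: the conversion is right
    print(naive_itn("백 선생님"))  # surname reading: the conversion is wrong
```

Both inputs contain the same token, so any rule that looks only at the token itself must get one of them wrong. A learned model can use the neighboring words to decide.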
5. Synthetic Data: 10K Pairs
In practice, Korean numeral conversion is complex (이천이십육 vs 2026, etc.). The AI Hub ITN dataset from KAIST and others is recommended for real use.
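As a starting point, here is a minimal sketch of a synthetic pair generator. It uses the English spoken forms from the task table above rather than real Korean numerals, covers only three made-up categories (phone, percent, decimal), and every function name is hypothetical.

```python
import random

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]


def spell_digits(digits):
    """Read a digit string one digit at a time."""
    return " ".join(DIGITS[int(d)] for d in digits)


def spell_under_100(n):
    """Spell an integer in the range 0..99."""
    if n < 10:
        return DIGITS[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{DIGITS[ones]}"


def make_phone(rng):
    body = "".join(str(rng.randint(0, 9)) for _ in range(8))
    return spell_digits("010" + body), f"010 {body[:4]} {body[4:]}"


def make_percent(rng):
    n = rng.randint(0, 99)
    return f"{spell_under_100(n)} percent", f"{n}%"


def make_decimal(rng):
    whole, frac = rng.randint(0, 9), rng.randint(1, 9)
    return f"{DIGITS[whole]} point {DIGITS[frac]}", f"{whole}.{frac}"


MAKERS = {"phone": make_phone, "percent": make_percent, "decimal": make_decimal}


def make_pairs(n, seed=0):
    """Return n rows with keys category / spoken / written; same seed, same rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        category = rng.choice(sorted(MAKERS))
        spoken, written = MAKERS[category](rng)
        rows.append({"category": category, "spoken": spoken, "written": written})
    return rows


if __name__ == "__main__":
    for row in make_pairs(5):
        print(row)
```

Keeping the category on each row makes per-category accuracy cheap to compute later. A Korean version would swap the spelling helpers for Sino-Korean and native numeral rules, which is where most of the real effort goes.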
6. Training: byT5-small Fine-tune
Training time: roughly 30 minutes on a single T4 GPU for 10K pairs × 5 epochs.
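A fine-tuning run along these lines could be sketched with Hugging Face `transformers` and `datasets`. This is a sketch under stated assumptions, not the chapter's original code: the column names `spoken` and `written`, the helper names, and the hyperparameters (learning rate, batch size) are placeholders to tune, and the heavy imports are kept inside `train` so the preprocessing helper can be tested without downloading a model.

```python
MODEL_NAME = "google/byt5-small"


def to_features(batch, tokenizer, max_len=128):
    """Tokenize a batch: spoken text feeds the encoder, written text becomes labels."""
    features = tokenizer(batch["spoken"], max_length=max_len, truncation=True)
    targets = tokenizer(text_target=batch["written"], max_length=max_len, truncation=True)
    features["labels"] = targets["input_ids"]
    return features


def train(rows, output_dir="byt5-itn"):
    """Fine-tune on rows shaped like {"spoken": ..., "written": ...}."""
    from datasets import Dataset
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    data = Dataset.from_dict({
        "spoken": [r["spoken"] for r in rows],
        "written": [r["written"] for r in rows],
    }).train_test_split(test_size=0.1, seed=0)
    data = data.map(
        lambda batch: to_features(batch, tokenizer),
        batched=True,
        remove_columns=["spoken", "written"],
    )

    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=5,              # matches the 5 epochs quoted above
        per_device_train_batch_size=16,  # placeholder, adjust to GPU memory
        learning_rate=3e-4,              # placeholder, not a tuned value
        predict_with_generate=True,
        logging_steps=50,
        report_to="none",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=data["train"],
        eval_dataset=data["test"],
        # Pads inputs per batch and pads labels with -100 so padding is ignored in the loss.
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    return trainer
```

Calling `train(rows)` with 10K rows would start the run. Padding is left to the collator rather than done at tokenization time, so each batch is only as long as its longest example.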
7. Inference + Evaluation
Typical results (trained on 10K pairs): exact match (EM) ≈ 88–95%, assuming a narrow domain and sufficient synthetic data.
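A minimal sketch of the inference and EM computation follows. `predict` assumes a fine-tuned `transformers` seq2seq model and its tokenizer are already loaded; both function names and the `max_new_tokens` default are illustrative choices.

```python
def exact_match(predictions, references):
    """Fraction of predictions identical to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def predict(model, tokenizer, texts, max_new_tokens=64):
    """Generate written forms for a list of spoken-form strings."""
    import torch  # imported lazily so exact_match works without torch installed

    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


if __name__ == "__main__":
    print(exact_match(["7%", "0.5"], ["7%", "0.6"]))  # 0.5
```

EM is strict by design: a single wrong digit in a phone number counts as a full miss.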
8. Common Failure Points
- Padding tokens in labels: Must be set to `-100` so they're ignored in the loss. `DataCollatorForSeq2Seq` handles this automatically.
- byT5's byte-level tokenization: Token count is 5–10× higher than with a subword tokenizer. Use a generous seq_len (128+).
- Limits of synthetic data: Ambiguous cases (이 = 2 or grammatical particle) won't be in the training data. Augment with real STT output.
- Inference speed of encoder-decoder: Slower than decoder-only because inference runs in two steps (encode, then decode). A small model (300M) is still fine for production.
- Only measuring EM: Partial matches (e.g., "010-1234-5678" → "010-1234-567") still carry meaning. Report edit distance too.
- Trying ITN with decoder-only: Qwen 0.5B with ITN LoRA also works. With enough data, performance is comparable.
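Two of the points above are small enough to show directly. The sketch below masks padding in a label sequence by hand and computes a character-level Levenshtein distance. Function names are illustrative. The `-100` value is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_pad_labels(label_ids, pad_token_id):
    """Replace padding ids in a label sequence so the loss skips them."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(mask_pad_labels([51, 52, 1, 0, 0], pad_token_id=0))
    print(edit_distance("010-1234-5678", "010-1234-567"))  # 1
```

The phone number example scores 0 on EM but is only one edit away from the reference, which is the information EM alone hides.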
9. Ops Checklist
ITN model ops gate:
- 10K+ synthetic pairs
- 1,000+ real STT output pairs (if available)
- EM + edit distance as dual metrics
- Per-category accuracy (phone / amount / date)
- Inference speed (single sentence, p95)
- Verify training distribution matches STT output distribution
- (Part 8, Ch 30) Drift monitoring — new terminology, etc.
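Two checklist items, per-category accuracy and p95 latency, can be sketched with the standard library alone. The row format (`category`, `written`) and the `predict_one` callable are assumptions for illustration. The percentile uses the nearest-rank definition.

```python
import time
from collections import defaultdict


def per_category_accuracy(rows, predictions):
    """Exact-match accuracy grouped by each row's category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row, pred in zip(rows, predictions):
        totals[row["category"]] += 1
        hits[row["category"]] += int(pred == row["written"])
    return {cat: hits[cat] / totals[cat] for cat in totals}


def p95(values):
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def latency_p95(predict_one, sentences):
    """Time predict_one on each sentence and return the p95 in seconds."""
    timings = []
    for sentence in sentences:
        start = time.perf_counter()
        predict_one(sentence)
        timings.append(time.perf_counter() - start)
    return p95(timings)


if __name__ == "__main__":
    rows = [{"category": "phone", "written": "010 1234 5678"},
            {"category": "percent", "written": "7%"}]
    print(per_category_accuracy(rows, ["010 1234 5678", "8%"]))
    print(latency_p95(lambda s: s.upper(), ["seven percent"] * 20))
```

For a meaningful latency number, run a few warm-up calls first and measure on the hardware that will serve production traffic.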
10. Exercises
- Train byT5-small ITN on 10K pairs. Measure EM.
- Train a decoder-only LoRA (Qwen 0.5B) on the same data. Compare EM with seq2seq.
- Plot the EM learning curve for 1K / 5K / 10K / 50K training samples.
- Compare byT5-small vs t5-small (English) for Korean ITN. How different is the performance?
- (Think about it) What other tasks in your domain, beyond STT post-processing and translation, would fit seq2seq?
Part 7 Wrap-Up
| Chapter | What you did |
|---|---|
| Ch 22 | Compare 5 off-the-shelf sLLMs + decision tree |
| Ch 23 | From-scratch vs fine-tuning — laptop memory math |
| Ch 24 | LoRA / QLoRA — 30 lines with peft |
| Ch 25 | Encoder NER — domain entity extraction |
| Ch 26 | Decoder LoRA + continued pre-training |
| Ch 27 | Distillation mini — Teacher → Student |
| Ch 28 | Seq2seq mini — ITN |
Next → Part 8 Production Operations. Four final checkpoints to take your trained model into production.
References
- Vaswani et al. (2017). Attention Is All You Need. — encoder-decoder origin
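- Raffel et al. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). arXiv:1910.10683
- Xue et al. (2022). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. arXiv:2105.13626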
- Zhang et al. (2019). Neural ITN. — neural network approach to ITN