Skip to content

Classification and NER Fine-tuning (Encoder)

Open in Colab

What you'll learn

  • Decoder vs Encoder — why encoders win on classification and NER
  • Korean options: KoELECTRA, klue/bert-base, xlm-roberta-base
  • Token classification head + IOB tagging
  • Mini domain entity extraction — pulling phone numbers, amounts, product names, and contract IDs from call transcripts

Prerequisites

Ch 8 Attention — the mask difference. Ch 23 Decision Tree.


Encoder NER — IOB tagging pipeline Encoder NER — IOB tagging pipeline

1. Concept — Decoder vs Encoder

Architecture Mask What token i sees Suited for
Decoder (GPT) causal 0..i (itself + past) generation
Encoder (BERT) none 0..T (entire sequence, bidirectional) classification / NER / extraction

Classification and NER require each token to see full left and right context to be accurate. Encoders are the natural fit.

A decoder can classify too (use the final hidden state → classification head), but an encoder of the same size usually wins.


2. Why Encoders Win Here

Aspect decoder (3B) encoder (110M)
Bidirectional context ×
Inference speed slow (autoregressive) fast (one forward pass)
Memory large small
Classification accuracy (same task) comparable usually 1–3% better

In production environments like call center NER: encoders are the right choice — fast, small, and accurate.


3. Korean Encoder Options

Model Parameters Notes License
klue/bert-base 110M KLUE benchmark base, Korean-focused Apache 2.0
monologg/koelectra-base-v3-discriminator 110M KoELECTRA, Korean SOTA encoder Apache 2.0
xlm-roberta-base 270M 100 languages MIT
xlm-roberta-large 550M Larger version, better NER scores MIT

Default recommendation: klue/bert-base — Korean only, small, clean.

For special domains (call center, medical), continued pre-training before fine-tuning is the right approach — but this book covers fine-tuning only.


4. Task Definition — IOB Tagging

Call transcript NER example:

Input: "Please refund 140,000 won to my mobile number 010-1234-5678"
Output:
  Token            Tag
  Please           O
  refund           O
  140,000          B-MONEY
  won              I-MONEY
  to               O
  my               O
  mobile           O
  number           O
  010              B-PHONE
  -                I-PHONE
  1234             I-PHONE
  -                I-PHONE
  5678             I-PHONE

Tag structure (BIO/IOB):

  • B- Begin (entity start)
  • I- Inside (entity continuation)
  • O Outside (not an entity)

This chapter's mini NER has 4 entity types:

Entity Example
PHONE 010-1234-5678
MONEY 140,000 won, 50,000 won
PRODUCT Galaxy S25, iPhone 16
CONTRACT Contract number KR-2026-001

5. Synthetic Data — Start with 100 Sentences

ner_synth.py
import random, anthropic, json
client = anthropic.Anthropic()

PROMPT = """Generate one customer service call sentence. Naturally include 1–2 of the following:
- Phone number (PHONE): format 010-XXXX-XXXX
- Amount (MONEY): in Korean or numeric form
- Product name (PRODUCT): Galaxy, iPhone, etc.
- Contract ID (CONTRACT): format KR-YYYY-XXX

Output format (JSON):
{"text": "...", "entities": [{"start": 0, "end": 12, "label": "PHONE"}, ...]}

Output the sentence only."""

samples = []
for i in range(100):
    msg = client.messages.create(model="claude-haiku-4-5", max_tokens=500,
                                  messages=[{"role":"user","content":PROMPT}])
    try:
        samples.append(json.loads(msg.content[0].text))
    except: pass
    if i % 20 == 0: print(f"  {i}/100")

with open("ner_train.jsonl","w") as f:
    for s in samples: f.write(json.dumps(s, ensure_ascii=False)+"\n")

100 sentences ≈ 5 minutes, about $0.05. For real use, 1,000+ is recommended.

Span → IOB Conversion

span_to_iob.py
def to_iob(text, entities, tokenizer):
    """Convert char-span annotations to token IOB labels."""
    enc = tokenizer(text, return_offsets_mapping=True)
    offsets = enc.offset_mapping
    labels = ["O"] * len(offsets)
    for ent in entities:
        first = True
        for i, (s, e) in enumerate(offsets):
            if s >= ent["start"] and e <= ent["end"]:
                labels[i] = ("B-" if first else "I-") + ent["label"]
                first = False
    return enc.input_ids, labels

6. Training with transformers Trainer

ner_train.py
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                           TrainingArguments, Trainer, DataCollatorForTokenClassification)
from datasets import load_dataset

base = "klue/bert-base"
LABELS = ["O", "B-PHONE","I-PHONE", "B-MONEY","I-MONEY",
          "B-PRODUCT","I-PRODUCT", "B-CONTRACT","I-CONTRACT"]
id2label = {i:l for i,l in enumerate(LABELS)}
label2id = {l:i for i,l in id2label.items()}

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(
    base, num_labels=len(LABELS), id2label=id2label, label2id=label2id)

ds = load_dataset("json", data_files="ner_train.jsonl")["train"]

def preprocess(batch):
    enc = tok(batch["text"], truncation=True, return_offsets_mapping=True)
    labels = []
    for i, ents in enumerate(batch["entities"]):
        # IOB tagging (using the function above)
        ...
    enc["labels"] = labels
    return enc

ds = ds.map(preprocess, batched=True)
args = TrainingArguments(output_dir="ner_out", num_train_epochs=5,
                          learning_rate=3e-5, per_device_train_batch_size=16,
                          warmup_ratio=0.1, lr_scheduler_type="linear", bf16=True)
trainer = Trainer(model=model, args=args, train_dataset=ds, tokenizer=tok,
                   data_collator=DataCollatorForTokenClassification(tok))
trainer.train()
trainer.save_model("ner_out/final")

Training time: T4, 1000 pairs × 5 epochs ≈ 10 minutes.


7. Inference + F1 Evaluation

ner_eval.py
from transformers import pipeline

ner = pipeline("token-classification", model="ner_out/final",
                aggregation_strategy="simple")          # auto-merge B-/I- spans
res = ner("Please refund 140,000 won to mobile 010-1234-5678")
# [{'entity_group':'PHONE', 'word':'010-1234-5678', ...},
#  {'entity_group':'MONEY', 'word':'140,000 won', ...}]

# Entity-level F1
from seqeval.metrics import f1_score, classification_report

predictions, references = [], []
for sample in val_set:
    pred = ner(sample["text"])
    predictions.append(to_iob_tags(pred, sample["text"]))
    references.append(sample["iob_tags"])

print(f"F1: {f1_score(references, predictions):.3f}")
print(classification_report(references, predictions))

Typical results (trained on 1000 pairs):

              precision    recall  f1-score
PHONE             0.97      0.95      0.96
MONEY             0.92      0.88      0.90
PRODUCT           0.85      0.82      0.83
CONTRACT          0.94      0.93      0.93

micro avg         0.92      0.89      0.91

100 pairs → F1 around 0.7. 1,000+ pairs → F1 around 0.9.


8. Common Failure Points

1. Char-span to token IOB conversion mistakes — Use return_offsets_mapping=True to handle this automatically. Watch for WordPiece sub-word boundaries.

2. Label imbalance — When O makes up 90% of labels, learning B-/I- tags is hard. Use class weights or adjust entity ratio when generating synthetic data.

3. Learning rate too high — Encoder fine-tuning standard is 3e-5. Above 1e-4, training diverges.

4. Too few or too many epochs — 1,000 pairs × 5 epochs is a good balance. 100 pairs × 30 epochs also works (watch for overfitting).

5. Eval distribution differs from training — If you generate synthetic training data but evaluate on real logs, the domain gap will hurt. Label at least 100 real log samples separately.

6. Missing post-processing on NER output — Without aggregation_strategy="simple", you get per-subword outputs. Use pipeline as the standard.

7. Using NER where ITN is needed — "zero one zero" → "010" is a transformation, not a classification. Seq2seq (Ch 28) is the answer there.


9. Ops Checklist

NER model ops gate:

  • Define entity types (4–10)
  • 1,000+ training pairs (mix synthetic + real logs)
  • 100+ evaluation pairs (real logs)
  • F1 ≥ 0.85 (practical threshold)
  • Per-label F1 breakdown (which entity type is weakest)
  • Inference speed (single batch, p95)
  • Write model card (Ch 22's 7-item checklist)
  • (Part 8, Ch 30) Regression eval + drift monitoring

10. Exercises

  1. Define 4 entity types for your own domain, generate 100 synthetic pairs, and train klue/bert-base. Measure F1.
  2. Compare KoELECTRA vs klue/bert vs xlm-roberta on the same data. How different is the F1?
  3. Train with 100 / 500 / 1000 / 5000 samples. Plot the F1 learning curve.
  4. Compare inference speed — this book's NER model vs Qwen 2.5-0.5B LoRA for NER (p95 latency).
  5. (Think about it) Could you implement ITN as "encoder NER + post-processing rules"? What are the trade-offs vs seq2seq?

References

  • Devlin et al. (2018). BERT. arXiv:1810.04805
  • Park et al. (2020). KoELECTRA. GitHub
  • Park et al. (2021). KLUE. arXiv:2105.09680
  • Conneau et al. (2019). XLM-R. arXiv:1911.02116
  • HuggingFace seqeval — entity-level F1