Data Pipeline: PII · Synthetic Labels · IAA
What you'll learn
- PII masking — phone numbers, card numbers, national IDs, names. Regex combined with NER
- LLM synthetic labels — teacher model turns raw text into label candidates (applying Ch 27 techniques)
- Mini IAA (Inter-Annotator Agreement) — Cohen's κ on 100 samples
- Data versioning — DVC or hashing. One-to-one traceability from dataset to trained model
Prerequisites
1. Concept: Operational Data vs. Training Data
| Aspect | Synthetic data (Ch 5/7) | Operational data (this chapter) |
|---|---|---|
| Source | LLM synthesis | Real logs (calls, chat, etc.) |
| PII | None | Everywhere |
| Labels | Teacher auto-generates | Mix of human labels + LLM synthesis |
| Validation | Filter pass rate | Human IAA + regression |
| License | Teacher API ToS | Company data policy |
Operational data must pass three gates — legal, security, and quality — before it can be used for training.
2. PII Masking: 4 Stages
Stage 1. Regex (catches ~90%)
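A minimal sketch of the regex stage, assuming hyphenated phone numbers, 16-digit card numbers in groups of four, and a 6-7 digit national ID format. The patterns, the tag names, and the `mask_regex` helper are illustrative assumptions; adapt them to the formats that actually occur in your logs.

```python
import re

# Illustrative patterns only (assumed formats, not a complete PII list).
# Longer, more specific patterns run first so shorter ones cannot split them.
PII_PATTERNS = [
    ("CARD", re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")),
    ("NATIONAL_ID", re.compile(r"\b\d{6}-\d{7}\b")),
    ("PHONE", re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")),
]

def mask_regex(text: str) -> str:
    """Replace every match of each pattern with its bracketed tag."""
    for tag, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{tag}]", text)
    return text
```

Names and addresses have no fixed shape, so this stage leaves them untouched by design.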
Stage 2. NER Model (names, addresses)
Reuse the NER model from Ch 25 — person names, addresses, organization names.
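```python
from transformers import pipeline

# "ner_pii_model" stands for the NER checkpoint trained in Ch 25.
ner = pipeline("token-classification", model="ner_pii_model",
               aggregation_strategy="simple")

def mask_ner(text):
    # Replace by character offset, right to left, so earlier offsets stay valid.
    # Replacing ent["word"] with str.replace can miss entities whose detokenized
    # form differs from the source text, or mask unrelated matches.
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        tag = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + tag + text[ent["end"]:]
    return text
```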
Stage 3. LLM Validation (residual)
Pass the regex + NER output through an LLM once more to catch any remaining PII.
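One way to structure this pass is sketched below. It assumes a caller-supplied `ask_llm` function that takes a prompt string and returns the model's text reply; that function, the prompt wording, and the generic `[PII]` tag are all hypothetical, so no particular provider API is implied.

```python
from typing import Callable

# Hypothetical prompt; tune the wording for your own model and tag set.
PROMPT = (
    "The text below has already been masked with tags such as [PHONE]. "
    "List any remaining personal information, one item per line. "
    "Reply with NONE if nothing is left.\n\nText:\n{text}"
)

def find_residual_pii(text: str, ask_llm: Callable[[str], str]) -> list[str]:
    """Return the leftover PII strings reported by the LLM."""
    reply = ask_llm(PROMPT.format(text=text)).strip()
    if reply.upper() == "NONE":
        return []
    # Drop list bullets and surrounding whitespace from each reported item.
    return [line.strip("-• ").strip() for line in reply.splitlines() if line.strip()]

def mask_residual(text: str, ask_llm: Callable[[str], str]) -> str:
    """Mask only items that literally occur in the text."""
    for item in find_residual_pii(text, ask_llm):
        if item and item in text:  # ignore items the model made up
            text = text.replace(item, "[PII]")
    return text
```

Checking that each reported item really occurs in the text keeps a hallucinated answer from silently altering the data.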
Stage 4. Human Spot Check
After automated processing, have a person read 100–500 samples directly.
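Drawing the review sample with a fixed seed makes the spot check reproducible, so a second reviewer can look at exactly the same records. A small sketch, where `draw_spot_check` and its defaults are assumptions:

```python
import random

def draw_spot_check(records, n=100, seed=0):
    """Sample up to n records without replacement, reproducibly."""
    records = list(records)
    # A dedicated Random instance avoids touching the global random state.
    return random.Random(seed).sample(records, min(n, len(records)))
```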
| Stage | PII caught | Cost |
|---|---|---|
| Regex | 90% | Near zero |
| NER | +5% (names, addresses) | Low |
| LLM validation | +3–4% | Medium |
| Human spot check | +0.5–1% | High |
→ Cumulatively, the four stages reach roughly 99% coverage. 100% is impossible: even after a spot check, assume some identifiable information remains and invest in data governance accordingly.
3. Synthetic Labels: Teacher Cost
Same approach as the synthetic data generation in Ch 27 Distillation.
| Labeling method | Cost (10K pairs) | Consistency | Capability ceiling |
|---|---|---|---|
| Human annotators | $5K–50K | △ (annotator variance) | Human ability |
| Teacher (Haiku) | $1–5 | ◎ (consistent) | Haiku ability |
| Teacher (Sonnet) | $30–50 | ◎ | Sonnet ability |
| Teacher (Opus) | $300–500 | ◎ | Opus ability |
Best value: Haiku for first-pass synthesis → review 200–500 samples with Opus or a human.
4. IAA: Measuring Label Consistency
IAA quantifies how much multiple annotators (human or LLM) agree. The standard metric for two annotators is Cohen's κ:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
- p_o = fraction of cases where annotators gave the same answer
- p_e = expected agreement by chance
| κ | Interpretation |
|---|---|
| 0.0–0.2 | Almost no agreement |
| 0.2–0.4 | Slight |
| 0.4–0.6 | Moderate |
| 0.6–0.8 | Good (practical threshold) |
| 0.8–1.0 | Excellent |
iaa_kappa.py
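A minimal sketch of a κ computation for two annotators, written from the definitions of p_o and p_e above. The function name and input shape (two equal-length label lists) are assumptions, not the book's original listing.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # p_o: fraction of items where both annotators gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

For example, with 100 items, 70 agreements, and annotators who say "pos" 50% and 60% of the time, p_o = 0.7 and p_e = 0.5, so κ = 0.4. If scikit-learn is available, `sklearn.metrics.cohen_kappa_score` computes the same quantity.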
Workflow for this book
- Two annotators (or Haiku + a human) label 100 items
- Measure κ
- κ < 0.6 → rewrite label definitions + label 100 more
- κ ≥ 0.6 → proceed to full-scale labeling
5. Data Versioning: One-to-One Traceability
Trained model → dataset used → synthesis timestamp → PII masking version. You need the full chain.
Include data_hash in the model card at training time → you can always trace which dataset produced which model.
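A sketch of the hashing approach, assuming the dataset is a single file and the metadata lives in a JSON file next to it. Apart from `data_hash`, the field names and the `.meta.json` suffix are illustrative assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Hash the file in chunks so large datasets need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_metadata(data_path, pii_mask_version, iaa_kappa=None):
    """Write <data_path>.meta.json and return the metadata dict."""
    meta = {
        "data_file": Path(data_path).name,
        "data_hash": file_sha256(data_path),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "pii_mask_version": pii_mask_version,  # hypothetical version label
        "iaa_kappa": iaa_kappa,
    }
    meta_path = Path(str(data_path) + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

Copying `meta["data_hash"]` into the model card ties the trained model to one exact file: any change to the data, even a single byte, produces a different hash.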
Alternative: DVC (Data Version Control) — git-style versioning for data. Recommended for large datasets.
6. Common Failure Modes
- Trusting regex alone — names and addresses won't get caught. Always pair with NER.
- Skipping LLM validation — unusual PII variants (e.g., extra spaces inside a number) slip through. Run one more pass.
- Zero human spot checks — automation's last blind spot. Even 100 samples is better than none.
- Labeling before measuring IAA — starting full labeling with a κ < 0.4 definition means throwing away all the work.
- Not recording the data hash — you lose the trained model → dataset link. Painful when a recall or retraining is needed.
- 100% synthetic labels — zero human review is risky. Even 5–10% human review helps.
- Not checking the Teacher API ToS — OpenAI's ToS, for example, prohibits using API output to train competing models.
- Re-identifiable PII after masking — even with [PHONE] in place, context (name + timestamp) can still identify someone. Consider k-anonymity or other additional techniques.
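The re-identification risk in the last item can be checked with a simple k-anonymity count: group records by their quasi-identifiers and look for groups smaller than k. The sketch below assumes records are dicts; the field names and the threshold of 5 are hypothetical.

```python
from collections import Counter

def _group_sizes(records, quasi_identifiers):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group, i.e. the dataset's k (0 if empty)."""
    groups = _group_sizes(records, quasi_identifiers)
    return min(groups.values()) if groups else 0

def risky_records(records, quasi_identifiers, k=5):
    """Records whose quasi-identifier combination appears fewer than k times."""
    groups = _group_sizes(records, quasi_identifiers)
    return [r for r in records
            if groups[tuple(r[q] for q in quasi_identifiers)] < k]
```

Records returned by `risky_records` are candidates for coarsening (for example, rounding timestamps to the hour) or removal.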
7. Operational Checklist
Data pipeline gates:
- PII masking 4 stages (regex → NER → LLM → human spot check)
- Synthetic label cost compared against human labeling cost
- IAA κ ≥ 0.6 (label definition passes)
- Data hash + metadata file
- DVC or simple git LFS
- Legal review of Teacher API ToS
- Review remaining re-identifiable information (k-anonymity or case review)
- Pipeline automation (regex → NER → filter → metadata in one pass)
8. Exercises
- Apply the 4-stage PII masking from §2 to 100 raw sentences from your domain. Measure the fraction of PII caught.
- Have annotator A (you) and annotator B (Haiku) label the same 100 sentences. Measure κ.
- Find cases where κ < 0.6 — where does the disagreement come from? Rewrite the label definition.
- Write a metadata file for a synthetic 5K-pair dataset (hash + definitions + IAA).
- (Think about it) Is "99% PII masking" safe? What happens when that remaining 1% gets recovered?
References
- Cohen (1960). A Coefficient of Agreement for Nominal Scales. — κ definition
- HuggingFace `datasets` data versioning
- DVC (Data Version Control) docs
- "Designing Machine Learning Systems" (Chip Huyen) — data pipeline chapter