Ch 32. When to Fine-Tune — and When Not To¶
What you'll learn
- Why "AI project = fine-tuning" is a trap
- A 4-gate decision tree — getting stuck at any gate means holding off
- Prompt · RAG · SFT · DPO — trade-offs, cost, and timeline for each
- Problems fine-tuning solves well vs. problems it doesn't
- Data requirements — what numbers like "hundreds" or "tens of thousands" actually mean
- Six critical pitfalls (skipping RAG · underestimating labeling cost · one-shot training · no eval set · injecting facts via SFT · starting with base models)
Prerequisites
Ch 31 covered the three training stages. What we can actually touch is SFT (Ch 33) and DPO (Ch 34) — both phases where data costs exceed GPU costs.
1. Concept — The right question is "do we even need fine-tuning"¶
Fine-tuning is the last card in most LLM projects. If you assume from day one that you'll fine-tune, you're signing up for:
- Months of data labeling
- Hundreds of thousands to millions in GPU costs
- Every base model upgrade forces you to retrain everything
- No eval set means you can't detect regressions
All that expense can vanish on a problem that Prompt + RAG alone would have solved.
| Gate | Pass if | If blocked |
|---|---|---|
| ① Does Prompt + RAG solve it? | "No" proven by eval set | Stick with Prompt/RAG |
| ② Do you have a solid eval set? | 1K+ (q, gold answer) pairs | Build eval first (Ch 16) |
| ③ Is labeling cost < savings? | 1-year ROI is positive | Remodel the math |
| ④ Can you operate retraining quarterly? | Pipeline + GPU budget + owner assigned | Use API + prompt instead |
You must pass all four gates before fine-tuning is even on the table. Even then, start with a PoC (small LoRA run) to measure impact before full training.
2. Why you need it — Two ceilings on Prompt and RAG¶
If you hit either of these ceilings, SFT becomes justified.
① Tone, style, and output format aren't stable. Prompt + few-shot gets you to 80%, but 5% of cases produce a different format. In production, that 5% is expensive. SFT is your best tool here.
② Model size hits the cost wall. Claude Opus solves it; Claude Haiku doesn't. Fine-tune Haiku to fit your domain, and the long-term cost savings dwarf the SFT cost. Chapter 30's model routing becomes more aggressive.
Conversely, injecting new facts is not an SFT problem — RAG is far more efficient. To push facts through fine-tuning means retraining on every update. That's a heavy toll.
"Fine-tuning teaches how, RAG provides what." — Industry conventional wisdom
3. Where they fit — Four approaches compared¶
| Approach | Strengths | Weaknesses | Cost | Time |
|---|---|---|---|---|
| ① Prompt | Format tweaks · single task | Domain facts · large-scale tone shift | $ | mins |
| ② RAG | Domain facts · freshness · citations | Tone · format enforcement | $$ | days |
| ③ SFT (LoRA) | Tone · output format · classification | New fact injection | $$$ | weeks |
| ④ DPO/RLHF | Safety · politeness · refusal policy | Data collection · overfitting | $$$$ | months |
Try in order: ①→②→③→④. Each move to the next level requires proving that the previous level hit its ceiling—and eval sets have to show it.
By problem type¶
| Problem | First try | Then |
|---|---|---|
| "Answer in our company voice" | ① Prompt + few-shot | ③ SFT if that stalls |
| "Answer from internal wiki" | ② RAG | ③ SFT when citation precision tops out |
| "Classify legal docs into 5 buckets" | ① Prompt → ③ SFT (Haiku) | Cost optimization play |
| "Rejection tone needs new policy" | ④ DPO (small batch) | SFT won't stabilize it |
| "Real-time stock price answers" | ② RAG (with tool calls) | Never ③—facts age too fast |
| "Stable JSON output format" | ① + tool-use (Ch 6) | ③ SFT if that doesn't land |
4. Minimal example — 10-point fine-tuning readiness checklist¶
- No eval set = can't measure impact. Chapter 16 first, always.
- Skipping RAG then jumping to SFT is 90% wasted effort.
- If fact accuracy is the issue, RAG solves it — not fine-tuning.
- Cost optimization SFT (fine-tuning Haiku for your domain) is a legitimate reason.
Below 7 yes answers? Fill those in first. Starting SFT at 6 risks expensive failure.
5. Hands-on — Data volume intuition¶
| Task type | Minimum | Recommended | Good results |
|---|---|---|---|
| Classification (10 classes) | 500 | 2K | 5K |
| Tone/style transfer | 1K | 5K | 20K |
| Domain-specific chat SFT | 5K | 20K | 100K |
| General instruction tuning | 50K | 200K | 1M+ |
| DPO (preferences) | 1K | 5K | 20K |
"When in doubt, start at 5K" is your safety baseline. Less than that, noise drowns signal.
Quality beats quantity¶
| Pitfall | What happens |
|---|---|
| One person labels everything | Their biases bake into the model |
| Auto-generated examples only | The model's own weaknesses amplify |
| 5% label noise | Training stalls (fatal in small datasets) |
| Eval set leaks into training | High validation scores, poor real-world performance |
Apply the labeling guidelines from Ch 16 at the same rigor to fine-tuning data.
PoC phase — 100–500 examples first¶
Before full training, run a PoC:
- Grab 100–500 real examples
- Fine-tune with LoRA (r=8, lightweight — see Ch 33)
- Eval on your test set: if baseline + 5 points better → go full scale
- If +5 points doesn't appear → more data won't fix it; revisit your approach
+5 points varies by domain, but the signal is what matters: "Do we see movement in the PoC phase?"
Cost model — 1-year ROI¶
SFT cost = labeling + GPU + quarterly eval/retrain
≈ (5K × $1) + ($500) + ($200/month × 12) ≈ $7,900
Annual savings = (Opus cost per token - Haiku cost per token) × yearly calls
Example: ($0.030 - $0.001) × 1M calls = $29,000
ROI = 29,000 / 7,900 ≈ 3.7 → positive. (Quality must stay equal.)
If ROI is below 1.0, don't do it. Add operational overhead and you need 1.5+ to feel safe.
6. Common failure modes¶
- Jumping to SFT without trying RAG first. Fact accuracy problems are RAG's sweet spot. Baking facts into fine-tuning is expensive and brittle on every update.
- Underestimating labeling cost. Making 5K (q,a) pairs isn't trivial — one labeler at 50/day needs 100 working days. Outsourcing runs $1–5 per pair.
- One-shot training, then static. Your base model ships a new version quarterly. Your SFT needs to retrain. Without infrastructure and ownership, it rots.
- No eval set. You train, metrics look good, then production burns. Eval sets come before training data.
- Injecting facts via SFT. Once a fact lands in weights, updates are locked behind retraining. Separate facts (RAG) from behavior (SFT).
- Starting with base models instead of instruct. Instruct models already have one SFT pass. You do domain tuning on top. Base models need 100x more data.
- Skipping PoC. You label 5K examples, train for days, gain 0 points. Big loss. Always 100-sample PoC first.
- Watching only training loss, not domain metrics. Final accuracy or F1 on your eval set is truth. Loss and perplexity are supporting signals.
7. Operations checklist¶
- Documented decision for all four gates (passed or blocked)
- Eval set 1K+ ready before labeling training data
- PoC (100–500 samples) shows signal before scaling
- Cost model (1-year ROI) is positive on the numbers
- Training data has zero overlap with eval set
- Multiple labelers + written guidelines + inter-rater agreement checks
- Quarterly retraining pipeline + GPU budget + named owner
- Base vs. instruct model choice documented
- Post-training: measure both domain accuracy and safety regressions
- Rollback-ready version control for model checkpoints
8. Exercises¶
- Your internal IT helpdesk chatbot gets 60% of "internal policy" questions right. Apply the four gates — what's your first move?
- You want answers in "official company voice." Rank: Prompt few-shot, RAG, SFT. Why that order?
- SFT costs $7,900. Annual calls = 100K (10% of 1M). Calculate ROI and decide: ship or no-ship?
- Your PoC (100 samples) hit baseline +1 point instead of +5. List three next moves.
Next → Ch 33. LoRA and QLoRA in Practice You've cleared the gates. Time to get your hands dirty on the first tractable training job.
Sources¶
- Stanford CME 295 — Transformers and LLMs Lec 4
- Hu et al. (2021) LoRA: Low-Rank Adaptation of Large Language Models
- OpenAI · Anthropic — fine-tuning best practice docs
- Industry guides: data collection · labeling · ROI modeling