Why Models¶
What you'll learn
- The intuition for telling rule-shaped problems apart from model-shaped problems
- The three criteria OpenAI recommends checking before you reach for an LLM
- A hands-on demo of where rules silently break — and what a model picks up
1. A small experiment¶
Read the five user messages below. Goal: classify each as a refund request or not.
| # | Message | Refund? |
|---|---|---|
| 1 | "I want a refund." | ✅ |
| 2 | "Give me my money back." | ✅ |
| 3 | "This isn't at all what I expected. What should I do?" | ✅ (implicit) |
| 4 | "What is your refund policy?" | ❌ (info request) |
| 5 | "My friend told me they got a refund." | ❌ (just chatter) |
Question: can a one-line rule like if "refund" in message get all five right?
- 1, 2, 4, and 5 are doable somehow. #3 isn't — the word "refund" doesn't appear at all.
- Add more rules to filter out 4 and 5 and you start writing dozens of lines, plus a fresh one every time a new phrasing shows up.
That's the smell of "this should be a model, not rules."
2. Rules vs. models — the real difference¶
Top: the author has to enumerate every case. Bottom: the model infers cases it has never seen.
| Rules (code) | Model (LLM) | |
|---|---|---|
| Mental model | "If this condition, then this result" | "Given this context, what's the most likely result" |
| Strengths | Predictable · free · fast · auditable | Handles unstructured input · understands new phrasings without training |
| Weaknesses | Every new phrasing needs a new rule · maintenance cost grows | Probabilistic · cost · latency · harder to verify |
| How it fails | Silent miss | Confident wrong answer (hallucination) |
| Debug | Stack traces, logs | Prompts, examples, eval sets |
3. Three criteria for reaching for a model¶
Anthropic and OpenAI's engineering guides converge on the same three triggers. Hit at least one, and a model is worth considering. Hit none, and you should write code instead.
Complex judgment¶
Decisions that rules can't fully cover — exceptions, context-sensitive calls, fuzzy boundaries.
Examples: customer-service refund approvals, insurance claim review, anomalous-transaction detection.
Hard-to-maintain rules¶
A ruleset that's so large or fragile that updates cost more than they're worth.
Examples: vendor security review checklists with hundreds of lines and dozens of edge cases.
Unstructured data¶
The work fundamentally requires interpreting natural language — pulling meaning from documents, holding conversations.
Examples: insurance claims, internal knowledge search, customer-support summarization.
The bar
Just one criterion clearly satisfied → go. None clearly met → stop. A model dropped in for the wrong reasons usually performs worse than the rules it replaced.
4. Where assistants actually pay off¶
Apply the criteria to real work:
| Scenario | Why a model | Criteria |
|---|---|---|
| User intent classification (refund request vs policy question vs other) | Many phrasings | ③ unstructured |
| Document QA ("which page covers our security policy?") | Document understanding | ③ + ② large ruleset |
| Meeting summary → action items | Context-sensitive | ① + ③ |
| CS email reply drafts | Tone and situation | ① + ③ |
| Ticket routing by content | Many ambiguous cases | ① + ② |
Where this book starts on the technology ladder¶
Cost and complexity climb left-to-right; so does what you can handle. Skipping straight to "agents" is almost always over-engineering.
5. Minimal example — watch a rule break¶
Plain Python, no installs.
- The graveyard of rule-based intent detection. None of the keywords appear — but the meaning is clearly a refund request.
Output¶
| # | Result | Message | Verdict |
|---|---|---|---|
| 1 | True |
I want a refund. | ✅ |
| 2 | True |
Give me my money back. | ✅ |
| 3 | False |
This isn't at all what I expected. What should I do? | ❌ miss |
| 4 | True |
What is your refund policy? | ❌ false positive |
| 5 | True |
My friend told me they got a refund. | ❌ false positive |
One miss and two false positives out of five. Production-unviable.
The same problem with an LLM (we'll run this for real in Part 2)¶
Expected output:
| # | Result | Message | Note |
|---|---|---|---|
| 1 | YES |
I want a refund. | |
| 2 | YES |
Give me my money back. | |
| 3 | YES |
This isn't at all what I expected… | picks up the implicit intent |
| 4 | NO |
What is your refund policy? | distinguishes info request |
| 5 | NO |
My friend told me they got a refund. | distinguishes chatter |
The point: without adding new rules, the model picks up phrasings you never wrote down.
6. Hands-on — does my work need a model?¶
A 30-minute self-diagnostic:
- Write the problem in one sentence: "Given , decide ."
- Collect 30 input/output examples, mixed evenly across normal, edge, and ambiguous cases.
- Write the simplest possible rule — keyword matching, regex, lookup table.
- How many of the 30 did it get right? Look at the distribution of mistakes, not just the average.
- Bucket the failures:
- A. "Adding one more rule fixes it" → keep using code
- B. "There are too many phrasings to enumerate" → model candidate
- C. "Even a human is unsure" → reconsider the problem definition itself
- If bucket B is 30%+ of failures, a model is worth the cost.
Output
The output of this diagnostic is the seed of your eval set. Part 4 picks it up.
7. Common pitfalls¶
Mistake 1: model first, problem second
Reaching for a model "to look smart" turns a 5-line rule into a 200-line system. If none of the three criteria fire, no model.
Mistake 2: expecting the model to handle everything
Models are probabilistic. Same input, different output is allowed. Never put a model behind anything that needs a deterministic guarantee — pricing, permissions, payment processing.
Mistake 3: shipping without an eval
"The prompt looks good" ≠ "it works in production." Without the eval set from Part 4 you'll be arguing about vibes.
Mistake 4: avoiding the rule + model hybrid
Most production systems are rules + model. Filter the obvious cases with rules, send only the ambiguous ones to the model. Cheaper, faster, more stable.
8. Production checklist¶
- Monthly review: are we using a model where code would do?
- Reverse review: are we still using rules where a model would clearly help?
- Track the hybrid ratio — what fraction of inputs is handled by rules vs the model?
- Quarterly dashboard for model cost · latency · accuracy
- Every new use case must clear the three-criteria checklist before it gets a model
9. Exercises¶
- Pick three things in your current work that rules can't solve cleanly. For each, mark which of the three criteria applies.
- Run §5's code. Record the rule's miss and false-positive counts.
- Add 10 exception rules to the keyword list. Re-measure accuracy. Where does it stop helping?
- From your own project, write one paragraph each on: (a) one feature that does not need a model, and (b) one feature where a model would clearly beat rules.
10. At a glance¶
Next → What is an LLM If you're reaching for a model, you should know how the thing actually works.