Why Models¶

What you'll learn

The intuition for telling rule-shaped problems apart from model-shaped problems
The three criteria OpenAI recommends checking before you reach for an LLM
A hands-on demo of where rules silently break — and what a model picks up

1. A small experiment¶

Read the five user messages below. Goal: classify each as a refund request or not.

#	Message	Refund?
1	"I want a refund."	✅
2	"Give me my money back."	✅
3	"This isn't at all what I expected. What should I do?"	✅ (implicit)
4	"What is your refund policy?"	❌ (info request)
5	"My friend told me they got a refund."	❌ (just chatter)

Question: can a one-line rule like if "refund" in message get all five right?

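Not as written. The one-line rule gets only #1 right: #2 never uses the word "refund", and #4 and #5 contain it without being requests. A second keyword ("money back") and a couple of exclusions can patch 1, 2, 4, and 5. #3 stays out of reach because no refund-related word appears in it at all.
Each patch adds lines, and every new phrasing needs a fresh one. Dozens of rules later, the list is still incomplete.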
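To make the patching concrete, here is a minimal sketch of the one-line rule after two rounds of fixes. The phrase lists and the name `looks_like_refund_request` are illustrative assumptions, not a recommended ruleset.

```python
# Hypothetical patched rule; both phrase lists are illustrative only.
REQUEST_PHRASES = ["refund", "money back"]
# Exclusions added only to reject messages 4 and 5.
EXCLUDE_PHRASES = ["refund policy", "told me", "got a refund"]

def looks_like_refund_request(message: str) -> bool:
    text = message.lower()
    # Exclusions win over keywords.
    if any(p in text for p in EXCLUDE_PHRASES):
        return False
    return any(p in text for p in REQUEST_PHRASES)

# (message, expected label) pairs taken from the table above.
examples = [
    ("I want a refund.", True),
    ("Give me my money back.", True),
    ("This isn't at all what I expected. What should I do?", True),
    ("What is your refund policy?", False),
    ("My friend told me they got a refund.", False),
]

# True where the rule agrees with the expected label.
results = [looks_like_refund_request(m) == expected for m, expected in examples]
for (m, _), ok in zip(examples, results):
    print("ok  " if ok else "FAIL", m)
```

Message 3 still fails, and the exclusions create new risks: a message such as "I read your refund policy and I want a refund" would now be rejected.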

That's the smell of "this should be a model, not rules."

2. Rules vs. models — the real difference¶

Two approaches

Top: the author has to enumerate every case. Bottom: the model infers cases it has never seen.

	Rules (code)	Model (LLM)
Mental model	"If this condition, then this result"	"Given this context, what's the most likely result"
Strengths	Predictable · free · fast · auditable	Handles unstructured input · understands new phrasings without training
Weaknesses	Every new phrasing needs a new rule · maintenance cost grows	Probabilistic · cost · latency · harder to verify
How it fails	Silent miss	Confident wrong answer (hallucination)
Debug	Stack traces, logs	Prompts, examples, eval sets

3. Three criteria for reaching for a model¶

OpenAI's engineering guidance names three triggers. Hit at least one, and a model is worth considering. Hit none, and you should write code instead.

Complex judgment¶

Decisions that rules can't fully cover — exceptions, context-sensitive calls, fuzzy boundaries.

Examples: customer-service refund approvals, insurance claim review, anomalous-transaction detection.

Hard-to-maintain rules¶

A ruleset that's so large or fragile that updates cost more than they're worth.

Examples: vendor security review checklists with hundreds of lines and dozens of edge cases.

Unstructured data¶

The work fundamentally requires interpreting natural language — pulling meaning from documents, holding conversations.

Examples: insurance claims, internal knowledge search, customer-support summarization.

The bar

Just one criterion clearly satisfied → go. None clearly met → stop. A model dropped in for the wrong reasons usually performs worse than the rules it replaced.
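The bar can be written down as a tiny checklist. This is only a sketch: the class name `CriteriaCheck`, its field names, and the `tax_rate_lookup` example are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class CriteriaCheck:
    """One yes/no answer per criterion (hypothetical helper)."""
    complex_judgment: bool
    hard_to_maintain_rules: bool
    unstructured_data: bool

    def model_worth_considering(self) -> bool:
        # At least one criterion clearly met -> go; none -> stop.
        return any((
            self.complex_judgment,
            self.hard_to_maintain_rules,
            self.unstructured_data,
        ))

# Ticket routing by content: criteria 1 and 2 apply.
ticket_routing = CriteriaCheck(True, True, False)
# Hypothetical fixed lookup: no criterion applies, so write code instead.
tax_rate_lookup = CriteriaCheck(False, False, False)

print(ticket_routing.model_worth_considering())
print(tax_rate_lookup.model_worth_considering())
```

Filling in the three booleans honestly is the real work; the code only keeps the decision visible in a review.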

4. Where assistants actually pay off¶

Apply the criteria to real work:

Scenario	Why a model	Criteria
User intent classification (refund request vs policy question vs other)	Many phrasings	③ unstructured
Document QA ("which page covers our security policy?")	Document understanding	③ + ② large ruleset
Meeting summary → action items	Context-sensitive	① + ③
CS email reply drafts	Tone and situation	① + ③
Ticket routing by content	Many ambiguous cases	① + ②

Where this book starts on the technology ladder¶

Technology ladder

Cost and complexity climb left-to-right; so does what you can handle. Skipping straight to "agents" is almost always over-engineering.

5. Minimal example — watch a rule break¶

Plain Python, no installs.

rule_vs_intent.py
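# The five messages from the experiment in section 1.
messages = [
    "I want a refund.",
    "Give me my money back.",
    "This isn't at all what I expected. What should I do?",  # no keyword appears, yet it is a refund request
    "What is your refund policy?",
    "My friend told me they got a refund.",
]

# Substrings that the rule treats as evidence of a refund request.
KEYWORDS = ["refund", "money back", "return"]

def is_refund_request_by_rule(msg: str) -> bool:
    # Case-insensitive substring match against every keyword.
    return any(k in msg.lower() for k in KEYWORDS)

for m in messages:
    print(f"[{is_refund_request_by_rule(m)}]  {m}")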

The third message is the graveyard of rule-based intent detection: none of the keywords appear, but the meaning is clearly a refund request.

Output¶

#	Result	Message	Verdict
1	`True`	I want a refund.	✅
2	`True`	Give me my money back.	✅
3	`False`	This isn't at all what I expected. What should I do?	❌ miss
4	`True`	What is your refund policy?	❌ false positive
5	`True`	My friend told me they got a refund.	❌ false positive

One miss and two false positives out of five: the rule is right on only 2 of 5 messages. Not viable in production.
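The verdict column can be computed instead of eyeballed. The sketch below scores the same rule against the labels from section 1. The names `labeled`, `misses`, `false_positives`, and `correct` are assumptions introduced here.

```python
# (message, is_refund_request) pairs from the table in section 1.
labeled = [
    ("I want a refund.", True),
    ("Give me my money back.", True),
    ("This isn't at all what I expected. What should I do?", True),
    ("What is your refund policy?", False),
    ("My friend told me they got a refund.", False),
]

KEYWORDS = ["refund", "money back", "return"]

def is_refund_request_by_rule(msg: str) -> bool:
    return any(k in msg.lower() for k in KEYWORDS)

# Miss: a real request the rule rejected.
misses = sum(
    1 for msg, expected in labeled
    if expected and not is_refund_request_by_rule(msg)
)
# False positive: a non-request the rule accepted.
false_positives = sum(
    1 for msg, expected in labeled
    if not expected and is_refund_request_by_rule(msg)
)
correct = len(labeled) - misses - false_positives

print(f"correct={correct}/{len(labeled)} misses={misses} false_positives={false_positives}")
# prints: correct=2/5 misses=1 false_positives=2
```

Keeping misses and false positives as separate counts matters because they usually have different costs: a miss ignores a customer, a false positive triggers the wrong workflow.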

The same problem with an LLM (we'll run this for real in Part 2)¶

intent_by_llm.py
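# Requires: pip install anthropic, and ANTHROPIC_API_KEY set in the environment.
from anthropic import Anthropic

# Same five messages as in rule_vs_intent.py.
messages = [
    "I want a refund.",
    "Give me my money back.",
    "This isn't at all what I expected. What should I do?",
    "What is your refund policy?",
    "My friend told me they got a refund.",
]

client = Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

SYSTEM = """Decide if the user message is a refund request.
Answer with one word: YES or NO."""

for m in messages:
    r = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=5,
        system=SYSTEM,
        messages=[{"role": "user", "content": m}],
    )
    # The reply is a list of content blocks; the first one holds the text.
    print(f"[{r.content[0].text.strip()}]  {m}")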

Expected output (a model's answers can vary from run to run):

#	Result	Message	Note
1	`YES`	I want a refund.
2	`YES`	Give me my money back.
3	`YES`	This isn't at all what I expected…	picks up the implicit intent
4	`NO`	What is your refund policy?	distinguishes info request
5	`NO`	My friend told me they got a refund.	distinguishes chatter

The point: without adding new rules, the model picks up phrasings you never wrote down.
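One practical wrinkle: even with a one-word instruction, a model may answer "Yes." or "no" instead of a bare YES or NO. A minimal sketch of defensive parsing follows. The helper name `parse_yes_no` is an assumption, and returning `None` for anything unexpected is one possible design, not the only one.

```python
from typing import Optional

def parse_yes_no(raw: str) -> Optional[bool]:
    """Map a model reply to True/False, or None if it is neither."""
    # Drop surrounding whitespace, then common trailing punctuation and quotes.
    token = raw.strip().strip(".!\"'").upper()
    if token == "YES":
        return True
    if token == "NO":
        return False
    # Anything else is treated as "no decision" rather than guessed at.
    return None

print(parse_yes_no("YES"))
print(parse_yes_no(" yes.\n"))
print(parse_yes_no("Maybe"))
```

Returning `None` lets the caller decide what to do with an unusable reply (retry, escalate to a human) instead of silently counting it as NO.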

6. Hands-on — does my work need a model?¶

A 30-minute self-diagnostic:

Write the problem in one sentence: "Given [input], decide [output]."
Collect 30 input/output examples, mixed evenly across normal, edge, and ambiguous cases.
Write the simplest possible rule — keyword matching, regex, lookup table.
How many of the 30 did it get right? Look at the distribution of mistakes, not just the average.
Bucket the failures:
A. "Adding one more rule fixes it" → keep using code
B. "There are too many phrasings to enumerate" → model candidate
C. "Even a human is unsure" → reconsider the problem definition itself
If bucket B is 30%+ of failures, a model is worth the cost.
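The bucketing step reduces to a small tally. In the sketch below, the list `failure_buckets` is made-up sample data and `bucket_shares` is a hypothetical helper; only the 30% threshold comes from the text above.

```python
from collections import Counter

# Hypothetical result of the bucketing step: one letter per failed example.
failure_buckets = ["B", "A", "B", "C", "B", "A", "B", "A"]

def bucket_shares(buckets):
    """Return the fraction of failures that fall in buckets A, B, and C."""
    total = len(buckets)
    if total == 0:
        return {b: 0.0 for b in "ABC"}
    counts = Counter(buckets)
    return {b: counts.get(b, 0) / total for b in "ABC"}

shares = bucket_shares(failure_buckets)
# Threshold from the diagnostic: bucket B at 30% or more of failures.
model_candidate = shares["B"] >= 0.30

for bucket in "ABC":
    print(f"{bucket}: {shares[bucket]:.0%}")
print("model worth the cost:", model_candidate)
```

With this sample data, bucket B holds 4 of 8 failures (50%), so the diagnostic points toward a model.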

Output

The output of this diagnostic is the seed of your eval set. Part 4 picks it up.

7. Common pitfalls¶

Mistake 1: model first, problem second

Reaching for a model "to look smart" turns a 5-line rule into a 200-line system. If none of the three criteria fire, no model.

Mistake 2: expecting the model to handle everything

Models are probabilistic: the same input can produce different outputs. Never let a model make the final call on anything that needs a deterministic guarantee, such as pricing, permissions, or payment processing.

Mistake 3: shipping without an eval

"The prompt looks good" ≠ "it works in production." Without the eval set from Part 4 you'll be arguing about vibes.

Mistake 4: avoiding the rule + model hybrid

Most production systems are rules + model. Filter the obvious cases with rules, send only the ambiguous ones to the model. Cheaper, faster, more stable.
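A minimal sketch of the hybrid pattern follows. The phrase lists, the `classify` function, and `fake_model` are all assumptions for illustration. `fake_model` is a deliberately crude offline stand-in so the sketch runs without an API key, and a real system would pass in a function that calls an LLM instead.

```python
from collections import Counter
from typing import Callable

# Hypothetical "obvious" phrases; keep these lists short and high-precision.
OBVIOUS_REQUEST = ["i want a refund", "money back"]
OBVIOUS_NON_REQUEST = ["refund policy"]

def classify(message: str, ask_model: Callable[[str], bool]) -> tuple[bool, str]:
    """Return (is_refund_request, handler), where handler is 'rule' or 'model'."""
    text = message.lower()
    if any(p in text for p in OBVIOUS_NON_REQUEST):
        return False, "rule"
    if any(p in text for p in OBVIOUS_REQUEST):
        return True, "rule"
    # Nothing obvious matched: defer to the model.
    return ask_model(message), "model"

def fake_model(message: str) -> bool:
    # Offline placeholder, NOT a real classifier.
    return "expected" in message.lower()

messages = [
    "I want a refund.",
    "Give me my money back.",
    "This isn't at all what I expected. What should I do?",
    "What is your refund policy?",
    "My friend told me they got a refund.",
]

results = [classify(m, fake_model) for m in messages]
# Hybrid ratio: how many inputs each path handled.
handled_by = Counter(handler for _, handler in results)
print(dict(handled_by))
```

Passing the model in as a plain callable keeps the rule layer testable on its own, and the `handled_by` counter gives the rules-versus-model split for free.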

8. Production checklist¶

Monthly review: are we using a model where code would do?
Reverse review: are we still using rules where a model would clearly help?
Track the hybrid ratio — what fraction of inputs is handled by rules vs the model?
Quarterly dashboard for model cost · latency · accuracy
Every new use case must clear the three-criteria checklist before it gets a model

9. Exercises¶

Pick three things in your current work that rules can't solve cleanly. For each, mark which of the three criteria applies.
Run §5's code. Record the rule's miss and false-positive counts.
Add 10 exception rules to the keyword list. Re-measure accuracy. Where does it stop helping?
From your own project, write one paragraph each on: (a) one feature that does not need a model, and (b) one feature where a model would clearly beat rules.

10. At a glance¶

Decision flow

Next → What is an LLM
If you're reaching for a model, you should know how the thing actually works.