Monitoring · Feedback Loop · Cost

What you'll learn

Quality signals — refusal rate, re-ask rate, hallucination, self-consistency
Drift — detecting shifts in input distribution (KL divergence)
Feedback loop — user corrections flowing into the next training cycle
Cost model — GPU time / API tokens / labeling / license / 1-year ROI
Rollback strategy — adapter swap in under 30 seconds

Prerequisites

Ch 30 Regression Eval, Ch 31 Serving. This chapter is the final gate of production operations.

Quarterly production cycle

1. Concept — The Production Cycle

Deploy → Monitor → Detect signal → Collect data → Train → Evaluate → Deploy
                                                                ↑ ── quarterly (3 months) ── ↓

One cycle per quarter. Each stage is driven by automated signals.

2. Quality Signals — 4 Types

(1) Refusal rate

The fraction of responses in which the model says "I don't know" or declines to answer. A sudden spike usually points to an alignment regression or a rise in out-of-distribution (OOD) input.

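def is_refusal(text):
    # Korean ("I'm not sure", "cannot help you") and English refusal phrases
    keywords = ["잘 모르겠", "도와드릴 수 없", "I don't know", "I cannot"]
    return any(k in text for k in keywords)

# `responses` is the list of response strings for the period being measured
refusal_rate = sum(is_refusal(r) for r in responses) / max(len(responses), 1)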
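To act on "a sudden spike", the daily refusal rate can be compared against a trailing baseline. The following is a minimal sketch, assuming daily rates have already been computed. The 7-day window, the 2x factor, and the 1% floor are illustrative assumptions, not values from this chapter.

```python
from statistics import mean

def refusal_spike(daily_rates, window=7, factor=2.0, floor=0.01):
    """Return True if the latest daily rate is well above the trailing mean.

    `window`, `factor`, and `floor` are illustrative defaults (assumptions).
    """
    if len(daily_rates) < window + 1:
        return False  # not enough history to form a baseline
    baseline = mean(daily_rates[-(window + 1):-1])  # the `window` days before the latest
    latest = daily_rates[-1]
    # `floor` avoids alerting on tiny absolute rates when the baseline is near zero
    return latest > max(factor * baseline, floor)

history = [0.02, 0.03, 0.02, 0.02, 0.03, 0.02, 0.02, 0.09]
print(refusal_spike(history))
```

A ratio against a trailing mean is only one possible rule; a fixed absolute threshold may suit low-traffic products better.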

(2) Re-ask rate

The same user submits a similar question again within 30 seconds — a signal that the previous answer was unsatisfactory.
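One way to turn this definition into a number is sketched below. The record fields (`user`, `ts`, `text`) and the 0.6 similarity cutoff are assumptions for illustration, and `difflib` string similarity stands in for whatever semantic similarity measure you actually use.

```python
from difflib import SequenceMatcher

def is_reask(prev, curr, window_s=30, min_sim=0.6):
    """True if `curr` looks like a rephrasing of `prev` by the same user within `window_s`."""
    same_user = prev["user"] == curr["user"]
    in_window = 0 <= curr["ts"] - prev["ts"] <= window_s
    sim = SequenceMatcher(None, prev["text"].lower(), curr["text"].lower()).ratio()
    return same_user and in_window and sim >= min_sim

def reask_rate(events):
    """Fraction of questions that are re-asks; `events` must be sorted by timestamp."""
    if not events:
        return 0.0
    last_by_user = {}
    reasks = 0
    for ev in events:
        prev = last_by_user.get(ev["user"])
        if prev is not None and is_reask(prev, ev):
            reasks += 1
        last_by_user[ev["user"]] = ev  # compare each question with the user's previous one
    return reasks / len(events)

events = [
    {"user": "u1", "ts": 0, "text": "How do I reset my password?"},
    {"user": "u1", "ts": 12, "text": "How can I reset my password"},
    {"user": "u2", "ts": 15, "text": "What are your opening hours?"},
    {"user": "u1", "ts": 300, "text": "How do I reset my password?"},
]
print(reask_rate(events))
```

In this toy log only the second question counts as a re-ask: the fourth repeats the wording but falls outside the 30-second window.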

(3) Hallucination signal

Self-consistency: sample an answer to the same question 5 times at a non-zero temperature. If the answers diverge, the model is likely guessing.

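from itertools import combinations
from statistics import mean

def self_consistency(model, q, similarity, k=5):
    # `run` (generation) and `similarity` (e.g. BERTScore) are supplied by the caller's code
    answers = [run(model, q, temperature=0.7) for _ in range(k)]
    similarities = [similarity(a, b) for a, b in combinations(answers, 2)]
    return mean(similarities)    # low score → hallucination risk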
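To make the scoring step concrete without a model or a BERTScore dependency, here is a minimal sketch that scores a list of already-sampled answers. The token-overlap (Jaccard) similarity is a crude stand-in chosen for illustration, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean

def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a crude stand-in for BERTScore."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def consistency_score(answers, similarity=jaccard):
    """Mean pairwise similarity across sampled answers (needs at least two)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two answers")
    return mean(similarity(a, b) for a, b in pairs)

stable = ["Paris is the capital", "Paris is the capital", "Paris is the capital"]
unstable = ["Paris is the capital", "It is Lyon", "Probably Marseille"]
print(consistency_score(stable), consistency_score(unstable))
```

Identical answers score 1.0 and divergent ones score close to 0. Word overlap penalizes harmless paraphrases, so an embedding-based scorer is usually a better fit in practice.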

(4) User satisfaction (thumbs up/down)

Explicit feedback in the UI. It is the most direct signal, but only 1–5% of responses receive a rating, so the sample is small and noisy.
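A confidence interval shows how much a low rating rate limits what the thumbs-up ratio can tell you. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something this chapter prescribes. The vote counts are made up.

```python
import math

def wilson_interval(ups, total, z=1.96):
    """Approximate 95% Wilson score interval for the thumbs-up proportion."""
    if total == 0:
        return (0.0, 1.0)  # no votes: the proportion is unconstrained
    p = ups / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical week: 10,000 responses at a 2% rating rate leaves about 200 votes
low, high = wilson_interval(ups=150, total=200)
print(f"{low:.2f} to {high:.2f}")
```

With 200 votes the interval spans more than ten percentage points, so week-to-week movements of a few points in the satisfaction ratio should not be read as real changes.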

3. Drift — Input Distribution Shift

The gap between what the model saw during training and what it sees in production.

drift.py
from collections import Counter
import math

def drift_kl(train_tokens, prod_tokens, vocab_size):
    """Approximate KL(prod || train) in nats from add-one-smoothed token frequencies."""
    train_counts = Counter(train_tokens)
    prod_counts = Counter(prod_tokens)
    train_total = sum(train_counts.values())
    prod_total = sum(prod_counts.values())
    kl = 0.0
    # Only tokens observed in either corpus are summed; tokens unseen in both are skipped
    for tok in set(train_counts) | set(prod_counts):
        p = (prod_counts[tok] + 1) / (prod_total + vocab_size)
        q = (train_counts[tok] + 1) / (train_total + vocab_size)
        kl += p * math.log(p / q)    # add-one smoothing keeps p and q above zero
    return kl

KL (nats)	Interpretation
< 0.1	Distributions match
0.1–0.5	Slight shift
> 0.5	Drift — consider retraining

The fairy-tale model in this book drifts slowly (few new words). A domain like a call center drifts fast — new products and policy changes shift the distribution quickly.
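The bands in the table can be applied mechanically. The toy example below is a sketch under stated assumptions: it sums over a small explicit vocabulary (so every token is counted), and the token lists are invented to contrast a fairy-tale corpus with call-center style input.

```python
import math
from collections import Counter

def smoothed_kl(prod_tokens, train_tokens, vocab):
    """KL(prod || train) in nats with add-one smoothing over an explicit vocabulary."""
    p_counts, q_counts = Counter(prod_tokens), Counter(train_tokens)
    p_total = len(prod_tokens) + len(vocab)
    q_total = len(train_tokens) + len(vocab)
    kl = 0.0
    for tok in vocab:
        p = (p_counts[tok] + 1) / p_total
        q = (q_counts[tok] + 1) / q_total
        kl += p * math.log(p / q)
    return kl

def drift_band(kl):
    """Map a KL value to the bands in the table."""
    if kl < 0.1:
        return "match"
    if kl <= 0.5:
        return "slight shift"
    return "drift"

vocab = ["once", "upon", "a", "time", "refund", "invoice"]
train = ["once", "upon", "a", "time"] * 50
same = ["once", "upon", "a", "time"] * 20
shifted = ["refund", "invoice"] * 40 + ["a"] * 20
print(drift_band(smoothed_kl(same, train, vocab)))
print(drift_band(smoothed_kl(shifted, train, vocab)))
```

The first comparison lands in the "match" band and the second in "drift". On real data the thresholds are rules of thumb and are worth recalibrating against a week you know to be normal.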

4. Feedback Loop — Production → Training

User corrections and refusals feed into the next training cycle.

Production logs → extract refusals, re-asks, thumbs-down cases
        ↓
PII masking (Ch 29) + inter-annotator agreement (IAA) validation
        ↓
Added to next cycle's training data + regression suite
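The first two steps of this flow can be sketched as a filter over log records. Everything below is hypothetical: the record fields (`refusal`, `reask`, `thumbs_down`), the output schema, and the email-only masking, which is a placeholder for the fuller PII pipeline from Ch 29.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_pii(text):
    """Placeholder masking: emails only. A real pipeline covers far more."""
    return EMAIL.sub("[EMAIL]", text)

def select_feedback(logs):
    """Keep records flagged as refusal, re-ask, or thumbs-down, with text masked."""
    selected = []
    for rec in logs:
        reasons = [k for k in ("refusal", "reask", "thumbs_down") if rec.get(k)]
        if reasons:
            selected.append({
                "prompt": mask_pii(rec["prompt"]),
                "response": mask_pii(rec["response"]),
                "reasons": reasons,
                "label": None,  # filled in by annotators before training
            })
    return selected

logs = [
    {"prompt": "Mail me at kim@example.com", "response": "I cannot help.", "refusal": True},
    {"prompt": "Tell me a story", "response": "Once upon a time...", "thumbs_down": False},
    {"prompt": "Wrong answer?", "response": "The moon is cheese.", "thumbs_down": True},
]
for item in select_feedback(logs):
    print(item["reasons"], item["prompt"])
```

Keeping `label` empty at this stage leaves room for the annotation and agreement check before anything reaches the training set or the regression suite.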

Quarterly cycle (3 months):

Timing	Work
Week 1	New model canary (Ch 30)
Week 2	A/B ramp to 100%
Month 1	Monitoring + feedback collection
Month 2	Label refusal/re-ask cases + IAA
Month 3	New training data + train + evaluate
(Next cycle week 1)	New model canary

5. Cost Model — 1-Year ROI

Hypothetical operating cost for this book's capstone model (Qwen 0.5B + LoRA):

Item	Cost (monthly)
GPU serving (T4 via vLLM)	$200
Quarterly retraining (Colab Pro)	$50 / quarter
Labeling + synthesis (Haiku)	$50
License	$0 (Apache)
Monitoring infra (Grafana, etc.)	$50
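Total	~$317/month ≈ $3,800/year (retraining counted as $200/year)

Compare that to calling the Claude Sonnet API for the same workload:

100,000 calls/month × avg 1,500 tokens × $3/M = $450/month → $5,400/year

This assumes a flat $3 per million tokens. Output tokens are usually priced higher than input tokens, so the real API bill would be somewhat larger.

→ Self-hosted saves about $1,600/year. The one-time training cost (time through the capstone) is separate.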

A positive ROI depends on high traffic, PII requirements (you can't send data to an external API), or latency needs (sub-100ms rules out API round-trips).
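The comparison can be checked, and adapted to your own numbers, with a few lines of arithmetic. This sketch takes the line items from the table, counts retraining as four $50 runs per year, and assumes the self-hosted cost stays fixed as traffic changes, which is a simplification. The function names are hypothetical.

```python
def self_hosted_annual(gpu=200, labeling=50, monitoring=50,
                       retrain_per_quarter=50, license_fee=0):
    """Annual self-hosted cost in USD from the table's line items."""
    monthly = gpu + labeling + monitoring + license_fee
    return 12 * monthly + 4 * retrain_per_quarter

def api_annual(calls_per_month=100_000, tokens_per_call=1_500, usd_per_million=3.0):
    """Annual API cost in USD at a flat per-token price."""
    monthly = calls_per_month * tokens_per_call / 1_000_000 * usd_per_million
    return 12 * monthly

def break_even_calls(tokens_per_call=1_500, usd_per_million=3.0, **costs):
    """Monthly call volume at which API spend equals self-hosted spend."""
    per_call = tokens_per_call / 1_000_000 * usd_per_million
    return self_hosted_annual(**costs) / 12 / per_call

print(self_hosted_annual(), api_annual(), round(break_even_calls()))
```

Under these assumptions the line items sum to $3,800 per year against $5,400 for the API, and the two meet at roughly 70,000 calls per month. Below that volume, the case for self-hosting has to rest on the PII or latency requirements.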

6. Rollback Strategy — Under 30 Seconds

When a production problem is detected:

Action	Time	Impact
Adapter swap (LoRA)	30 s	Zero downtime
Container restart	1 min	Brief 5xx
Base model replacement	10 min	Short downtime

This is where keeping LoRA adapters separate from the base model, the decision you made in Ch 24 and Ch 26, pays off.

rollback.py
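# Roll back from adapter v2 to v1
client.set_lora("adapter_v1")        # hypothetical wrapper, not a built-in vLLM call
# With vLLM's OpenAI-compatible server, each request's `model` field names the adapter,
# so a rollback can also mean rerouting all traffic back to "adapter_v1"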
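The rerouting option can be sketched as a small piece of state in front of the serving layer. `AdapterRouter` is a hypothetical class, not part of vLLM. It assumes both adapters are already loaded and that the serving layer picks the adapter per request by name.

```python
import threading

class AdapterRouter:
    """Holds the adapter name that new requests should use; swaps are atomic."""

    def __init__(self, active, known):
        self._lock = threading.Lock()
        self._known = set(known)
        self._active = active
        self._previous = None

    def current(self):
        with self._lock:
            return self._active

    def promote(self, name):
        """Switch traffic to `name`, remembering the adapter it replaces."""
        if name not in self._known:
            raise ValueError(f"unknown adapter: {name}")
        with self._lock:
            self._previous, self._active = self._active, name

    def rollback(self):
        """Return traffic to the previously active adapter."""
        with self._lock:
            if self._previous is None:
                raise RuntimeError("nothing to roll back to")
            self._active, self._previous = self._previous, None
            return self._active

router = AdapterRouter(active="adapter_v1", known={"adapter_v1", "adapter_v2"})
router.promote("adapter_v2")
print(router.rollback())
```

Each request handler would read `router.current()` and pass that name to the serving layer. Because nothing is reloaded, the swap itself is nearly instant, and most of the 30 seconds goes to noticing the problem and deciding to act.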

7. Common Failure Modes

Deploying without monitoring — users close the tab and there's no signal. You need active measurement.
Not measuring drift — perplexity suddenly explodes 6 months in. Measure drift weekly.
No automated feedback loop — a person reviewing logs every week will eventually stop. Automate extraction + re-run IAA.
Not tracking costs — assuming self-hosted is cheaper than an API. In reality: GPU cost + training cost + labeling cost.
No rollback strategy: a bad new model forces a base model redeployment and about 10 minutes of downtime. An adapter swap takes 30 seconds.
Quarterly cycle stretched to 6 months: the model falls behind changes in products, policies, and user input. This book uses 3 months as the default cadence.
Judging hallucination by self-consistency alone — the model can be consistently wrong. You still need factual verification.

8. Operational Checklist — Year-End Graduation Gate

By the end of this book, your deployment should have all of the following:

Dashboard for 4 signals: refusal rate, re-ask rate, hallucination, thumbs up/down
KL drift measured weekly
Automated feedback loop (logs → next training data)
1-year cost model + ROI calculation
30-second rollback (adapter swap)
One complete quarterly production cycle
New model canary + A/B (Ch 30) + ramp
Regression / OOD / adversarial CI automation

9. Exercises

Simulate one week of production logs for your model. Build a dashboard of the 4 signals.
Measure KL divergence between your training data and simulated production input.
Automate the feedback loop — take 100 refusal cases and add them to the next training dataset.
Write a 1-year cost model for your domain (self-hosted vs. API).
(Think about it) Is "retrain every quarter" always right? What reasons might you have to skip a retraining cycle?

Part 8 Wrap-up

Chapter	What it covers
Ch 29	Data pipeline (PII, synthetic labels, IAA)
Ch 30	Regression, OOD, adversarial, A/B
Ch 31	Serving (llama.cpp / vLLM / latency budget)
Ch 32	Monitoring, feedback loop, cost, rollback

Book Graduation — 32 Chapters Complete

Reaching this point means:

Built a 10M-parameter SLM from scratch (Parts 1–6)
Picked an existing sLLM and applied LoRA (Part 7)
Capable of production operations: PII, evaluation, serving, monitoring (Part 8)
Ready to publish your model to HuggingFace Hub (Capstone)

You can now answer the original question, "why do open-weight models come in different sizes?", because you built the reasons yourself.

Next → Capstone. Run the full cycle one more time and put your own model out into the world.
