Building a Tiny Benchmark¶
What you'll learn
- HellaSwag-tiny — a mini version of a big benchmark to measure this book's model
- Domain probes — writing 30~50 evaluation items tailored to your domain
- pass@k — does at least one out of k attempts succeed?
- LLM-as-judge — the pitfalls and proper use of automated evaluation
Prerequisites
The 5-axis evaluation from Ch 16 Beyond Perplexity. The premise that PPL alone isn't enough.
1. What benchmarks actually measure¶
PPL is the exponential of the average per-token loss, so it summarizes how well the model predicts text overall. Benchmarks measure specific capabilities.
| Benchmark | Measures | Format |
|---|---|---|
| HellaSwag | commonsense reasoning (predicting what happens next) | 4-choice |
| MMLU | general knowledge | 4-choice |
| HumanEval | code generation | write function + pass tests |
| TriviaQA | factual knowledge | short answer |
| (your domain) | your use case | whatever fits |
This book's 10M story model won't get meaningful scores on any of those standard benchmarks — it's too small and too specialized. So we build a mini variant + domain probes instead.
2. Why you need this — complementing PPL¶
| Measurement | What it catches | This book's model |
|---|---|---|
| PPL | average token loss | 11.6 (Ch 16) |
| HellaSwag-tiny | commonsense reasoning | ? |
| Domain probe | "does it tell good stories?" | ? |
| pass@k | at least one success in multiple tries | ? |
Two models with the same PPL of 11.6 can still score 30% versus 50% on commonsense reasoning. That gap is decisive when choosing models, debugging, or tuning.
3. Three tools¶
Tool 1. Likelihood-based 4-choice (HellaSwag style)¶
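Compute the PPL of each option given the context and pick the option with the lowest PPL. No generation is needed, so evaluation is fast.
- Shift by 1: the logits at position t score the token at position t+1, as is standard for language modeling.
- The choice with the highest mean per-token logp (equivalently, the lowest PPL) is the model's answer.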
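The two points above can be sketched in plain Python. This is a minimal illustration, not the book's implementation: `logits_fn` is a hypothetical stand-in for the model's forward pass (token ids in, one row of logits per position out), and `toy_logits` is a made-up 5-token "model" used only so the example runs on its own.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def mean_logprob(logits, token_ids, ctx_len):
    """Mean log-probability of token_ids[ctx_len:] given the preceding tokens.

    Shift by 1: logits[t] is the prediction for token_ids[t + 1].
    """
    picked = [
        log_softmax(logits[t])[token_ids[t + 1]]
        for t in range(ctx_len - 1, len(token_ids) - 1)
    ]
    return sum(picked) / len(picked)

def pick_choice(logits_fn, ctx_ids, choices_ids):
    """Return (index of the best choice, per-choice mean logp)."""
    scores = []
    for choice in choices_ids:
        ids = list(ctx_ids) + list(choice)
        scores.append(mean_logprob(logits_fn(ids), ids, len(ctx_ids)))
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores

# Hypothetical stand-in for a model: it expects the next token
# to be (previous + 1) mod 5.
def toy_logits(ids, vocab=5):
    return [[5.0 if v == (tok + 1) % vocab else 0.0 for v in range(vocab)]
            for tok in ids]

best, scores = pick_choice(toy_logits, [0, 1], [[2, 3], [3, 2], [0, 0], [1, 1]])
ppl = [math.exp(-s) for s in scores]  # PPL = exp(-mean logp)
```

Averaging over the continuation tokens, rather than summing, keeps longer options from being penalized just for their length. Only the option tokens are scored; the context tokens are shared by all four choices.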
Tool 2. Domain probes — write them yourself¶
30 example items for this book's story model:
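As an illustration of what such items might look like, the sketch below uses three made-up probes and a simple keyword checker. Everything here is an assumption for demonstration: the evaluation code in this chapter only relies on the `prompt` and `type` keys and on a `check(out, p)` function, while the `expect_any` field and the keyword-matching rule are hypothetical.

```python
import re

# Hypothetical probe items. "expect_any" is an assumed field listing
# keywords that count as an acceptable continuation.
PROBES = [
    {"type": "emotion",
     "prompt": "Lily lost her favorite toy. She felt",
     "expect_any": ["sad", "upset", "unhappy"]},
    {"type": "name",
     "prompt": "Once upon a time, there was a little girl named",
     "expect_any": ["lily", "mia", "anna", "sara"]},
    {"type": "causality",
     "prompt": "It started to rain, so Tom opened his",
     "expect_any": ["umbrella"]},
]

def check(output: str, probe: dict) -> bool:
    """Pass if any expected keyword appears as a whole word in the output."""
    words = set(re.findall(r"[a-z']+", output.lower()))
    return any(kw in words for kw in probe["expect_any"])
```

Whole-word matching is a deliberate choice: a substring test would let "unhappy" satisfy a probe that expects "happy". Keyword checks are crude, so it is worth reading a sample of passes and failures by hand.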
Evaluation:
```python
def evaluate_probes(model, tok, probes, n=5):
    """Count a probe as correct if any of n sampled generations passes (pass@n)."""
    results = {"correct": 0, "total": len(probes), "by_type": {}}
    for p in probes:
        # generate() and check() are helpers that are not shown in this section.
        passes = 0
        for _ in range(n):  # pass@n
            out = generate(model, tok, p["prompt"], max_tokens=20)
            if check(out, p):
                passes += 1
        # Per-type tally, stored as [passed, total].
        by_type = results["by_type"].setdefault(p["type"], [0, 0])
        by_type[1] += 1
        if passes > 0:
            results["correct"] += 1
            by_type[0] += 1
    return results
```
Tool 3. pass@k¶
The standard for code evaluation. Try k times on the same problem — pass if at least one attempt succeeds.
Listing: pass_at_k.py
Stories don't have a single "correct answer" the way code does, but probes like "natural character name" or "reasonable emotion" work the same way.
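For reference, here is a minimal sketch of the pass@k estimator from the HumanEval paper (Chen et al., 2021): draw n samples per problem, count the c that pass, and estimate the chance that a random subset of k samples contains at least one pass. The function names are my own, and the chapter's `pass_at_k.py` may be organized differently.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(counts, k):
    """Average pass@k over problems; counts is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
```

With n equal to k this reduces to the simple "did any of the k attempts pass" rule. Sampling more than k attempts and using the estimator gives a lower-variance number for the same k.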
4. Minimal example — HellaSwag-tiny, 30 items¶
The real HellaSwag is designed for models trained on general English text. Here's a mini version for story models:
Expected result for this book's 10M story model:
Interpretation:
- Random chance: 25%
- 65%: the model can do commonsense reasoning within its training domain (TinyStories)
- The real HellaSwag (10K+ items, general text) would give the same model under 30%, because it is outside the training domain
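A mini benchmark of this kind could be laid out as below. The two items are invented for illustration and are not the book's actual set; the field names `ctx`, `endings`, and `label` are assumptions, and `score_fn` is a hypothetical callable that returns a higher number for a more likely ending (for example, the mean logp of the ending under the model).

```python
# Hypothetical items: a context, four endings, and the index of the
# correct ending.
ITEMS = [
    {"ctx": "Tom dropped his ice cream on the ground.",
     "endings": ["He started to cry.",
                 "He flew to the moon.",
                 "The ice cream sang a song.",
                 "He ate the ground."],
     "label": 0},
    {"ctx": "Mia was very tired after playing all day.",
     "endings": ["She painted the sun blue.",
                 "She went to bed early.",
                 "She became a tall tree.",
                 "She ran faster than a car."],
     "label": 1},
]

def accuracy(items, score_fn):
    """Fraction of items where the top-scoring ending is the labeled one."""
    correct = 0
    for item in items:
        scores = [score_fn(item["ctx"], e) for e in item["endings"]]
        pred = max(range(len(scores)), key=lambda i: scores[i])
        correct += int(pred == item["label"])
    return correct / len(items)

# A context-free baseline: always prefer the shortest ending.
shortest_baseline = accuracy(ITEMS, lambda ctx, ending: -len(ending))
```

Running a baseline that ignores the context is a cheap sanity check. If it scores well above 25%, the wrong endings are giving the answer away through length or style rather than through content.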
5. LLM-as-judge¶
Rating 30 probes by hand takes 30 minutes. Rating 100 items every week is not sustainable. Use an LLM as judge:
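One way to structure such a judge is sketched below. The rubric wording, the 1 to 5 scale, and the `SCORE:` reply format are all assumptions for illustration. `call_llm` is a hypothetical callable that sends a prompt to whichever judge model you use and returns its text reply; it is injected so the sketch does not depend on any particular API.

```python
import re

# Assumed rubric; adapt the criteria to your own domain.
RUBRIC = """You are grading a short children's story continuation.
Rate it from 1 (incoherent) to 5 (coherent, on-topic, well-formed).
Reply with the line "SCORE: <n>" and nothing else.

Prompt: {prompt}
Continuation: {output}
"""

def build_judge_prompt(prompt: str, output: str) -> str:
    """Fill the rubric template with one sample."""
    return RUBRIC.format(prompt=prompt, output=output)

def parse_score(reply: str):
    """Extract an integer score from 1 to 5, or None if the reply is malformed."""
    m = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(m.group(1)) if m else None

def judge(call_llm, prompt: str, output: str):
    """Score one sample; call_llm is a hypothetical prompt -> reply function."""
    return parse_score(call_llm(build_judge_prompt(prompt, output)))
```

Returning `None` for malformed replies, instead of guessing a score, makes it easy to count how often the judge fails to follow the format.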
LLM judge pitfalls¶
1. Self-bias — if the judge and the model being judged come from the same company, scores skew high. Use a different model family when possible.
2. Position bias — in A/B comparisons, the first option gets rated higher. Always swap randomly.
3. Length bias — longer answers score better. Check that length is balanced.
4. Cost — Haiku is about $0.0001 per call. 100 samples = $0.01. Sonnet is 10×.
5. Drift — even the same model version can give different scores at different times. Pin a model version for reproducibility.
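Position bias (pitfall 2) can also be handled with a stricter variant of random swapping: ask the judge in both orders and count a win only when the two verdicts agree. The sketch below assumes a hypothetical `judge_pair(first, second)` callable that returns the string `"first"` or `"second"`.

```python
def compare_with_swap(judge_pair, a, b):
    """Compare answers a and b in both orders.

    Returns "a" or "b" if the judge picks the same answer both times,
    and "tie" if the verdict flips with the presentation order.
    """
    verdict_ab = judge_pair(a, b)  # a is shown first
    verdict_ba = judge_pair(b, a)  # b is shown first
    winner_ab = "a" if verdict_ab == "first" else "b"
    winner_ba = "b" if verdict_ba == "first" else "a"
    return winner_ab if winner_ab == winner_ba else "tie"
```

This doubles the number of judge calls, but the share of ties is itself informative: a high tie rate suggests that the judge is reacting to position more than to content.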
6. Common failure points¶
1. Eval set overlaps with training set: if you use the same characters and keywords when generating synthetic data, the hold-out isn't actually held out. Generate the two sets with separate seeds and verify by hashing that no eval item appears in the training data.
2. Drawing conclusions from 30 items: statistically too weak. The 95% confidence interval is roughly ±15 to ±18 percentage points. Use 100 to 500 items.
3. Using greedy decoding for pass@k: the whole point of pass@k is diverse attempts. Sample with temperature 0.7 to 1.0.
4. No type breakdown on probes — you can't tell what's weak. Split by type and aggregate results per type.
5. Evaluating only at the end of training — also evaluate at intermediate checkpoints (e.g., step 4K, 8K, 12K) to see when performance saturates.
6. Using Claude to judge a model trained on Claude's synthetic data — self-bias.
7. Stopping at accuracy numbers — whether it's 65% or 80%, you still need to read some model outputs directly.
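Failure points 1 and 2 each reduce to a few lines of code. The sketch below is an assumption-laden illustration: the normalization rule (lowercase, collapsed whitespace) and the function names are my own, and the interval uses the normal approximation, which is rough for small samples.

```python
import hashlib
import math

def fingerprint(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_overlap(train_texts, eval_texts):
    """Return the eval texts whose fingerprint also occurs in the training set."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if fingerprint(t) in train_hashes]

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for accuracy p on n items."""
    return z * math.sqrt(p * (1 - p) / n)

# Half-widths at 65% accuracy: about 0.17 for 30 items, 0.09 for 100, 0.04 for 500.
widths = {n: ci_halfwidth(0.65, n) for n in (30, 100, 500)}
```

Exact hashing only catches duplicates that match after normalization. Paraphrases and shared templates need a fuzzier check, such as n-gram overlap.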
7. Evaluation checklist¶
For this book's model:
- Hold-out PPL (within training distribution)
- OOD PPL (outside training distribution) — measure the gap
- HellaSwag-tiny 30~100 items — accuracy
- Domain probes 30~50 items — split by type
- pass@5 — diversity check
- 5-axis human evaluation, 50 samples (Ch 16)
- LLM-as-judge, 100 samples (Haiku, with position swap)
- Mid-training evaluation at intermediate steps — find the saturation point
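The first two checklist items come down to the same computation on two different text sets. A minimal sketch, assuming you already have per-token negative log-likelihoods (in nats) for each set; expressing the gap as a ratio is my choice here, and a plain difference would work as well.

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood per token), with NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_gap(holdout_nlls, ood_nlls):
    """Ratio of OOD PPL to hold-out PPL; 1.0 means no degradation."""
    return perplexity(ood_nlls) / perplexity(holdout_nlls)
```

Computing the same ratio at each intermediate checkpoint shows whether out-of-distribution quality keeps improving after in-distribution PPL has flattened.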
8. Exercises¶
- Write 30 domain probes for your own use case. 5 categories × 6 prompts. Include expected outputs.
- Run HellaSwag-tiny 30 items on this book's 10M model and measure accuracy. How much above random chance (25%) did it score?
- Compare pass rate at n=1 (greedy) vs n=5 (pass@5) on domain probes. How big is the gap?
- Measure correlation (50 samples) between LLM judge (Haiku) and your own ratings. If r > 0.7, the judge is trustworthy.
- (Think about it) Probes with one correct answer vs probes with multiple valid answers — which type dominates in your domain? What kinds of domains are pass@k most meaningful for?
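For the correlation exercise, Pearson's r can be computed without any dependencies. This is a small sketch with a function name of my own choosing. Because 1 to 5 ratings are ordinal, a rank correlation such as Spearman's may be a better fit; Pearson is shown because the exercise refers to r.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)
```

A judge that gives nearly every sample the same score can look agreeable on average while carrying no signal, which is why constant ratings raise an error here instead of returning a number.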
References¶
- Zellers et al. (2019). HellaSwag. arXiv:1905.07830
- Hendrycks et al. (2020). MMLU. arXiv:2009.03300
- Chen et al. (2021). Codex / HumanEval. arXiv:2107.03374 — pass@k
- Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench. arXiv:2306.05685
- Anthropic. Building evaluations (blog)