Curriculum

AI Assistant Engineering — building, evaluating, and operating an AI assistant that reasons, calls tools, and improves itself. From beginner to enterprise production.

What this book stitches together

This book weaves three sources into one reading order: Stanford CME 295 (LLM theory), Stanford CS329A (the Self-Improving Agents research frontier), and the Anthropic, OpenAI, and LangGraph engineering guides. Each chapter ends with primary-source links. It is not a recap of those sources; it is the navigation through them.

1. Roadmap

7 parts + capstone

Why evaluation comes before agents

Part 4 (Evaluation) is placed before Part 5 (Agents) on purpose. It follows the recommendation in Anthropic's Building Effective Agents: start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.

2. Full table of contents (34 chapters)

Part 1: Foundations (3 chapters)

#	Chapter	Core	Reference
1	Why models	Rules vs. models. OpenAI's 3-criteria test (complex decisions, brittle rules, unstructured data)	OpenAI Practical Guide
2	What is an LLM	Tokens, context, next-token prediction, hallucinations	CME 295 Lec 1–3
3	Assistant system overview	Input → understand → retrieve → generate → verify → store → monitor → human handoff	—

Deliverable: glossary · code-vs-model decision table · assistant block diagram
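
The Chapter 3 flow (input → understand → retrieve → generate → verify → store → monitor → human handoff) can be sketched as a chain of small functions over one shared state object. This is an illustrative sketch only: the `Turn` fields, the `KNOWLEDGE` dictionary, and every stage body are hypothetical stand-ins, and the `generate` stage is a stub where a real assistant would call a model.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """State carried through one request. All fields are illustrative."""
    user_input: str
    intent: str = ""
    documents: list[str] = field(default_factory=list)
    answer: str = ""
    verified: bool = False
    needs_human: bool = False
    log: list[str] = field(default_factory=list)


# Hypothetical in-memory knowledge base standing in for a real retriever.
KNOWLEDGE = {"refund": "Refunds are processed within 5 business days."}


def understand(turn: Turn) -> Turn:
    # Toy intent detection: first known keyword found in the input.
    text = turn.user_input.lower()
    turn.intent = next((k for k in KNOWLEDGE if k in text), "unknown")
    return turn


def retrieve(turn: Turn) -> Turn:
    if turn.intent in KNOWLEDGE:
        turn.documents = [KNOWLEDGE[turn.intent]]
    return turn


def generate(turn: Turn) -> Turn:
    # A real system would call a model here; this stub echoes the evidence.
    turn.answer = turn.documents[0] if turn.documents else "I don't know."
    return turn


def verify(turn: Turn) -> Turn:
    # Grounding check: the answer must come from a retrieved document.
    turn.verified = turn.answer in turn.documents
    return turn


def store(turn: Turn) -> Turn:
    turn.log.append(f"stored intent={turn.intent}")
    return turn


def monitor(turn: Turn) -> Turn:
    turn.log.append(f"verified={turn.verified}")
    return turn


def handoff(turn: Turn) -> Turn:
    # Escalate to a person whenever verification failed.
    turn.needs_human = not turn.verified
    return turn


STAGES = [understand, retrieve, generate, verify, store, monitor, handoff]


def run(user_input: str) -> Turn:
    turn = Turn(user_input)
    for stage in STAGES:
        turn = stage(turn)
    return turn
```

Calling `run("How do I get a refund?")` yields a verified answer, while an off-topic question falls through to "I don't know." and sets `needs_human`. Keeping each stage a separate function is one way to make the block diagram and the code line up.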

Part 2: Python & API (5 chapters)

#	Chapter	Core
4	Getting started with the API	Basic calls, system/user, error handling, retries
5	Prompts + CoT basics	Roles, few-shot, Chain-of-Thought, "I don't know"
6	Structured output	JSON Schema · Pydantic · validation · fallback
7	Streaming & UX	Token streams · partial render · cancel · timeout
8	Tool Calling basics	Function calling · param generation · safe execution

Deliverable: Python sample collection · structured-output PoC · first tool-calling example
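
The validation, retry, and fallback ideas from Chapters 4 and 6 can be sketched with the standard library alone. The schema in `REQUIRED_FIELDS`, the fallback value, and the `make_flaky_model` stub are all hypothetical; the chapter itself lists JSON Schema and Pydantic, which this sketch deliberately replaces with a hand-written check so that it runs without dependencies.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"title": str, "priority": int}
FALLBACK = {"title": "unparsed", "priority": 0}


def validate(raw: str) -> dict:
    """Parse model text as JSON and check required fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return data


def structured_call(model: Callable[[str], str], prompt: str,
                    max_attempts: int = 3) -> dict:
    """Call the model, validate the reply, retry on failure, then fall back."""
    for _ in range(max_attempts):
        try:
            return validate(model(prompt))
        except ValueError:
            continue  # a real client might feed the error back to the model
    return dict(FALLBACK)


def make_flaky_model() -> Callable[[str], str]:
    """Stub model: two bad replies, then a valid one."""
    replies = iter([
        "not json",
        '{"title": "Fix login", "priority": "high"}',
        '{"title": "Fix login", "priority": 2}',
    ])
    return lambda prompt: next(replies)
```

With the stub above, `structured_call(make_flaky_model(), "...")` succeeds on the third attempt; a model that never produces valid JSON ends at the fallback instead of raising.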

Part 3: RAG (6 chapters)

#	Chapter	Core
9	Why RAG	Freshness · grounding · how it differs from fine-tuning
10	Embeddings & vector search	Cosine · MMR · vector DB role
11	RAG pipeline	Ingest · chunk · embed · retrieve · generate · cite
12	Retrieval quality	Chunk size · top-k · metadata filter · hybrid (BM25 + dense) · reranking
13	Advanced RAG	HyDE · Self-RAG · GraphRAG · Agentic RAG
14	LangChain + multimodal RAG	Retriever · chain · prompt template. PDF layout & vision embeddings

Deliverable: document QA RAG PoC · retrieval-failure analysis · pipeline diagram
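
Chapter 10 names cosine similarity and MMR (Maximal Marginal Relevance). A minimal pure-Python sketch of both follows, assuming embeddings are plain lists of floats; the function names, the `lam` trade-off parameter, and the toy vectors in the usage note are illustrative choices, not taken from any particular vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query: list[float], docs: list[list[float]],
        k: int = 2, lam: float = 0.5) -> list[int]:
    """Return indices of up to k docs chosen by Maximal Marginal Relevance.

    lam = 1.0 ranks by relevance only; lower values penalize documents
    that are similar to ones already selected.
    """
    selected: list[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, docs[i])
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

For a query `[1.0, 1.0]` and documents `[[1.0, 0.8], [1.0, 0.75], [0.5, 1.0]]`, the first two documents are near-duplicates. With `lam=1.0` the selection is the plain top-2 by relevance (indices 0 and 1); with `lam=0.5` the second pick switches to index 2 because it adds more new information.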

Part 4: Evaluation, Reasoning, Debugging (5 chapters)

#	Chapter	Core
15	What to evaluate	Retrieval · generation · end-to-end · offline vs online
16	Building an eval set	Gold set · edge cases · coverage · classification
17	LLM-as-a-Judge	Judge model design · judge biases · calibration against human labels
18	Reasoning quality	CoT in depth · Self-Consistency · Best-of-N · Verifier models
19	Failure analysis	Separating prompt / retrieval / data / ranking / generation / tool failures

Deliverable: eval criteria doc · eval-set draft · failure-analysis report
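
Two of the Chapter 18 techniques reduce to a few lines once the sampled answers are in hand. The sketch below assumes the samples are already-extracted final answers as strings and that the list is non-empty; the `verifier` callable is a hypothetical stand-in for a verifier model.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(samples: Sequence[str]) -> tuple[str, float]:
    """Majority vote over final answers; returns (answer, agreement rate)."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def best_of_n(samples: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the sample that the verifier scores highest."""
    return max(samples, key=verifier)
```

Given the five samples `["42", "42 ", "41", "42", "17"]`, the vote returns `"42"` with an agreement rate of 0.6. The agreement rate is a cheap confidence signal: a low value is one possible trigger for escalation or for spending more samples.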

Part 5: Agents & LangGraph (6 chapters)

#	Chapter	Core
20	What is an agent	Model · Tool · Instruction (OpenAI's three pieces). LLM app vs agent
21	Agent patterns	Anthropic's 5 patterns (chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) + OpenAI's 2 patterns (manager, decentralized)
22	Tool Use in practice	Data · Action · Orchestration tools · ACI (Agent-Computer Interface) design · approval gates
23	LangGraph — state graphs	StateGraph · node · edge · conditional edge · reducer · checkpointer · interrupt
24	Agent memory	Thread-scoped vs cross-thread store · MemGPT · episodic · KV cache
25	Multi-agent	Planner/executor · researcher/writer · verifier/responder · the cost of over-decomposition

Deliverable: tool-using assistant PoC · LangGraph flow diagram · agent failure scenarios
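
The approval-gate idea from Chapter 22 can be shown without any agent framework. In this sketch the `Tool` dataclass, the two tools, and the `approve` callback are hypothetical; the tool-call shape (`name` plus `arguments`) is an assumption chosen for illustration, not a specific vendor's wire format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    run: Callable[[dict], str]
    requires_approval: bool = False  # gate for high-risk actions


def execute(tool_call: dict, tools: dict[str, Tool],
            approve: Callable[[str, dict], bool]) -> str:
    """Run one tool call, refusing unknown tools and unapproved risky ones."""
    tool = tools.get(tool_call["name"])
    if tool is None:
        return f"error: unknown tool {tool_call['name']!r}"
    args = tool_call.get("arguments", {})
    if tool.requires_approval and not approve(tool.name, args):
        return f"denied: {tool.name} needs human approval"
    return tool.run(args)


# A read-only data tool and a state-changing action tool (both toy stubs).
TOOLS = {
    "lookup_order": Tool("lookup_order", lambda a: f"order {a['id']}: shipped"),
    "issue_refund": Tool("issue_refund", lambda a: f"refunded {a['id']}",
                         requires_approval=True),
}
```

Returning an error string instead of raising lets the result be fed back to the model as an observation, which is one common design choice; raising and handling the exception in the agent loop is an equally reasonable alternative.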

Part 6: Production (5 chapters)

#	Chapter	Core
26	Production architecture	Request flow · model/retrieval split · session/memory · sync/async · rate limits
27	Observability	Logging · tracing · prompt/dataset versioning · latency/cost/quality metrics · LangSmith / Langfuse
28	Seven guardrails	Relevance · safety · PII · moderation · tool · rules-based · output validation (OpenAI's table)
29	Human-in-the-loop	Failure thresholds · high-risk actions · escalation · audit logs
30	Cost & latency	Prompt caching · model routing (Haiku ↔ Sonnet ↔ Opus) · Batch API · context compression

Deliverable: production architecture doc · observability metric table · safety guide · cost simulator
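
The cost simulator deliverable combines two Chapter 30 ideas, model routing and per-token pricing. The sketch below is a starting point under loud assumptions: the two tiers, their prices, and the routing rule are placeholders invented for illustration and do not reflect any vendor's actual price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_input: float   # PLACEHOLDER price, not a real quote
    usd_per_million_output: float  # PLACEHOLDER price, not a real quote


SMALL = Tier("small", 1.0, 5.0)
LARGE = Tier("large", 10.0, 50.0)


def route(prompt_tokens: int, needs_reasoning: bool) -> Tier:
    """Toy router: long or reasoning-heavy requests go to the large tier."""
    return LARGE if needs_reasoning or prompt_tokens > 4000 else SMALL


def request_cost(tier: Tier, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request on the given tier."""
    return (input_tokens * tier.usd_per_million_input
            + output_tokens * tier.usd_per_million_output) / 1_000_000


def total_cost(requests: list[tuple[int, int, bool]]) -> float:
    """Each request is (input_tokens, output_tokens, needs_reasoning)."""
    return sum(request_cost(route(i, r), i, o) for i, o, r in requests)
```

Swapping in real prices and a traffic sample from production logs turns this into a what-if tool: change the routing threshold and compare `total_cost` before and after.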

Part 7: Models & Fine-tuning (4 chapters)

#	Chapter	Core
31	Model architecture	Transformer · attention · instruction tuning · base vs chat · open vs hosted
32	When to fine-tune	Prompt / RAG / structured output come first. Data quantity and quality requirements
33	LoRA / QLoRA in practice	PEFT · QLoRA · data format · training loop
34	Small models, distillation, DPO	Latency / cost · distillation · DPO (as a preference-tuning step after SFT, alongside RLHF)

Deliverable: fine-tuning needs assessment · Colab notebook · small-model deployment ideas
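
One piece of arithmetic helps when assessing whether LoRA (Chapter 33) fits a budget. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors of shapes d_out × r and r × d_in, so the trainable count is r · (d_out + d_in). The helper names and the 4096 × 4096 example size below are illustrative choices.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the two LoRA factors for that matrix."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of the matrix's parameter count that LoRA actually trains."""
    return lora_params(d_out, d_in, rank) / full_params(d_out, d_in)
```

For a 4096 × 4096 matrix at rank 8, the dense matrix has 16,777,216 parameters and the LoRA factors have 65,536, which is 1/256 of the original.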

Capstone: Self-Improving Assistant

User feedback logs → automatic failure classification → convert to DPO data → weekly retraining loop. A miniature of the CS329A final project.

Deliverables: problem statement · architecture · data composition · Prompt/RAG/Agent strategy · evaluation results · failure analysis · ops considerations · self-improvement loop design
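
The first three steps of the capstone loop (feedback logs → failure classification → DPO data) can be prototyped in plain Python. Everything here is a hypothetical sketch: the log schema, the sample records, and the keyword-based classifier are invented for illustration, and the `prompt` / `chosen` / `rejected` keys are an assumed preference-pair layout that should be checked against whatever training library is used.

```python
from collections import defaultdict

# Hypothetical log records: one row per rated response (+1 liked, -1 disliked).
LOGS = [
    {"prompt": "Reset my password", "response": "Go to Settings > Security.", "rating": 1},
    {"prompt": "Reset my password", "response": "I cannot help with that.", "rating": -1},
    {"prompt": "Cancel my order", "response": "Order cancelled.", "rating": -1},
]


def classify_failure(record: dict) -> str:
    """Toy rule-based classifier; a real loop might use an LLM judge."""
    if record["rating"] > 0:
        return "ok"
    if "cannot" in record["response"].lower():
        return "refusal"
    return "other"


def to_dpo_pairs(logs: list[dict]) -> list[dict]:
    """Pair each liked response with each disliked one for the same prompt."""
    by_prompt = defaultdict(lambda: {"good": [], "bad": []})
    for rec in logs:
        bucket = "good" if rec["rating"] > 0 else "bad"
        by_prompt[rec["prompt"]][bucket].append(rec["response"])
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for prompt, groups in by_prompt.items()
        for good in groups["good"]
        for bad in groups["bad"]
    ]
```

Prompts with only negative feedback, such as "Cancel my order" above, produce no pair because there is nothing to prefer. Those records still matter for the failure-analysis report even though they cannot feed preference training directly.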

3. Reference map

University courses

CS329A: Self-Improving AI Agents (Stanford), the agent research frontier
CME 295: Transformers & LLMs (Stanford), the theoretical backbone
CS329T: Building and Evaluating Agentic Systems (Stanford), project-based eval discipline

Vendor engineering guides

Anthropic Building Effective Agents: 5 patterns
OpenAI A Practical Guide to Building Agents (PDF): 3 components, 7 guardrails
LangGraph official docs: StateGraph · checkpointer · memory
Claude Cookbook · OpenAI Cookbook · LangSmith / Langfuse tutorials

Recent papers (joined where used)

RAG: Self-RAG · HyDE · GraphRAG
Reasoning: Chain-of-Thought · Self-Consistency · Tree of Thoughts · Let's Verify Step by Step · Archon
Alignment: InstructGPT (RLHF) · DPO · Constitutional AI
Agents: ReAct · Reflexion · Voyager · SWE-agent · MemGPT · CodeMonkeys
Efficient FT: LoRA · QLoRA

Primary-source first

This book does not paraphrase other people's summaries. Each chapter ends with primary-source links, and the book's job is to say in what order to read them and how. It is the navigation, not the destination.

4. Stanford course mapping

CME 295	Our chapter
Lec 1–3 (Transformer · LLM)	Part 1 Ch 2, Part 7 Ch 31
Lec 3 (prompting / ICL)	Part 2 Ch 5
Lec 4 (training, quantization, LoRA)	Part 7 Ch 33
Lec 5 (tuning, RLHF, DPO)	Part 7 Ch 34
Lec 6 (reasoning)	Part 4 Ch 18
Lec 7 (RAG, function calling, ReAct)	Part 3 · Part 5
Lec 8 (LLM-as-a-Judge)	Part 4 Ch 17

CS329A	Our chapter
Lec 2–3 (test-time compute, verification)	Part 4 Ch 18
Lec 4–5 (ReAct, multi-step)	Part 5 Ch 20–22
Lec 14 (memory)	Part 5 Ch 24
Lec 13 (SWE agents, CodeMonkeys)	Part 5 Ch 25 + Capstone
Lec 17 (long-horizon eval)	Part 4 Ch 15
Lec 7 (self-evolution) · Final project	Capstone

5. Prerequisites

Python

Functions, classes, async basics. Virtual environments and pip.

Shell

Common commands and environment variables. Skip if you only use Colab.

Reading math

Matrix multiplication, probability, softmax. Part 1 covers what you need.

ML basics

Helpful, not required.

6. What you'll be able to do

A self-check:

7. Suggested grading split

Weekly assignments 30% · hands-on PoCs 30% · mid-course design review 15% · final project 25%

Pass tiers:

Beginner: can sketch the structure and explain the terms
Basic: submits a RAG PoC, a structured-output PoC, and an eval set
Advanced: delivers an agent, an ops design, guardrails, and an improvement report
Enterprise: finishes the capstone (Self-Improving Assistant)

8. 14-week schedule

Week	Content
1	Ch 1, 2: why models · LLM basics
2	Ch 3, 4: assistant structure · first API call
3	Ch 5, 6: prompts + CoT · structured output
4	Ch 7, 8: streaming/UX · tool calling
5	Ch 9, 10: why RAG · embeddings/vector search
6	Ch 11, 12: pipeline · retrieval quality
7	Ch 13, 14: Advanced RAG · LangChain + multimodal
8	Ch 15, 16: eval criteria · eval set
9	Ch 17, 18: LLM-as-a-Judge · reasoning quality
10	Ch 19, 20: failure analysis · what is an agent
11	Ch 21, 22: agent patterns · tool use
12	Ch 23, 24: LangGraph · agent memory
13	Ch 25, 26, 27: multi-agent · production architecture · observability
14	Ch 28, 29, 30 · Part 7 overview · capstone review

Part 7 (fine-tuning) is treated as a deep-dive option. The 14-week course covers Parts 1–6; budget 16–18 weeks for all 34 chapters.

9. Priority guide

Must do first — code vs models · LLM basics · structured output · RAG · evaluation · agent patterns · guardrails

Then — agent depth · production architecture · observability · cost/latency · agent memory

Later — fine-tuning · distillation · small models

10. One sentence

Don't go deep into models first; learn the blocks of a real assistant and design clearly across Prompt / RAG / Agent / Evaluation / Operations / Guardrails.
