Ch 13. Advanced RAG¶
What you'll learn
- HyDE (Hypothetical Document Embeddings) — embed a hypothetical answer instead of the raw query
- Self-RAG — let the LLM judge whether retrieval is needed, and whether results are good
- GraphRAG — entity-graph-based reasoning (concept intro)
- Agentic RAG — search as a tool inside the Part 5 agent loop
- Query rewriting, multi-query, recursive retrieval
- "Advanced doesn't always mean better" — cost, latency, and complexity trade-offs
Prerequisites
Ch 12 — hybrid search, rerankers, and metadata filters feel natural.
1. Concept — when basic RAG hits its ceiling¶
The pipelines from Ch 11–12 handle 80% of queries. But a few patterns make basic RAG struggle:
| Failure mode | Example | Fix |
|---|---|---|
| Short, vague queries | "AI?" · "policy?" | HyDE · Query Rewriting |
| Queries you can answer without retrieval | "Hi there" → RAG waste | Self-RAG |
| Questions that need multiple sources stitched together | "Who's the budget owner for Team A × Project B?" | GraphRAG |
| Queries that need search → compute → search again | "This year's growth rate vs. last year?" | Agentic RAG |
Each variant solves one specific bottleneck. In practice, you combine them.
2. Why you need them — three structural limits of basic RAG¶
- Query ↔ document representation mismatch — users write short queries ("refund?") while documents are long. Embedding distance grows.
- Assumes retrieval is always needed — small talk, quick math, greetings waste retrieval budgets.
- Assumes one search is enough — complex questions need multiple lookups.
Each variant addresses one of these.
3. Where they're used¶
| Technique | When it shines | Cost |
|---|---|---|
| HyDE | Short queries · expert domains | +1 LLM call |
| Self-RAG | General chatbots · cut unnecessary searches | +1–2 LLM calls |
| GraphRAG | Multi-entity questions · summarization | Big index build |
| Agentic RAG | Complex questions · external tools | Multiple loop iterations (Part 5) |
4. HyDE — minimal example¶
The idea: short query embeddings are noisy. Generate a fake answer with an LLM, embed that longer text instead → better document matches.
Result: short queries see 20–40% better recall (depends on domain).
Trap: if the LLM makes up a wrong answer, you'll search in the wrong direction → see §6.
5. Hands-on¶
5.1 Query Rewriting / Expansion¶
HyDE's cousin: rephrase and expand the query with an LLM, then do multi-query retrieval.
5.2 Self-RAG — decide whether to search¶
Win: "Hi" doesn't waste retrieval budget. Incomplete results escalate to a human.
Cost: 2–3× more LLM calls. Adds latency and expense — gate this carefully, don't use it for every query.
5.3 Multi-step / Recursive Retrieval¶
Complex questions need first search + follow-up search cycles:
Q: "This year's growth rate vs. last year?"
Step 1: search "last year's revenue" → found $10B
Step 2: search "this year's revenue" → found $13B
Step 3: LLM calculates → 30%
Implementation is Part 5 Agent territory. Just the concept here.
5.4 GraphRAG — concept intro¶
Microsoft's GraphRAG (2024):
- Extract entities and relationships from documents (LLM-powered)
- Store as a knowledge graph (Neo4j, NetworkX)
- At query time, combine graph traversal + vector search
- Strong on multi-hop questions ("Who is X's manager's manager?")
Pros: great for structural queries.
Cons: index build cost is 10–100× basic RAG. Massive LLM calls for entity extraction.
Recommendation: only if you have a small domain (thousands of docs), need summaries, or face many multi-hop questions. General FAQ chatbots don't need this.
5.5 Agentic RAG — the bridge to Part 5¶
Agents (Part 5) have search as a callable tool and invoke it multiple times:
Agent loop:
1. receive user question
2. think "what do I need to know?"
3. call search_policy(query="refund") tool
4. see results, think "not enough"
5. call search_database(...)
6. satisfied → generate final answer
Combines Part 2 Ch 8 (Tool Calling) + Part 5 (Agent patterns). Strong for complex questions and multi-domain scenarios.
6. Pitfalls — where these break¶
Pitfall 1. Blindly adding 'Advanced' to everything
HyDE and Self-RAG add LLM calls. If basic RAG already works, you're paying more for no gain.
Fix: A/B test with Part 4 evaluation. If gain < 5%, hold off.
Pitfall 2. HyDE hallucinations point you wrong
A bad hypothetical answer steers search off course → wrong final answer.
Fix: (a) keep hypothetical answers to 2–3 sentences, (b) search with both raw query and HyDE embedding (RRF merge), (c) track hallucination rate.
Pitfall 3. Self-RAG misjudges and hallucinates
Model decides "no retrieval needed," then invents an answer (hallucination risk returns).
Fix: conservative prompt ("if any company context might help, say YES"). Or always search but only evaluate result quality.
Pitfall 4. GraphRAG indexing costs you more than you save
100,000 documents → 100,000 LLM calls for entity extraction. Easy six figures in cost.
Fix: pilot on 1,000 docs to estimate cost; scale gradually. Use a cheap fast model (Haiku).
Pitfall 5. Stacking all techniques at once
HyDE + Self-RAG + Rerank + Multi-query = 5–7 LLM calls per query, 3–5 second latency, 10× cost.
Fix: Add one at a time. Measure the actual gain. Kill it if cost > benefit.
7. Production checklist¶
- Advanced techniques chosen via A/B testing on evaluation set (Part 4)
- Dashboard tracking LLM calls, cost, latency per query
- If using HyDE, monitor hallucination rate of fake answers
- Self-RAG gate accuracy (TP/FP of "needs search" decision)
- GraphRAG index build and update costs budgeted separately
- Multi-technique sequence documented (Query rewrite → HyDE → hybrid search → rerank …)
- Agentic RAG uses Part 5 agent operations guidelines alongside this
8. Exercises¶
- Run §4's
hyde.py. Compare top-5 results to basic RAG on the same query — measure relevance ratio. - Evaluate §5.2 Self-RAG on 10 queries (half need retrieval, half are chitchat) — what's the decision accuracy?
- Try Query Rewriting on cross-language queries ("refund policy" + Korean documents) — note recall change.
- Compare HyDE + Rerank vs. Basic + Rerank — latency and quality trade-off?
- GraphRAG: document the concept, don't code it yet. Create a decision tree: is it right for my project?
9. Sources and further reading¶
- HyDE: Gao et al. (2022), "Precise Zero-Shot Dense Retrieval without Relevance Labels"
- Self-RAG: Asai et al. (2023), "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection"
- GraphRAG: Edge et al., Microsoft (2024), "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" · github.com/microsoft/graphrag
- Agentic RAG: LangChain, LlamaIndex tutorials
- Stanford CME 295 Lec 7 — in project
_research/stanford-cme295.md
Next → Ch 14. LangChain in Practice + Multimodal RAG
Assemble RAG pipelines fast with a framework, then unlock PDF layout and image-based retrieval.