Ch 13. Advanced RAG¶

What you'll learn

HyDE (Hypothetical Document Embeddings) — embed a hypothetical answer instead of the raw query
Self-RAG — let the LLM judge whether retrieval is needed, and whether results are good
GraphRAG — entity-graph-based reasoning (concept intro)
Agentic RAG — search as a tool inside the Part 5 agent loop
Sufficient Context (added 2026) — decide when to stop or abstain by "sufficiency," not "relevance"
Query rewriting, multi-query, recursive retrieval
"Advanced doesn't always mean better" — cost, latency, and complexity trade-offs

Prerequisites

Ch 12 — hybrid search, rerankers, and metadata filters feel natural.

1. Concept — when basic RAG hits its ceiling¶

The pipelines from Ch 11–12 handle 80% of queries. But a few patterns make basic RAG struggle:

Failure mode	Example	Fix
Short, vague queries	"AI?" · "policy?"	HyDE · Query Rewriting
Queries you can answer without retrieval	"Hi there" → RAG waste	Self-RAG
Questions that need multiple sources stitched together	"Who's the budget owner for Team A × Project B?"	GraphRAG
Queries that need search → compute → search again	"This year's growth rate vs. last year?"	Agentic RAG

Each variant solves one specific bottleneck. In practice, you combine them.

Advanced RAG variants

2. Why you need them — three structural limits of basic RAG¶

Query ↔ document representation mismatch — users write short queries ("refund?") while documents are long. Embedding distance grows.
Assumes retrieval is always needed — small talk, quick math, greetings waste retrieval budgets.
Assumes one search is enough — complex questions need multiple lookups.

Each variant addresses one of these.

3. Where they're used¶

Technique	When it shines	Cost
HyDE	Short queries · expert domains	+1 LLM call
Self-RAG	General chatbots · cut unnecessary searches	+1–2 LLM calls
GraphRAG	Multi-entity questions · summarization	Big index build
Agentic RAG	Complex questions · external tools	Multiple loop iterations (Part 5)

4. HyDE — minimal example¶

The idea: short query embeddings are noisy. Generate a fake answer with an LLM, embed that longer text instead → better document matches.

HyDE details

hyde.py
from anthropic import Anthropic
from openai import OpenAI

anthropic = Anthropic()
openai = OpenAI()

def hyde_search(query: str, col, k: int = 5):
    # 1) Ask LLM to generate a hypothetical answer
    r = anthropic.messages.create(
        model="claude-haiku-4-5", max_tokens=256,
        system="Write a factual answer to the question below. It doesn't need to be perfect. Keep it to 4 sentences.",
        messages=[{"role": "user", "content": query}],
    )
    hypothetical = r.content[0].text

    # 2) Embed the hypothetical answer
    emb = openai.embeddings.create(
        model="text-embedding-3-small",
        input=[hypothetical],
    ).data[0].embedding

    # 3) Search for real documents using the hypothetical embedding
    res = col.query(query_embeddings=[emb], n_results=k)
    return res

# Usage
results = hyde_search("refund policy?", col)

Result: can improve recall for short, ambiguous queries. Measure the gain on the domain eval set.

Trap: if the LLM makes up a wrong answer, you'll search in the wrong direction → see §6.

5. Hands-on¶

5.1 Query Rewriting / Expansion¶

HyDE's cousin: rephrase and expand the query with an LLM, then do multi-query retrieval.

query_expansion.py
def expand_query(q: str) -> list[str]:
    r = anthropic.messages.create(
        model="claude-haiku-4-5", max_tokens=256,
        system="Rewrite the user's query as 3 different questions with the same meaning. Return only a JSON array.",
        messages=[{"role": "user", "content": q}],
    )
    import json
    return json.loads(r.content[0].text)

# Search with each variant, then merge with RRF
variants = [query] + expand_query(query)
ranked_lists = [search(v) for v in variants]
final = rrf_merge(ranked_lists)   # Reuse Ch 12's RRF

5.2 Self-RAG — decide whether to search¶

self_rag.py
def self_rag(query: str) -> str:
    # 1) Decide: does this query need external documents?
    decision = anthropic.messages.create(
        model="claude-haiku-4-5", max_tokens=10,
        system="""Does answering this user's question require searching external documents?
Answer only YES or NO.""",
        messages=[{"role": "user", "content": query}],
    ).content[0].text.strip()

    context = ""
    if decision.startswith("YES"):
        # 2) Retrieve + assess result quality
        retrieved = search(query, k=5)
        context = format_context(retrieved)

        # 3) Self-assess: are these results good enough?
        quality = anthropic.messages.create(
            model="claude-haiku-4-5", max_tokens=10,
            system=f"""Can you answer the question with the search results below?
{context}
YES/NO""",
            messages=[{"role": "user", "content": query}],
        ).content[0].text.strip()
        if quality.startswith("NO"):
            # Fallback: rewrite query, retry, or escalate
            return "Search results aren't sufficient. I'll connect you with a specialist."

    # 4) Generate final answer (with or without context)
    r = anthropic.messages.create(
        model="claude-haiku-4-5", max_tokens=512,
        system=(context or "Answer from general knowledge."),
        messages=[{"role": "user", "content": query}],
    )
    return r.content[0].text

Win: "Hi" doesn't waste retrieval budget. Incomplete results escalate to a human.

Cost: 2–3× more LLM calls. Adds latency and expense — gate this carefully, don't use it for every query.

5.3 Multi-step / Recursive Retrieval¶

Complex questions need first search + follow-up search cycles:

Q: "This year's growth rate vs. last year?"

Step 1: search "last year's revenue" → found $10B
Step 2: search "this year's revenue" → found $13B
Step 3: LLM calculates → 30%

Implementation is Part 5 Agent territory. Just the concept here.

5.4 GraphRAG — concept intro¶

Microsoft's GraphRAG (2024):

Extract entities and relationships from documents (LLM-powered)
Store as a knowledge graph (Neo4j, NetworkX)
At query time, combine graph traversal + vector search
Strong on multi-hop questions ("Who is X's manager's manager?")

Pros: great for structural queries.
Cons: index build cost is 10–100× basic RAG. Massive LLM calls for entity extraction.

Recommendation: only if you have a small domain (thousands of docs), need summaries, or face many multi-hop questions. General FAQ chatbots don't need this.

5.5 Agentic RAG — the bridge to Part 5¶

Agents (Part 5) have search as a callable tool and invoke it multiple times:

Agent loop:
  1. receive user question
  2. think "what do I need to know?"
  3. call search_policy(query="refund") tool
  4. see results, think "not enough"
  5. call search_database(...)
  6. satisfied → generate final answer

Combines Part 2 Ch 8 (Tool Calling) + Part 5 (Agent patterns). Strong for complex questions and multi-domain scenarios.

5.6 Sufficient Context — using "is it enough?" as a stop signal¶

§5.2 Self-RAG asks the LLM, on the spot, "can you answer with these results?" Google's research (Sufficient Context, ICLR 2025) formalizes that intuition into a separate signal.

The key distinction is relevant ≠ sufficient:

Relevant — the retrieved document is on-topic
Sufficient — that document alone makes a definitive answer possible

You can retrieve relevant documents and still be missing the key fact — that's insufficient. (A generic explanation of the HTTP 404 error is relevant to "which lab housed Room 404," but not sufficient.)

The RAG paradox — context erodes abstention

Adding context makes the model overconfident, so instead of abstaining when it doesn't know, it confidently hallucinates. In one study, Claude 3.5 Sonnet's abstention rate dropped from 84% to 52% with RAG. Giving it more to read actually weakened its ability to say "I don't know" — the same root cause as the Self-RAG misjudgment in Pitfall 3.

The fix is selective generation, which combines two signals to decide whether to answer or abstain:

Self-rated confidence — sample multiple answers and let the model self-assess correctness (P(True))
Sufficient-context label — a binary signal from a sufficiency autorater. Google reached ~93% accuracy with prompted Gemini 1.5 Pro alone (no fine-tuning).

A logistic regression combines both signals to predict "hallucination probability," and abstains past a threshold. It trains without ground-truth labels and lifts the fraction of correct answers among answered questions by 2–10%.

Productized (2026): Google wired this rater into Gemini Enterprise as the stopping condition of the Agentic RAG loop (the Sufficient Context Agent). It replaces §5.5's on-the-spot "enough → answer / not enough → search again" YES/NO with a learned signal. The full paper-to-product story, plus a 4-step hands-on recipe, is covered in blog #34.

6. Pitfalls — where these break¶

Pitfall 1. Blindly adding 'Advanced' to everything

HyDE and Self-RAG add LLM calls. If basic RAG already works, you're paying more for no gain.
Fix: A/B test with Part 4 evaluation. If gain < 5%, hold off.

Pitfall 2. HyDE hallucinations point you wrong

A bad hypothetical answer steers search off course → wrong final answer.
Fix: (a) keep hypothetical answers to 2–3 sentences, (b) search with both raw query and HyDE embedding (RRF merge), (c) track hallucination rate.

Pitfall 3. Self-RAG misjudges and hallucinates

Model decides "no retrieval needed," then invents an answer (hallucination risk returns).
Fix: conservative prompt ("if any company context might help, say YES"). Or always search but only evaluate result quality.

Pitfall 4. GraphRAG indexing costs you more than you save

100,000 documents → 100,000 LLM calls for entity extraction. Easy six figures in cost.
Fix: pilot on 1,000 docs to estimate cost; scale gradually. Use a cheap fast model (Haiku).

Pitfall 5. Stacking all techniques at once

HyDE + Self-RAG + Rerank + Multi-query = 5–7 LLM calls per query, 3–5 second latency, 10× cost.
Fix: Add one at a time. Measure the actual gain. Kill it if cost > benefit.

7. Production checklist¶

Advanced techniques chosen via A/B testing on evaluation set (Part 4)
Dashboard tracking LLM calls, cost, latency per query
If using HyDE, monitor hallucination rate of fake answers
Self-RAG gate accuracy (TP/FP of "needs search" decision)
GraphRAG index build and update costs budgeted separately
Multi-technique sequence documented (Query rewrite → HyDE → hybrid search → rerank …)
Agentic RAG uses Part 5 agent operations guidelines alongside this

8. Exercises¶

Run §4's hyde.py. Compare top-5 results to basic RAG on the same query — measure relevance ratio.
Evaluate §5.2 Self-RAG on 10 queries (half need retrieval, half are chitchat) — what's the decision accuracy?
Try Query Rewriting on cross-language queries ("refund policy" + Korean documents) — note recall change.
Compare HyDE + Rerank vs. Basic + Rerank — latency and quality trade-off?
GraphRAG: document the concept, don't code it yet. Create a decision tree: is it right for my project?

9. Sources and further reading¶

HyDE: Gao et al. (2022), "Precise Zero-Shot Dense Retrieval without Relevance Labels"
Self-RAG: Asai et al. (2023), "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection"
GraphRAG: Edge et al., Microsoft (2024), "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" · github.com/microsoft/graphrag
Agentic RAG: LangChain, LlamaIndex tutorials
Sufficient Context: Joren et al. (2024), "Sufficient Context: A New Lens on Retrieval Augmented Generation Systems" (ICLR 2025) · arXiv 2411.06037 · productized: Google Research (2026)
Stanford CME 295 Lec 7 — in project _research/stanford-cme295.md

Next → Ch 14. LangChain in Practice + Multimodal RAG
Assemble RAG pipelines fast with a framework, then unlock PDF layout and image-based retrieval.