The Complete RAG Guide — Retrieval-Augmented Generation

WHY RAG?

Why Do We Need RAG?

LLMs are powerful, but they have fundamental limitations that are hard to overcome on their own.

Hallucination

LLMs can fabricate plausible-sounding answers when the information is not in their training data. They state wrong answers as confidently as right ones, with no built-in fact-checking, which is among their most serious weaknesses.

Knowledge Cutoff

LLMs have no knowledge of anything after their training data cutoff date. They cannot access yesterday's news, the latest API changes, or real-time data.

Lack of Domain Knowledge

LLMs cannot answer questions about private data such as internal company documents, and they are often unreliable on specialized medical or legal knowledge. Fine-tuning to close that gap is costly and time-consuming.

Opaque Sources

LLMs cannot reliably cite their sources. They can't answer "Where did you get this information?", which makes their output difficult to verify.

CONCEPT

What is RAG?

Retrieval-Augmented Generation — generation augmented by retrieval.
Before generating an answer, the LLM retrieves relevant documents and uses them as reference.

📝

Standard LLM

Closed-book exam
Relies only on memorized knowledge

VS

📖

RAG

Open-book exam
Answers while consulting reference materials

Core Pipeline

Query (user question) → Retrieve (search relevant documents) → Augment (insert into prompt) → Generate (LLM generates response)
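The four pipeline steps can be sketched in a few lines. This is a minimal illustration, not a production design: the toy corpus, the word-overlap `retrieve` function (a stand-in for vector search), and the `echo_llm` placeholder are all hypothetical names introduced here.

```python
from typing import Callable, List

# Toy corpus (hypothetical); in practice these would be chunks in a vector database.
DOCS = [
    "Employees accrue 15 days of annual leave per year.",
    "Expense reports are due by the 5th of each month.",
    "The office is closed on public holidays.",
]

def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Retrieve: rank documents by word overlap with the query (stand-in for vector search)."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def augment(query: str, context: List[str]) -> str:
    """Augment: insert the retrieved text into the prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"

def rag_answer(query: str, docs: List[str], generate: Callable[[str], str], k: int = 2) -> str:
    """Query -> Retrieve -> Augment -> Generate."""
    context = retrieve(query, docs, k)
    prompt = augment(query, context)
    return generate(prompt)

def echo_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption: any function mapping prompt -> text)."""
    return f"[LLM would answer from a prompt of {len(prompt)} characters]"

print(rag_answer("how many days of annual leave", DOCS, echo_llm))
```

Swapping `echo_llm` for a real model client and `retrieve` for a vector search leaves `rag_answer` unchanged, which is the main reason to keep the steps separate.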

01

Embedding

Converting text into numerical vectors. Since "cat" and "kitten" have similar meanings, they are positioned close together in vector space. This enables semantic search rather than keyword matching.

[Figure: example words plotted in vector space: cat, kitten, puppy, car, vehicle]

02

Vector Database

A specialized DB that stores embedding vectors and performs fast similarity search. It finds and returns the document vectors closest to the query vector.

Pinecone, Weaviate, Chroma, Qdrant, pgvector

03

Semantic Search

Searching by semantic similarity rather than keyword matching. Searching for "how to get a raise" can also find "salary negotiation strategies". Cosine similarity scores how closely two vectors point in the same direction: values near 1 mean very similar, values near 0 mean unrelated.

"how to get a raise"

Salary negotiation strategies 0.92

Compensation framework guide 0.85

Company benefits overview 0.61
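To make the scoring concrete, here is a small sketch of cosine similarity ranking. The three-dimensional vectors are hand-made for illustration only; a real embedding model produces hundreds or thousands of dimensions, and the scores it yields will differ from the figures in the example above.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model).
VECTORS: Dict[str, List[float]] = {
    "Salary negotiation strategies": [0.9, 0.4, 0.1],
    "Compensation framework guide": [0.7, 0.6, 0.3],
    "Company benefits overview": [0.2, 0.5, 0.8],
}
QUERY_VEC = [1.0, 0.3, 0.0]  # stands in for the embedding of "how to get a raise"

def rank(query_vec: List[float], vectors: Dict[str, List[float]]) -> List[Tuple[float, str]]:
    """Return (score, title) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, v), title) for title, v in vectors.items()]
    return sorted(scored, reverse=True)

for score, title in rank(QUERY_VEC, VECTORS):
    print(f"{title}: {score:.2f}")
```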

Original Paper: Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"

Proposed two variants: RAG-Sequence and RAG-Token. Uses DPR (Dense Passage Retrieval) for retrieval and BART for generation. Presented at NeurIPS 2020, marking the starting point of RAG research.

EVOLUTION

The Evolution of RAG

RAG has evolved across three generations. Let's examine how each generation overcomes the limitations of its predecessor.

2020-2022

Naive RAG

A fixed pipeline. Query → Retrieve → Generate. Simple but with clear limitations.

Query → Retrieve → Generate

2023-2024

Advanced RAG

Optimizes pre/during/post retrieval. Adds query rewriting, hybrid search, reranking, and more.

Query Optimization → Hybrid Retrieve → Rerank → Generate

2024 onward

Modular RAG

Modular composition. Dynamically configures routers, evaluators, iterative retrieval, and more to match the task.

Router → [Dynamic Module Composition] → Self-Evaluation → Iterate/Complete

STAGE 1

Naive RAG

The most basic "Retrieve → Read" pipeline. The starting point of RAG, but with clear limitations.

Limitations of Naive RAG

01

Garbage In, Garbage Out

When retrieval quality is poor, the response degrades too. If irrelevant documents are retrieved, the LLM generates incorrect answers based on them.

02

Simplistic Chunking

Splitting documents by fixed size cuts context or mixes multiple topics into a single chunk, creating noise.

03

Indiscriminate Passing

Retrieved documents are passed to the LLM as-is even when irrelevant to the query. There is no filtering or quality assessment.

04

No Dedup/Conflict Handling

There is no logic to handle duplicate documents or contradictory information, which can confuse the LLM.

STAGE 2

Advanced RAG

Systematically addresses the limitations of Naive RAG at each stage: pre/during/post retrieval.

Query Rewriting

Transforms user questions into forms optimized for retrieval. Converts colloquial language into search queries, splits compound questions into single queries, etc.

"How do I use my vacation days?"

→

"Annual leave usage procedures and policies"

HyDE Gao et al., 2022

Hypothetical Document Embeddings — The LLM first generates a hypothetical answer, then uses its embedding for retrieval. The insight is that answers are more similar to documents than questions are.

Question → Generate hypothetical answer → Search using answer embedding
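The flow above can be sketched as follows. Everything here is a toy stand-in: `embed` is a bag-of-words counter rather than a real embedding model, and `fake_llm` returns a canned hypothetical answer in place of an LLM call. Both names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Callable, List

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing tokens
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hyde_search(question: str, docs: List[str],
                write_hypothetical_answer: Callable[[str], str], k: int = 1) -> List[str]:
    """Embed a hypothetical answer instead of the question, then rank documents against it."""
    hypothetical = write_hypothetical_answer(question)
    query_vec = embed(hypothetical)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def fake_llm(question: str) -> str:
    """Placeholder for an LLM call; the answer need not be factually correct."""
    return "To take annual leave, submit a leave request to your manager in advance."

DOCS = [
    "Expense reports are due by the 5th of each month.",
    "Annual leave requests must be submitted to your manager two weeks in advance.",
]
QUESTION = "How do I use my vacation days?"
print(hyde_search(QUESTION, DOCS, fake_llm))
```

In this toy setup the raw question shares no words with either document, so embedding it directly gives a similarity of zero for both, while the hypothetical answer overlaps heavily with the leave policy document.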

Query Expansion

Expands a single question from multiple perspectives to broaden the search scope. Techniques include Multi-Query, Sub-Question Decomposition, and more.

Advanced Chunking Strategies

Splits documents by semantic units rather than fixed sizes.

Recursive: Recursively split using a separator hierarchy. A strong baseline that still needs evaluation.

Semantic: Split at semantic shift points. High recall.

Parent-Child: Search with small chunks, provide context with large chunks.

Sentence Window: Sentence-level search plus surrounding sentences.
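As an illustration of the recursive strategy, here is a minimal sketch that tries coarse separators first (paragraphs), then finer ones (lines, sentences, words), and merges neighbouring pieces back together while they fit. The separator order and the character-based size limit are assumptions; library splitters typically count tokens and handle overlap, which this sketch omits.

```python
from typing import List, Sequence

def recursive_split(text: str, max_chars: int,
                    separators: Sequence[str] = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to hard cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces: List[str] = []
    for part in text.split(sep):
        # Parts that are still too long are split again with finer separators.
        pieces.extend(recursive_split(part, max_chars, finer))

    # Greedily merge neighbouring pieces while the result still fits.
    # Note: the separator is dropped at chunk boundaries in this simplified version.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

TEXT = ("RAG retrieves documents.\n\n"
        "It then augments the prompt. Finally the model generates an answer.")
for chunk in recursive_split(TEXT, max_chars=40):
    print(repr(chunk))
```

The paragraph break is honoured first, and only the second paragraph, which is too long for one chunk, is split further at the sentence boundary.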

Contextual Retrieval Anthropic, 2024

Adds document-level context as a prefix to each chunk before embedding. By prepending context like "This chunk is excerpted from the Q3 performance section of the 2024 revenue report," retrieval accuracy improves significantly.
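The prefixing step itself is simple, as this sketch shows. In the published approach an LLM writes the context for each chunk; here `title_based_context` is a hypothetical placeholder that just reuses the document's first line, and the sample document is invented for illustration.

```python
from typing import Callable, List

def contextualize(chunks: List[str], document: str,
                  describe: Callable[[str, str], str]) -> List[str]:
    """Prepend a short, document-aware context line to each chunk before embedding."""
    return [f"{describe(document, chunk)}\n\n{chunk}" for chunk in chunks]

def title_based_context(document: str, chunk: str) -> str:
    """Placeholder for an LLM call that would situate the chunk within the document."""
    title = document.strip().splitlines()[0]
    return f"This chunk is excerpted from {title}."

DOCUMENT = "2024 Revenue Report\nQ3 performance\nRevenue grew compared with Q2."
CHUNKS = ["Revenue grew compared with Q2."]

for text in contextualize(CHUNKS, DOCUMENT, title_based_context):
    print(text)  # this combined text is what gets embedded and indexed
```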

Metadata Tagging

Attaches supplementary information such as date, source, category, and author to documents. When combined with filtering, retrieval precision increases significantly.

Hybrid Search

Combines Dense (vector) search and Sparse (BM25 keyword) search. Capturing both semantic similarity and exact keyword matches is usually more robust than either method alone.

Dense (Vector)

Semantic similarity based

+

Sparse (BM25)

Keyword matching based

=

Hybrid

Optimal search results
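The text does not say how the two result lists are merged. One common choice, used here purely as an assumption, is Reciprocal Rank Fusion (RRF), which needs only the rank positions and so avoids reconciling BM25 scores with cosine scores. The constant `k=60` is a conventional default, and the document IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_results = ["A", "B", "C"]   # hypothetical vector search ranking
sparse_results = ["C", "A", "D"]  # hypothetical BM25 ranking

print(reciprocal_rank_fusion([dense_results, sparse_results]))
```

Documents that appear high in both lists (A and C here) rise above documents found by only one retriever.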

Fine-tuned Embedding

Further trains embedding models with domain-specific data. Gains vary substantially by dataset and setup, so compare against the general model on your actual domain eval set.

Multi-Index Strategy

Layers summary indexes, full-text indexes, and metadata indexes. First narrows down relevant documents via summaries, then retrieves details from full text.

Late Interaction (ColBERT)

Stores per-token embeddings and computes similarity via MaxSim. Achieves a good balance between Bi-Encoder speed and Cross-Encoder accuracy. ColPali/ColQwen extend this to multimodal.
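MaxSim can be written down directly: for each query token, take its best match among the document's tokens, then sum those maxima. The two-dimensional unit vectors below are hypothetical; real token embeddings come from the model and have far more dimensions.

```python
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the max similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical per-token embeddings (unit length, so the dot product equals cosine similarity).
QUERY = [[1.0, 0.0], [0.0, 1.0]]
DOC_A = [[1.0, 0.0], [0.6, 0.8]]  # has a good match for each query token
DOC_B = [[0.8, 0.6]]              # a single token that partly matches both

print(maxsim_score(QUERY, DOC_A), maxsim_score(QUERY, DOC_B))
```

Because document token vectors can be computed and stored ahead of time, only the cheap max-and-sum runs at query time, which is where the speed advantage over a Cross-Encoder comes from.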

Reranker

Re-evaluates search results using a Cross-Encoder and reorders the rankings. Since it examines question-document pairs together, it enables more precise relevance judgments.

Search Results (Before)

Doc A 0.82

Doc B 0.79

Doc C 0.77

Doc D 0.75

Rerank

Reranked (After)

Doc C 0.95

Doc A 0.88

Doc D 0.72

Doc B 0.45

Evaluate candidates such as Cohere Rerank, bge-reranker, and FlashRank using MAP or nDCG together with added latency.
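As a sketch of how such a comparison might be scored, the following computes nDCG for the orderings before and after reranking. The relevance grades (0 to 3) are hypothetical labels invented for this example, and the ideal ordering is taken from the same four documents, which assumes every judged document appears in the list.

```python
import math
from typing import Dict, List

def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances))

def ndcg_at_k(relevances: List[float], k: int) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical human relevance labels for the four documents.
LABELS: Dict[str, int] = {"Doc A": 2, "Doc B": 0, "Doc C": 3, "Doc D": 1}

before = ["Doc A", "Doc B", "Doc C", "Doc D"]  # order from the initial search
after = ["Doc C", "Doc A", "Doc D", "Doc B"]   # order after reranking

ndcg_before = ndcg_at_k([LABELS[d] for d in before], k=4)
ndcg_after = ndcg_at_k([LABELS[d] for d in after], k=4)
print(f"before={ndcg_before:.3f} after={ndcg_after:.3f}")
```

Measuring wall-clock time around the reranker call on the same queries gives the latency side of the trade-off.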

Document Compression

Extracts only relevant portions from retrieved documents to save context window space. Methods like LongLLMLingua compress the retrieved context in a question-aware way, keeping the parts most relevant to the query.

Deduplication & Filtering

Removes duplicate documents and ensures diversity using similarity thresholds, MMR (Maximal Marginal Relevance), and other techniques.
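A minimal sketch of MMR follows. At each step it picks the candidate with the best trade-off between relevance to the query and similarity to what has already been selected. The `lambda_` weight and the two-dimensional vectors (built from angles so the geometry is easy to see) are assumptions chosen for illustration.

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec: List[float], doc_vecs: List[List[float]], k: int,
        lambda_: float = 0.5) -> List[int]:
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        relevance = cosine(query_vec, doc_vecs[i])
        # Redundancy is the highest similarity to any already-selected document.
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_ * relevance - (1 - lambda_) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

def unit(degrees: float) -> List[float]:
    """Unit vector at the given angle, used to build toy embeddings."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]

QUERY = unit(0)
DOCS = [unit(10), unit(12), unit(-40)]  # docs 0 and 1 are near-duplicates

print(mmr(QUERY, DOCS, k=2, lambda_=0.5))  # favours diversity
print(mmr(QUERY, DOCS, k=2, lambda_=1.0))  # pure relevance, no diversity penalty
```

With `lambda_=0.5` the near-duplicate is skipped in favour of the less similar document; with `lambda_=1.0` the selection falls back to plain relevance ranking.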

STAGE 3

Modular RAG

A paradigm that breaks the pipeline into modular units instead of a fixed sequence, composing them dynamically to match the task.

🔀

Router

Classifies the question type and decides whether retrieval is needed and which source to use. Simple questions go directly to LLM generation.

⚖️

Judge / Critic

Evaluates whether search results are sufficient. If insufficient, triggers re-retrieval or switches to a different source.

🧠

Adaptive Retrieval

The LLM itself judges "Is retrieval needed right now?" Reduces unnecessary retrieval to improve efficiency.

🔗

Multi-Source

Dynamically selects from multiple sources: Vector DB, web search, SQL DB, APIs, and more.

🔄

Iterative Retrieval

Repeats the retrieve-generate cycle multiple times, not just once, to refine the answer.

💾

Memory

Leverages previous conversations and search history to maintain context.

Key Implementation Patterns

Self-RAG

Asai et al., ICLR 2024

The LLM uses Reflection Tokens to self-assess whether retrieval is needed, document relevance, and response quality.

Question input → [Retrieve] token?

Yes: Perform retrieval → [IsRel] Relevant? → Generate → [IsSup] Supported?

No: Generate directly

Key insight: Instead of retrieving for every question, the LLM only retrieves when needed. This reduces noise from unnecessary retrieval.

CRAG (Corrective RAG)

Yan et al., 2024

Classifies search results as Correct / Incorrect / Ambiguous and selects different correction paths accordingly.

Perform retrieval → Retrieval Evaluator

Correct: Refine documents, then generate

Incorrect: Fall back to web search

Ambiguous: Combine documents + web search

Results: Self-CRAG reported accuracy improvements of 19% to 37% on benchmarks.
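The three-way branching can be sketched as a small decision function. The evaluator scores, the threshold values, and the rule used here (Correct if any document scores above the upper threshold, Incorrect if all fall below the lower one) are assumptions for illustration; a real system would obtain the scores from a trained retrieval evaluator.

```python
from typing import Dict, List

# Correction path per verdict, as described above.
ACTIONS: Dict[str, str] = {
    "correct": "refine documents, then generate",
    "incorrect": "fall back to web search",
    "ambiguous": "combine documents + web search",
}

def classify_retrieval(scores: List[float], upper: float = 0.7, lower: float = 0.3) -> str:
    """Map evaluator confidence scores (one per retrieved document) to a verdict.

    The thresholds are hypothetical and would be tuned per dataset.
    """
    if any(s >= upper for s in scores):
        return "correct"
    if all(s < lower for s in scores):  # also true when nothing was retrieved
        return "incorrect"
    return "ambiguous"

def choose_action(scores: List[float]) -> str:
    return ACTIONS[classify_retrieval(scores)]

print(choose_action([0.9, 0.2]))  # at least one confident hit
print(choose_action([0.1, 0.2]))  # nothing useful retrieved
print(choose_action([0.5, 0.2]))  # uncertain
```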

Graph RAG

Microsoft, 2024

Builds a knowledge graph and performs retrieval based on entity relationships. Excels at "global questions" that connect information scattered across multiple documents.

Key point: Generates hierarchical community summaries, enabling answers to broad questions like "What are the overall trends in this industry?"

RAPTOR

Sarthi et al., ICLR 2024

Recursive tree summarization — Repeatedly clusters and summarizes document chunks to build a multi-level abstraction tree.

Root: Overall Summary

Middle level: Cluster Summary A, Cluster Summary B

Leaves: Chunk 1, Chunk 2, Chunk 3, Chunk 4

Results: Paired with GPT-4, RAPTOR reported a 20% absolute accuracy improvement on the QuALITY benchmark. Detailed questions retrieve from leaf nodes; summary questions retrieve from upper nodes.

NEXT LEVEL

Agentic RAG

An AI Agent uses retrieval as a tool while running a plan-execute-reflect loop.
Highly effective for complex multi-step questions.

Multi-Agent Architecture

Specialized agents for retrieval, summarization, evaluation, and more collaborate by role. You can also deploy modality-specific expert agents (text, image, table, etc.).

Tool Use

Beyond vector search, the agent selects and uses various tools like web search, SQL queries, calculators, and code execution as needed.

Adaptive Retrieval

Not every question triggers retrieval. The agent assesses question complexity: simple questions get immediate answers, while only complex questions go through multi-step retrieval.

"What is Python?" → Answer directly

"Analyze our Q3 revenue compared to the same period last year" → Multi-step retrieval

Speculative RAG

Additional Technique

A small specialist model generates multiple drafts in parallel, and a larger generalist model verifies them. The Draft-then-Verify pattern improves both accuracy and latency.

Sufficient Context

Added 2026.06

A formalization of the agent loop's Reflect step, where retrieval is re-run if the context is insufficient. The key question is whether the context gathered so far is enough to answer. "Relevant" and "sufficient" are not the same: you can retrieve relevant documents and still be missing the information needed for a definitive answer.

The RAG paradox: adding context makes the model overconfident, so instead of abstaining when it doesn't know, it hallucinates. In one study, Claude 3.5 Sonnet's abstention rate dropped from 84% to 52% with RAG (Sufficient Context, ICLR 2025). The fix: use Gemini 1.5 Pro as a sufficiency autorater (~93% accuracy) to answer only when context is sufficient and re-retrieve or abstain otherwise — selective generation, which Google productized as the stopping condition of the agent loop in Gemini Enterprise (2026). For a full walkthrough of the mechanism and a 4-step hands-on recipe, see blog #34.
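The selective generation loop can be sketched as follows. This shows only the control flow: `fake_retrieve`, `fake_is_sufficient`, and `fake_generate` are hypothetical stand-ins for a retriever, an LLM-based sufficiency autorater, and the answering model, and returning `None` to signal abstention is a design choice made for this sketch.

```python
from typing import Callable, List, Optional

def selective_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    generate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Answer only once the gathered context is judged sufficient; otherwise abstain (None)."""
    context: List[str] = []
    for round_number in range(max_rounds):
        context.extend(retrieve(question, round_number))
        if is_sufficient(question, context):
            return generate(question, context)
    return None  # abstain rather than guess from insufficient context

# Hypothetical stand-ins so the sketch runs end to end.
BATCHES = [
    ["Leave policy overview."],                              # relevant but not sufficient
    ["Employees accrue 15 days of annual leave per year."],  # contains the needed fact
]

def fake_retrieve(question: str, round_number: int) -> List[str]:
    return BATCHES[round_number] if round_number < len(BATCHES) else []

def fake_is_sufficient(question: str, context: List[str]) -> bool:
    # A real autorater would be an LLM judging the question and context together.
    return any("15 days" in passage for passage in context)

def fake_generate(question: str, context: List[str]) -> str:
    return f"Answer based on {len(context)} passages."

print(selective_answer("How many leave days do I get?", fake_retrieve, fake_is_sufficient, fake_generate))
```

The first round returns a relevant but insufficient passage, so the loop retrieves again before answering; with `max_rounds=1` the same call abstains.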

EVALUATION

Evaluating RAG

How do we measure the quality of a RAG system? We evaluate retrieval and generation separately.

F

Faithfulness

Is the response faithful to the retrieved documents? Measures whether the answer is based solely on document content without fabrication.

Faithfulness = Claims supported by documents / Total claims

R

Answer Relevancy

Is the response relevant to the question? Evaluates whether irrelevant content is included in the answer.

Relevancy = Average similarity between the original question and questions generated from the response

P

Context Precision

Are the retrieved documents precise? Checks whether too many irrelevant documents are included.

Precision = Relevant documents / Total retrieved documents

C

Context Recall

Were all necessary documents retrieved without omission? Checks whether all information needed for the correct answer is included in the search results.

Recall = Retrieved relevant documents / Total relevant documents
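The ratio formulas for Faithfulness, Context Precision, and Context Recall can be computed directly once the judgments exist. The sketch below assumes those judgments are already available as plain labels (which claims are supported, which documents are relevant); evaluation frameworks typically estimate them with LLM judges. The document IDs and labels are made up.

```python
from typing import List, Set

def faithfulness(claim_supported: List[bool]) -> float:
    """Claims supported by documents / total claims."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def context_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Relevant documents among those retrieved / total retrieved documents."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Relevant documents that were retrieved / total relevant documents."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)

# Hypothetical judgments for one question.
retrieved_docs = ["A", "B", "C", "D"]
relevant_docs = {"A", "C", "E"}          # E was needed but never retrieved
claims = [True, True, False, True]       # one claim in the answer is unsupported

print(f"faithfulness={faithfulness(claims):.2f}")
print(f"precision={context_precision(retrieved_docs, relevant_docs):.2f}")
print(f"recall={context_recall(retrieved_docs, relevant_docs):.2f}")
```

Low precision with high recall points to noisy retrieval; the reverse points to missing documents, which is why the two are reported separately.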

Evaluation Tools

RAGAS

A widely used open-source evaluation framework. Automatically computes core metrics such as Faithfulness, Answer Relevancy, and Context Precision/Recall.

Open Source

DeepEval

Unit-test style evaluation that integrates into CI/CD pipelines. Automates RAG evaluation just like pytest.

CI/CD Integration

TruLens

Real-time monitoring for production environments. Enables continuous quality tracking with feedback functions.

Production Monitoring

RAGBench

A benchmark of about 100K examples drawn from 12 component datasets spanning five industry domains. Used for comparing RAG performance across industries.

Benchmark

PRACTICE

Practical Guide

Recommendations and a roadmap for applying RAG systems in production.

Recommended Production Stack

Application: LangChain / LlamaIndex / Custom

LLM: Claude / GPT-4 / Gemini / Open-source

Reranker: Cohere Rerank / bge-reranker / FlashRank

Embedding: Cohere embed-v4 / OpenAI text-embedding-3 / BGE / E5

Search: Hybrid (Dense + BM25)

Vector DB: Pinecone / Weaviate / Qdrant / pgvector / Chroma

Step-by-Step Adoption Roadmap

1

MVP: Naive RAG

Build a basic pipeline. Fixed-size chunking + simple vector search + LLM generation. The goal is to quickly prove value.

2

Quality Boost: Hybrid + Reranker

Introduce BM25 + Dense hybrid search. Add a Reranker. This alone can deliver significant quality improvements.

3

Chunking Optimization

Apply Recursive/Semantic chunking. Add context to chunks with Contextual Retrieval. Metadata tagging.

4

Build an Evaluation Framework

Build an automated evaluation pipeline with RAGAS and similar tools. Measure improvement effects with quantitative metrics.

5

Advanced: Modular / Agentic

When multi-step questions and multi-source scenarios are needed. Introduce routers, iterative retrieval, and agent patterns.

Chunking Strategy Comparison

Strategy	Method	Pros	Cons	Recommended For
Fixed-size	Split by fixed token count	Simple to implement, predictable	Context breakage	Quick prototypes
Recursive	Recursively split using separator hierarchy	Strong baseline · evaluate locally	Requires separator configuration	General purpose (default recommendation)
Semantic	Split at semantic shift points	High recall	Chunks may be too small	Multi-topic documents
Parent-Child	Search with small chunks, return large chunks	Precise search + rich context	Increased index complexity	Long documents, reports
Sentence Window	Sentence-level + surrounding sentences	Sentence-level precision	Inefficient for short documents	FAQs, manuals

💡

256 to 512 Tokens Is a Good Starting Point

Chunk sizes of 256 to 512 tokens tend to be the most stable starting point. Too small means insufficient context; too large increases noise. Tune the size against your own evaluation set.

🎯

Optimize Retrieval First

Before switching LLMs, diagnose retrieval quality. Hold the model fixed and evaluate retrieval changes on your own questions.

📊

No Measurement, No Improvement

Build the evaluation pipeline first, and compare metrics with every change. Make decisions based on data, not intuition.

🔄

RAG is a Product

It's not a one-time build. Set KPIs and continuously improve. Update documents, swap models, tune pipelines.

REFERENCES

Key Papers Timeline

A chronological overview of the essential papers in RAG research.

2020

RAG: Retrieval-Augmented Generation

Lewis et al. — NeurIPS 2020

Combines DPR + BART. Proposed two variants: RAG-Sequence and RAG-Token. The starting point of RAG research.

2020

Dense Passage Retrieval (DPR)

Karpukhin et al. — EMNLP 2020

The standard for dense vector-based document retrieval. Achieved semantic search surpassing traditional TF-IDF/BM25.

2022

HyDE: Hypothetical Document Embeddings

Gao et al., 2022

Generates a hypothetical answer and retrieves with its embedding instead of using the question directly. Reports zero-shot retrieval performance comparable to fine-tuned retrievers.

2023

Self-RAG: Self-Reflective RAG

Asai et al. — ICLR 2024 (Oral)

Uses Reflection Tokens for the LLM to self-assess retrieval necessity, document relevance, and response quality.

2024

CRAG: Corrective RAG

Yan et al., 2024

Classifies retrieval results as Correct/Incorrect/Ambiguous and selects correction paths. 19% to 37% accuracy improvement.

2024

RAPTOR: Recursive Abstractive Processing

Sarthi et al. — ICLR 2024

Builds a multi-level abstraction tree through recursive clustering + summarization. With GPT-4, a 20% absolute accuracy improvement on QuALITY.

2024

GraphRAG

Microsoft Research, 2024

Knowledge graph + hierarchical community summaries. Excels at global questions spanning multiple documents.

2024

Contextual Retrieval

Anthropic, 2024

Adds document-level context as a prefix to chunks. Contextual embeddings combined with contextual BM25 reduced the retrieval failure rate by 49%, and by 67% when reranking was added.

2024

Adaptive RAG

Jeong et al., 2024

Dynamically selects No Retrieval / Single-step / Multi-step based on question complexity.

2024

Late Chunking

Jina AI, 2024

Embeds the full document first, then pools the token vectors within each chunk's span to get chunk embeddings. Preserves context without additional LLM cost. API support in jina-embeddings-v3.

2024

RAFT (Retrieval Augmented Fine-Tuning)

UC Berkeley, 2024

Combines RAG + Fine-tuning. Trains the model to ignore distractor documents. Outperforms pure RAG in specialized domains.

2025

SimRAG (Self-Improving RAG)

NAACL 2025

Generates self-training QA pairs from unlabeled corpora. Domain adaptation without labeling costs. Validated on 11 datasets.

2025

Sufficient Context: A New Lens on RAG

Joren et al. — ICLR 2025

Reframes retrieval quality as "sufficiency" rather than "relevance." Proposes an autorater (Gemini 1.5 Pro, ~93% accuracy) that judges whether context is sufficient, plus selective generation that abstains when it isn't. Shows the paradox that RAG can erode a model's ability to abstain.

2025

MCP (Model Context Protocol)

Anthropic, announced in November 2024

An open standard protocol connecting AI models with external tools and data. Standardizes tool integration for Agentic RAG.

2026

Context Engineering & Knowledge Runtime

Industry Paradigm Shift

The paradigm expands from RAG to Context Engineering. The Knowledge Runtime concept emerges, integrating RAG (static knowledge) + Memory (dynamic history) + MCP (tool connectivity).


RAG = The Solution to All These Problems
