Retrieval-Augmented Generation

RAG
Complete Guide

The technique that lets LLMs take an open-book exam.
By augmenting knowledge through retrieval, it produces more accurate and trustworthy AI responses.

1,200+ RAG papers published in 2024
50%+ Accuracy gain from retrieval optimization alone
3 Gens Naive → Advanced → Modular

Why Do We Need RAG?

LLMs are powerful, but they have fundamental limitations that are hard to overcome on their own.

Hallucination

LLMs fabricate plausible-sounding answers even when the information is not in their training data. Generating wrong answers confidently, with no fact-checking step, is one of their biggest weaknesses.

Knowledge Cutoff

LLMs don't know information after their training data cutoff date. They cannot access yesterday's news, latest API changes, or real-time data.

Lack of Domain Knowledge

LLMs cannot answer questions about private data such as internal company documents, and they are often weak on specialized medical or legal knowledge. Fine-tuning to close that gap is costly and time-consuming.

Opaque Sources

LLMs cannot cite their sources. They can't answer "Where did you get this information?", making it difficult to verify trustworthiness.

RAG = The Solution to All These Problems

By retrieving external knowledge at query time and feeding it to the LLM, RAG reduces hallucinations, reflects the latest information, and lets the response cite its sources.

What is RAG?

Retrieval-Augmented Generation — generation augmented by retrieval.
Before the LLM generates an answer, the system retrieves relevant documents and supplies them as reference material.

📝

Standard LLM

Closed-book exam
Relies only on memorized knowledge

VS
📖

RAG

Open-book exam
Answers while consulting reference materials

Core Pipeline

Query: User question
Retrieve: Search relevant documents
Augment: Insert into prompt
Generate: LLM generates response
01

Embedding

Converting text into numerical vectors. Since "cat" and "kitten" have similar meanings, they are positioned close together in vector space. This enables semantic search rather than keyword matching.

cat
kitten
car
vehicle
puppy
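
To make "close together in vector space" concrete, here is a minimal sketch. The 3-dimensional vectors are hand-picked for illustration and do not come from any real embedding model; `cosine_similarity` and `nearest` are helper names introduced here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (assumption: real models emit hundreds of dimensions).
toy_embeddings = {
    "cat":     [0.90, 0.80, 0.10],
    "kitten":  [0.85, 0.90, 0.05],
    "puppy":   [0.70, 0.90, 0.20],
    "car":     [0.10, 0.05, 0.90],
    "vehicle": [0.15, 0.10, 0.95],
}

def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in toy_embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(toy_embeddings[word], toy_embeddings[w]))
```

With these toy vectors, `nearest("cat")` picks "kitten" and `nearest("car")` picks "vehicle", which is the behavior semantic search relies on.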
02

Vector Database

A specialized DB that stores embedding vectors and performs fast similarity search. It finds and returns the document vectors closest to the query vector.

Pinecone, Weaviate, Chroma, Qdrant, pgvector
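
As a rough mental model of what these products do, the sketch below stores vectors in memory and scans all of them for every query. `TinyVectorStore` is a hypothetical class written for this guide, not the API of any database listed above, and the document vectors are made up.

```python
import heapq
import math

class TinyVectorStore:
    """Brute-force store: keeps unit-normalized vectors and scans all of them per query."""

    def __init__(self):
        self._items = []  # list of (doc_id, unit_vector)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def add(self, doc_id, vec):
        self._items.append((doc_id, self._normalize(vec)))

    def search(self, query_vec, k=2):
        """Return the k (doc_id, cosine_score) pairs closest to the query."""
        q = self._normalize(query_vec)
        scored = ((doc_id, sum(a * b for a, b in zip(q, v))) for doc_id, v in self._items)
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])

store = TinyVectorStore()
store.add("leave-policy", [0.9, 0.1, 0.0])
store.add("salary-guide", [0.1, 0.9, 0.1])
store.add("benefits", [0.3, 0.6, 0.4])
hits = store.search([0.0, 1.0, 0.1], k=2)
```

A full scan like this is only practical for small collections; production vector databases typically use approximate nearest-neighbor indexes such as HNSW so that search stays fast as the collection grows.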
03

Semantic Search

Searching by semantic similarity rather than keyword matching. Searching for "how to get a raise" can also find "salary negotiation strategies". Cosine similarity scores how closely two vectors point in the same direction; a higher score means closer meaning.

"how to get a raise"
Salary negotiation strategies 0.92
Compensation framework guide 0.85
Company benefits overview 0.61
Original Paper Lewis et al. (2020) — "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"

Proposed two variants: RAG-Sequence and RAG-Token. Uses DPR (Dense Passage Retrieval) for retrieval and BART for generation. Presented at NeurIPS 2020, marking the starting point of RAG research.

The Evolution of RAG

RAG has evolved across three generations. Let's examine how each generation overcomes the limitations of its predecessor.

2020~2022

Naive RAG

A fixed pipeline. Query → Retrieve → Generate. Simple but with clear limitations.

Query → Retrieve → Generate
2023~2024

Advanced RAG

Optimizes the stages before, during, and after retrieval. Adds query rewriting, hybrid search, reranking, and more.

Query Optimization → Hybrid Retrieve → Rerank → Generate
2024~

Modular RAG

Modular composition. Dynamically configures routers, evaluators, iterative retrieval, and more to match the task.

Router → [Dynamic Module Composition] → Self-Evaluation → Iterate/Complete

Naive RAG

The most basic "Retrieve → Read" pipeline. The starting point of RAG, but with clear limitations.

Figure: Naive RAG pipeline
Indexing phase (offline): Documents (PDF, DB, Web, API) → Chunking (fixed-size splitting into Chunk 1, Chunk 2, ... Chunk N) → Embedding Model (text → vector conversion, e.g. [0.12, -0.45, 0.78, 0.33, -0.91, ...]) → Vector DB
Query phase (online): User Query (user question input) → Query Embedding (convert query to vector) → Similarity Search (cosine similarity, Top-K vector search) → Prompt + Context (combine query + retrieved docs) → LLM (generate response) → Response
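
The two phases can be sketched end to end in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in rather than a neural embedding model, the index is a plain list instead of a vector DB, and the final LLM call is left out. All function names and the sample documents are illustrative.

```python
import math
import re
from collections import Counter

def chunk_fixed(text, size=12):
    """Fixed-size chunking: split into runs of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Stand-in embedding: a bag-of-words count vector (assumption, not a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents):
    """Indexing phase (offline): chunk every document and embed each chunk."""
    chunks = [c for doc in documents for c in chunk_fixed(doc)]
    return [(c, embed(c)) for c in chunks]

def retrieve(index, query, k=2):
    """Query phase (online): embed the query and return the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, contexts):
    """Augment: combine the retrieved chunks and the question into one prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n\nContext:\n{context_block}\n\nQuestion: {query}"

docs = [
    "Employees accrue fifteen days of annual leave per year.",
    "The cafeteria opens at eight and closes at three.",
]
index = build_index(docs)
question = "How many days of annual leave do employees get?"
prompt = build_prompt(question, retrieve(index, question, k=1))
# `prompt` would then be sent to whichever LLM client you use (not shown).
```

Note that nothing here checks whether the retrieved chunk is actually relevant, which is exactly the weakness the next section describes.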

Limitations of Naive RAG

01

Garbage In, Garbage Out

When retrieval quality is poor, the response degrades too. If irrelevant documents are retrieved, the LLM generates incorrect answers based on them.

02

Simplistic Chunking

Splitting documents by fixed size cuts context or mixes multiple topics into a single chunk, creating noise.

03

Indiscriminate Passing

Retrieved documents are passed to the LLM as-is even when irrelevant to the query. There is no filtering or quality assessment.

04

No Dedup/Conflict Handling

There is no logic to handle duplicate documents or contradictory information, which can confuse the LLM.

Advanced RAG

Systematically addresses the limitations of Naive RAG at each stage: before, during, and after retrieval.

Figure: Advanced RAG pipeline
Pre-retrieval: User Query (original question) → Query Rewriting (optimize / decompose query; HyDE, Multi-Query) → Optimized Query
Retrieval: Hybrid Search over Vector DB + BM25 Index (multi-index strategy), combining Dense (vector, semantic similarity) and Sparse (BM25, keyword matching) → RRF Fusion → Top-K Documents
Post-retrieval: Reranker (Cross-Encoder reranking) → Compression (extract relevant parts only) → Dedup & Filter (remove duplicates, MMR) → Refined Context
Generate: Prompt (Q + Context) → LLM → Response
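
The RRF Fusion step in the retrieval stage merges the dense and sparse rankings without needing their raw scores to be comparable. A minimal sketch, assuming each retriever returns an ordered list of document ids; the ids below are made up, and `k=60` is the constant commonly used for Reciprocal Rank Fusion.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; score = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search ranking
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear near the top of both lists (here `doc_a` and `doc_c`) rise above documents that only one retriever found.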

Query Rewriting

Transforms user questions into forms optimized for retrieval. Converts colloquial language into search queries, splits compound questions into single queries, etc.

Before: "How do I use my vacation days?"
After: "Annual leave usage procedures and policies"

HyDE Gao et al., 2022

Hypothetical Document Embeddings — The LLM first generates a hypothetical answer, then uses its embedding for retrieval. The insight is that answers are more similar to documents than questions are.

Question → Generate hypothetical answer → Search using answer embedding
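
The three steps can be sketched as follows. Everything here is a stand-in: `fake_llm` returns a canned draft where a real system would prompt a model, `embed` is a bag-of-words toy rather than a neural embedding, and the corpus is invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: bag-of-words counts (assumption, not a neural model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(question):
    """Hypothetical LLM call; a real system would prompt a model for a draft answer."""
    return "Submit an annual leave request form to your manager for approval."

def hyde_search(question, corpus, generate=fake_llm):
    hypothetical = generate(question)   # 1. draft a hypothetical answer
    query_vec = embed(hypothetical)     # 2. embed the draft, not the question
    return max(corpus, key=lambda doc: cosine(query_vec, embed(doc)))  # 3. nearest document

corpus = [
    "Annual leave policy: file a leave request form and wait for manager approval.",
    "Parking permits are issued by the facilities team each January.",
]
question = "How do I use my vacation days?"
best = hyde_search(question, corpus)
```

In this toy setup the raw question shares no words with the leave policy, while the hypothetical answer shares several, which illustrates why the draft can be a better search key than the question.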

Query Expansion

Expands a single question from multiple perspectives to broaden the search scope. Techniques include Multi-Query, Sub-Question Decomposition, and more.

Advanced Chunking Strategies

Splits documents by semantic units rather than fixed sizes.

Recursive: Recursively split using a separator hierarchy. Most stable (69% win rate)
Semantic: Split at semantic shift points. High recall
Parent-Child: Search with small chunks, provide context with large chunks
Sentence Window: Sentence-level search + include surrounding sentences
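
Of the strategies above, recursive splitting is the easiest to sketch. The version below is a simplified illustration, not the implementation from any particular library: it measures size in characters rather than tokens, and it drops the separator at chunk boundaries. The function name, separator list, and sample text are assumptions.

```python
def recursive_split(text, max_chars=80, separators=("\n\n", "\n", ". ", " ")):
    """Split `text` into pieces of at most `max_chars`, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, max_chars, rest)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too big: recurse with finer separators.
            chunks.extend(recursive_split(part, max_chars, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = (
    "Intro paragraph about leave.\n\n"
    "Section one explains accrual. It also covers carry-over rules for unused days.\n\n"
    "End."
)
chunks = recursive_split(sample, max_chars=60)
```

Paragraph breaks are tried first, so short paragraphs stay whole; only the oversized middle paragraph is split further at sentence boundaries.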

Contextual Retrieval Anthropic, 2024

Adds document-level context as a prefix to each chunk before embedding. By prepending context like "This chunk is excerpted from the Q3 performance section of the 2024 revenue report," retrieval accuracy improves significantly.
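
A minimal sketch of the prefixing step. In the published approach an LLM writes the situating context for each chunk; here a fixed template stands in for that call, and `contextualize`, its parameters, and the sample chunk text are hypothetical.

```python
def contextualize(chunk, doc_title, section, summarize=None):
    """Prepend a short situating sentence to a chunk before it is embedded."""
    if summarize is not None:
        context = summarize(chunk, doc_title, section)  # e.g. an LLM call (hypothetical)
    else:
        context = f"This chunk is excerpted from the {section} section of {doc_title}."
    return f"{context}\n{chunk}"

chunk = "Revenue grew 3% over the previous quarter."  # made-up chunk text
indexed_text = contextualize(chunk, "the 2024 revenue report", "Q3 performance")
```

The string in `indexed_text` is what gets embedded and indexed; on its own, the raw chunk never says which company, report, or quarter it refers to.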

Metadata Tagging

Attaches supplementary information such as date, source, category, and author to documents. When combined with filtering, retrieval precision increases significantly.

Fine-tuned Embedding

Further trains embedding models with domain-specific data. Reports show 12~30% performance improvement over general models. Highly effective in specialized domains like medical, legal, and finance.

Multi-Index Strategy

Layers summary indexes, full-text indexes, and metadata indexes. First narrows down relevant documents via summaries, then retrieves details from full text.

Late Interaction (ColBERT)

Stores per-token embeddings and computes similarity via MaxSim. Achieves a good balance between Bi-Encoder speed and Cross-Encoder accuracy. ColPali/ColQwen extend this to multimodal.
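
MaxSim is simple to state in code: each query token vector is matched to its best-scoring document token vector, and those best scores are summed. The 2-dimensional vectors below are hand-made for illustration; real ColBERT token embeddings are learned by the model.

```python
def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token vector, take its best
    dot-product match among the document token vectors, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy token vectors (assumption: stand-ins for learned per-token embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]  # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # covers only the first query token
score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

Here `doc_a` scores higher because every query token finds a good match in it, whereas `doc_b` has nothing close to the second query token.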

Document Compression

Extracts only relevant portions from retrieved documents to save context window space. Methods like LongLLMLingua retain only key sentences.

Deduplication & Filtering

Removes duplicate documents and ensures diversity using similarity thresholds, MMR (Maximal Marginal Relevance), and other techniques.
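
MMR can be sketched as a greedy loop that rewards relevance to the query and penalizes similarity to documents already selected. The vectors, document ids, and the `lam` default below are illustrative assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lam=0.5):
    """Greedy MMR: pick docs that are relevant to the query but not redundant
    with already-selected docs. `lam` trades relevance (1.0) against diversity (0.0)."""
    selected, remaining = [], list(doc_vecs)
    while remaining and len(selected) < k:
        def score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max((cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

docs = {
    "raise_guide":      [1.0, 0.30],
    "raise_guide_copy": [1.0, 0.28],  # near-duplicate of the first
    "benefits":         [0.6, 0.80],
}
query_vec = [1.0, 0.35]
picked = mmr(query_vec, docs, k=2, lam=0.5)
```

With `lam=0.5` the near-duplicate is skipped in favor of the less similar "benefits" document; with `lam=1.0` the loop degenerates to plain relevance ranking and returns both copies.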

Modular RAG

A paradigm that breaks the pipeline into modular units instead of a fixed sequence, composing them dynamically to match the task.

Figure: Modular RAG flow
Query (user question) → Router (classify question type, determine path)
Simple query: Direct LLM (no retrieval needed → generate directly)
Retrieval path: Retrieval (multi-source search: Vector / Web / SQL) → Judge (sufficient? quality assessment)
Judge outcomes: Pass → Generate; Insufficient → re-retrieve; Ambiguous → Web Search (auxiliary source)
Generate (LLM response generation + attach sources) → Self Eval (quality OK?) → Response, or Regenerate
Complex query: Iterative Retrieval + Memory (multi-step retrieve-generate loop)
Legend: pass path, retry loop, correction path
🔀

Router

Classifies the question type and decides whether retrieval is needed and which source to use. Simple questions go directly to LLM generation.

⚖️

Judge / Critic

Evaluates whether search results are sufficient. If insufficient, triggers re-retrieval or switches to a different source.

🧠

Adaptive Retrieval

The LLM itself judges "Is retrieval needed right now?" Reduces unnecessary retrieval to improve efficiency.

🔗

Multi-Source

Dynamically selects from multiple sources: Vector DB, web search, SQL DB, APIs, and more.

🔄

Iterative Retrieval

Repeats the retrieve-generate cycle multiple times, not just once, to refine the answer.

💾

Memory

Leverages previous conversations and search history to maintain context.
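
One way to picture how these modules compose is a small control loop: the router decides whether to retrieve at all, and the judge decides whether to try another source. This is a sketch under stated assumptions; every component below (`router`, `vector_search`, `web_search`, `judge`, `generate`) is a hypothetical stand-in, and the word-count routing rule is a placeholder for a real classifier.

```python
def answer(query, router, retrievers, judge, generate, max_rounds=3):
    """Route the query, retrieve from the chosen sources, and move on to the
    next source whenever the judge finds the evidence insufficient."""
    if router(query) == "direct":
        return generate(query, [])      # no retrieval needed
    evidence = []
    for source in retrievers[:max_rounds]:
        evidence.extend(source(query))
        if judge(query, evidence):      # sufficient? then stop retrieving
            break
    return generate(query, evidence)

# Hypothetical stand-ins; real modules would call classifiers, indexes, and an LLM.
def router(q):
    return "direct" if len(q.split()) <= 3 else "retrieve"

def vector_search(q):
    return []                                   # pretend the vector DB had nothing

def web_search(q):
    return ["Q3 revenue rose year over year."]  # made-up snippet for the demo

def judge(q, evidence):
    return len(evidence) > 0

def generate(q, evidence):
    return {"answer": f"(draft answer to: {q})", "sources": evidence}

sources = [vector_search, web_search]
simple = answer("What is Python?", router, sources, judge, generate)
multi_step = answer("Analyze our Q3 revenue compared to last year", router, sources, judge, generate)
```

Because each module is just a callable, swapping in a different router or adding a source changes the behavior without touching the loop itself.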

Key Implementation Patterns

Self-RAG

Asai et al., ICLR 2024

The LLM uses Reflection Tokens to self-assess whether retrieval is needed, document relevance, and response quality.

Question input → [Retrieve] token?
Yes: Perform retrieval → [IsRel] Relevant? → Generate → [IsSup] Supported?
No: Generate directly
Key insight: Instead of retrieving for every question, the LLM only retrieves when needed. This reduces noise from unnecessary retrieval.

CRAG (Corrective RAG)

Yan et al., 2024

Classifies search results as Correct / Incorrect / Ambiguous and selects different correction paths accordingly.

Perform retrieval → Retrieval Evaluator
Correct: Refine documents, then generate
Incorrect: Fall back to web search
Ambiguous: Combine documents + web search

Results: Self-CRAG reported accuracy improvements of 19~37% on benchmarks.
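
The three correction paths can be sketched as a threshold rule over the retrieval evaluator's per-document scores. The threshold values and function names are illustrative assumptions rather than settings from the paper, and `web_search` is a hypothetical callable.

```python
def crag_action(scores, upper=0.7, lower=0.3):
    """Map retrieval-evaluator scores (one per document, in [0, 1]) to a CRAG action.
    The thresholds are illustrative assumptions, not values from the paper."""
    if any(s >= upper for s in scores):
        return "correct"      # at least one document is confidently relevant
    if all(s < lower for s in scores):
        return "incorrect"    # every document is confidently irrelevant
    return "ambiguous"

def build_context(action, documents, web_search):
    """Choose the knowledge passed to the generator for each action."""
    if action == "correct":
        return documents                 # refine documents, then generate
    if action == "incorrect":
        return web_search()              # fall back to web search
    return documents + web_search()      # combine documents + web search
```

For example, scores of `[0.9, 0.2]` yield "correct", `[0.1, 0.2]` yield "incorrect", and `[0.5, 0.1]` yield "ambiguous".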

Graph RAG

Microsoft, 2024

Builds a knowledge graph and performs retrieval based on entity relationships. Excels at "global questions" that connect information scattered across multiple documents.

Figure: example knowledge graph with nodes Co. A, Tech X, Mkt Y, Prod B, Comp C
Key point: Generates hierarchical community summaries, enabling answers to broad questions like "What are the overall trends in this industry?"

RAPTOR

Sarthi et al., ICLR 2024

Recursive tree summarization — Repeatedly clusters and summarizes document chunks to build a multi-level abstraction tree.

Figure: RAPTOR tree. Root: Overall Summary. Middle level: Cluster Summary A, Cluster Summary B. Leaves: Chunk 1, Chunk 2, Chunk 3, Chunk 4.
Results: Paired with GPT-4, RAPTOR improved the best reported result on the QuALITY benchmark by 20% in absolute accuracy. Detailed questions retrieve from leaf nodes; summary questions retrieve from upper nodes.
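
The tree construction can be sketched as a loop that summarizes groups of nodes until a single root remains. Two simplifications are assumed here: adjacent nodes are grouped in fixed-size batches, whereas RAPTOR clusters nodes by embedding similarity, and `toy_summarize` is a hypothetical stand-in for an LLM summarization call.

```python
def build_summary_tree(chunks, summarize, group_size=2):
    """Return a list of levels: level 0 is the raw chunks, the last level is a single root.
    Grouping adjacent nodes is a simplifying assumption; RAPTOR clusters by embedding."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels

def toy_summarize(texts):
    """Hypothetical summarizer: a real system would ask an LLM for an abstractive summary."""
    return "summary(" + " + ".join(texts) + ")"

levels = build_summary_tree(["chunk1", "chunk2", "chunk3", "chunk4"], toy_summarize)
```

All levels are indexed together, so a query can match a leaf chunk for a detailed question or a higher-level summary for a broad one.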

Agentic RAG

An AI Agent uses retrieval as a tool while running a plan-execute-reflect loop.
Highly effective for complex multi-step questions.

Figure: Agentic RAG architecture
Flow: Query (complex question) → Agent Orchestrator → Response (final answer + sources) once complete
Agent Orchestrator: Plan (decompose subtasks), Reason (judge & select), Reflect (self quality assessment)
Tools: Vector Search (vector DB search, semantic matching), Web Search (internet search, real-time info), SQL Query (structured data, table queries), Code Exec (code execution, computation & analysis), API Call (external services, data integration)
Memory: Conversation history, search cache / intermediate results
Iteration loop: Step 1: Gather information → Step 2: Analyze & reason → Step N: Refine answer; re-run if insufficient

Multi-Agent Architecture

Specialized agents for retrieval, summarization, evaluation, and more collaborate by role. You can also deploy modality-specific expert agents (text, image, table, etc.).

Tool Use

Beyond vector search, the agent selects and uses various tools like web search, SQL queries, calculators, and code execution as needed.

Adaptive Retrieval

Not every question triggers retrieval. The agent assesses question complexity: simple questions get immediate answers, while only complex questions go through multi-step retrieval.

"What is Python?" → Answer directly
"Analyze our Q3 revenue compared to the same period last year" → Multi-step retrieval

Speculative RAG

Additional Technique

A small specialist model generates multiple drafts in parallel, and a larger generalist model verifies them. The Draft-then-Verify pattern improves both accuracy and latency.

Evaluating RAG

How do we measure the quality of a RAG system? We evaluate retrieval and generation separately.

F

Faithfulness

Is the response faithful to the retrieved documents? Measures whether the answer is based solely on document content without fabrication.

Faithfulness = Claims supported by documents / Total claims
R

Answer Relevancy

Is the response relevant to the question? Evaluates whether irrelevant content is included in the answer.

Relevancy = Average similarity between the original question and questions generated back from the response
P

Context Precision

Are the retrieved documents precise? Checks whether too many irrelevant documents are included.

Precision = Relevant documents / Total retrieved documents
C

Context Recall

Were all necessary documents retrieved without omission? Checks whether all information needed for the correct answer is included in the search results.

Recall = Retrieved relevant documents / Total relevant documents
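
Three of the four ratios above reduce to simple counting once the judgments exist. The sketch below assumes those judgments (which claims are supported, which documents are relevant) are already available as inputs; in practice frameworks usually estimate them with an LLM judge. Answer Relevancy is omitted because it needs a question generator and an embedding model. Function names and sample ids are illustrative.

```python
def faithfulness(claim_supported):
    """claim_supported: list of booleans, one per claim in the answer."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def context_precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(set(retrieved)) if retrieved else 0.0

def context_recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(set(relevant)) if relevant else 0.0

retrieved = ["d1", "d2", "d3", "d4"]  # hypothetical retrieval result
relevant = ["d1", "d3", "d5"]         # hypothetical ground-truth labels
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
faith = faithfulness([True, True, False, True])     # 3 of 4 claims are supported
```

Tracking precision and recall separately shows whether a retrieval change is trading one for the other.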

Evaluation Tools

RAGAS

The most popular open-source evaluation framework. Automatically computes core metrics like Faithfulness, Relevancy, Context Precision/Recall.

Open Source

DeepEval

Unit-test style evaluation that integrates into CI/CD pipelines. Automates RAG evaluation just like pytest.

CI/CD Integration

TruLens

Real-time monitoring for production environments. Enables continuous quality tracking with feedback functions.

Production Monitoring

RAGBench

A benchmark of 100K examples drawn from 12 component datasets spanning five industry domains. Used for comparing RAG performance across industries.

Benchmark

Practical Guide

Recommendations and a roadmap for applying RAG systems in production.

Recommended Production Stack

Application: LangChain / LlamaIndex / Custom
LLM: Claude / GPT-4 / Gemini / Open-source
Reranker: Cohere Rerank / bge-reranker / FlashRank
Embedding: Cohere embed-v4 / OpenAI text-embedding-3 / BGE / E5
Vector DB: Pinecone / Weaviate / Qdrant / pgvector / Chroma

Step-by-Step Adoption Roadmap

1

MVP: Naive RAG

Build a basic pipeline. Fixed-size chunking + simple vector search + LLM generation. The goal is to quickly prove value.

2

Quality Boost: Hybrid + Reranker

Introduce BM25 + Dense hybrid search. Add a Reranker. This alone can deliver significant quality improvements.

3

Chunking Optimization

Apply Recursive/Semantic chunking. Add context to chunks with Contextual Retrieval. Tag documents with metadata.

4

Build an Evaluation Framework

Build an automated evaluation pipeline with RAGAS and similar tools. Measure improvement effects with quantitative metrics.

5

Advanced: Modular / Agentic

Adopt this stage when you need to handle multi-step questions and multi-source scenarios. Introduce routers, iterative retrieval, and agent patterns.

Chunking Strategy Comparison

Strategy | Method | Pros | Cons | Recommended For
Fixed-size | Split by fixed token count | Simple to implement, predictable | Context breakage | Quick prototypes
Recursive | Recursively split using separator hierarchy | Stable (69% win rate) | Requires separator configuration | General purpose (default recommendation)
Semantic | Split at semantic shift points | High recall | Chunks may be too small | Multi-topic documents
Parent-Child | Search with small chunks, return large chunks | Precise search + rich context | Increased index complexity | Long documents, reports
Sentence Window | Sentence-level + surrounding sentences | Sentence-level precision | Inefficient for short documents | FAQs, manuals
💡

256~512 Tokens is Optimal

Chunk sizes of 256~512 tokens tend to be the most stable: smaller chunks lose context, while larger chunks add noise.

🎯

Optimize Retrieval First

Before switching LLMs, improve retrieval quality. Improving retrieval alone with the same model can yield 50%+ accuracy improvement.

📊

No Measurement, No Improvement

Build the evaluation pipeline first, and compare metrics with every change. Make decisions based on data, not intuition.

🔄

RAG is a Product

It's not a one-time build. Set KPIs and continuously improve. Update documents, swap models, tune pipelines.

Key Papers Timeline

A chronological overview of the essential papers in RAG research.

2020

RAG: Retrieval-Augmented Generation

Lewis et al. — NeurIPS 2020

Combines DPR + BART. Proposed two variants: RAG-Sequence and RAG-Token. The starting point of RAG research.

2020

Dense Passage Retrieval (DPR)

Karpukhin et al. — EMNLP 2020

The standard for dense vector-based document retrieval. Achieved semantic search surpassing traditional TF-IDF/BM25.

2022

HyDE: Hypothetical Document Embeddings

Gao et al., 2022

Generates a hypothetical answer instead of using the question directly for retrieval. Achieves retrieval performance comparable to fine-tuned retrievers even in zero-shot settings.

2023

Self-RAG: Self-Reflective RAG

Asai et al. — ICLR 2024 (Oral)

Uses Reflection Tokens for the LLM to self-assess retrieval necessity, document relevance, and response quality.

2024

CRAG: Corrective RAG

Yan et al., 2024

Classifies retrieval results as Correct/Incorrect/Ambiguous and selects correction paths. 19~37% accuracy improvement.

2024

RAPTOR: Recursive Abstractive Processing

Sarthi et al. — ICLR 2024

Builds a multi-level abstraction tree through recursive clustering + summarization. +20% improvement on QuALITY.

2024

GraphRAG

Microsoft Research, 2024

Knowledge graph + hierarchical community summaries. Excels at global questions spanning multiple documents.

2024

Contextual Retrieval

Anthropic, 2024

Adds document-level context as a prefix to chunks. Combined with contextual BM25, reduces the retrieval failure rate by 49%, and by 67% when reranking is added.

2024

Adaptive RAG

Jeong et al., 2024

Dynamically selects No Retrieval / Single-step / Multi-step based on question complexity.

2024

Late Chunking

Jina AI, 2024

Embeds the full document first, then extracts chunks from token vectors. Preserves context without additional LLM cost. API support in jina-v3.

2024

RAFT (Retrieval Augmented Fine-Tuning)

UC Berkeley, 2024

Combines RAG + Fine-tuning. Trains the model to ignore distractor documents. Outperforms pure RAG in specialized domains.

2025

SimRAG (Self-Improving RAG)

NAACL 2025

Generates self-training QA pairs from unlabeled corpora. Domain adaptation without labeling costs. Validated on 11 datasets.

2025

MCP (Model Context Protocol)

Anthropic, announced November 2024

An open standard protocol connecting AI models with external tools and data. Standardizes tool integration for Agentic RAG.

2026

Context Engineering & Knowledge Runtime

Industry Paradigm Shift

The paradigm expands from RAG to Context Engineering. The Knowledge Runtime concept emerges, integrating RAG (static knowledge) + Memory (dynamic history) + MCP (tool connectivity).