The technique that lets LLMs take an open-book exam.
By augmenting knowledge through retrieval, it produces more accurate and trustworthy AI responses.
LLMs are powerful, but they have fundamental limitations that are hard to overcome on their own.
LLMs fabricate plausible-sounding answers even when the information is not in their training data. Confidently generating wrong answers without fact-checking is the biggest weakness of LLMs.
LLMs don't know about anything that happened after their training data cutoff date. They cannot access yesterday's news, the latest API changes, or real-time data.
LLMs cannot answer questions about private data such as internal company documents or specialized medical/legal knowledge. Fine-tuning is costly and time-consuming.
LLMs cannot cite their sources. They can't answer "Where did you get this information?", making it difficult to verify trustworthiness.
Retrieval-Augmented Generation — generation augmented by retrieval.
Before generating an answer, the LLM retrieves relevant documents and uses them as reference.
Closed-book exam
Relies only on memorized knowledge
Open-book exam
Answers while consulting reference materials
Converting text into numerical vectors. Since "cat" and "kitten" have similar meanings, they are positioned close together in vector space. This enables semantic search rather than keyword matching.
A specialized DB that stores embedding vectors and performs fast similarity search. It finds and returns the document vectors closest to the query vector.
Searching by semantic similarity rather than keyword matching. Searching for "how to get a raise" can also find "salary negotiation strategies". Cosine similarity measures how closely two vectors point in the same direction, and that value serves as the similarity score.
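To make the three concepts above concrete, here is a minimal sketch of similarity search over embeddings. The three-dimensional vectors are hand-picked placeholders rather than output from a real embedding model, and `TOY_EMBEDDINGS` and `nearest` are illustrative names only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand so that "cat" and "kitten" are close.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "invoice": [0.1, 0.2, 0.95],
}

def nearest(query_vec, store, top_k=2):
    """Return the top_k (key, score) pairs ranked by cosine similarity."""
    scored = [(key, cosine_similarity(query_vec, vec)) for key, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

results = nearest(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS)
```

A vector database performs the same ranking, but typically uses approximate nearest-neighbor indexes so that it does not have to score every stored vector.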
Proposed two variants: RAG-Sequence and RAG-Token. Uses DPR (Dense Passage Retrieval) for retrieval and BART for generation. Presented at NeurIPS 2020, marking the starting point of RAG research.
RAG has evolved across three generations. Let's examine how each generation overcomes the limitations of its predecessor.
A fixed pipeline. Query → Retrieve → Generate. Simple but with clear limitations.
Query → Retrieve → Generate
Optimizes pre/during/post retrieval. Adds query rewriting, hybrid search, reranking, and more.
Query Optimization → Hybrid Retrieve → Rerank → Generate
Modular composition. Dynamically configures routers, evaluators, iterative retrieval, and more to match the task.
Router → [Dynamic Module Composition] → Self-Evaluation → Iterate/Complete
The most basic "Retrieve → Read" pipeline. The starting point of RAG, but with clear limitations.
When retrieval quality is poor, the response degrades too. If irrelevant documents are retrieved, the LLM generates incorrect answers based on them.
Splitting documents by fixed size cuts context or mixes multiple topics into a single chunk, creating noise.
Retrieved documents are passed to the LLM as-is even when irrelevant to the query. There is no filtering or quality assessment.
There is no logic to handle duplicate documents or contradictory information, which can confuse the LLM.
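The naive Retrieve → Read pipeline and its weaknesses can be sketched in a few lines. This is an illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a placeholder for an actual LLM call, and the documents are invented.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=2):
    """Rank documents by similarity to the query and keep the top_k."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

def build_prompt(query, contexts):
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n{joined}\nQuestion: {query}"

def fake_llm(prompt):
    """Placeholder for a real LLM call; it only reports the prompt size."""
    return f"[answer generated from a {len(prompt)}-character prompt]"

DOCS = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Refund requests are processed within 5 business days.",
]

query = "what is the refund policy"
contexts = retrieve(query, DOCS)          # Retrieve
answer = fake_llm(build_prompt(query, contexts))  # Generate
```

In this toy run the second retrieved document is the office-hours sentence, which only shares the word "is" with the query. It is passed to the prompt unfiltered, which is exactly the missing relevance check described above.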
Systematically addresses the limitations of Naive RAG at each stage: pre/during/post retrieval.
Transforms user questions into forms optimized for retrieval. Converts colloquial language into search queries, splits compound questions into single queries, etc.
Hypothetical Document Embeddings — The LLM first generates a hypothetical answer, then uses its embedding for retrieval. The insight is that answers are more similar to documents than questions are.
Expands a single question from multiple perspectives to broaden the search scope. Techniques include Multi-Query, Sub-Question Decomposition, and more.
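As an illustration of the HyDE idea described above, the sketch below retrieves with the embedding of a drafted answer instead of the question. It is not the reference implementation: `fake_hypothetical_answer` returns a canned string where a real system would call an LLM, and `embed` is a bag-of-words stand-in for an embedding model.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (assumption: a real system uses a dense model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_hypothetical_answer(question):
    """Hypothetical stand-in for an LLM that drafts a plausible answer."""
    canned = {
        "how do i get a raise?": (
            "Prepare salary negotiation strategies and present "
            "market salary data to your manager."
        ),
    }
    return canned.get(question.lower(), question)

def hyde_retrieve(question, documents, top_k=1):
    draft = embed(fake_hypothetical_answer(question))  # embed the draft, not the question
    return sorted(documents, key=lambda d: cosine(draft, embed(d)), reverse=True)[:top_k]

DOCS = [
    "Salary negotiation strategies for your annual review.",
    "How to book a meeting room.",
]
question = "How do I get a raise?"

# Baseline: embed the question directly.
direct_top = sorted(DOCS, key=lambda d: cosine(embed(question), embed(d)), reverse=True)[0]
hyde_top = hyde_retrieve(question, DOCS)[0]
```

With these toy inputs, direct matching picks the meeting-room document because it shares "how" and "a" with the question, while the drafted answer lands on the salary negotiation document.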
Splits documents by semantic units rather than fixed sizes.
Adds document-level context as a prefix to each chunk before embedding. By prepending context like "This chunk is excerpted from the Q3 performance section of the 2024 revenue report," retrieval accuracy improves significantly.
Attaches supplementary information such as date, source, category, and author to documents. When combined with filtering, retrieval precision increases significantly.
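The context prefix and metadata filtering described above might look like the following sketch. The `Chunk` fields, the template wording, and the sample data are assumptions for illustration; in practice the context sentence is often written by an LLM per chunk rather than filled in from a template.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    doc_title: str
    section: str
    metadata: dict = field(default_factory=dict)

def contextualize(chunk):
    """Prepend document-level context so the embedded text carries its origin."""
    prefix = (
        f"This chunk is excerpted from the {chunk.section} section "
        f"of the {chunk.doc_title}."
    )
    return f"{prefix}\n{chunk.text}"

def filter_by_metadata(chunks, **required):
    """Keep chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in required.items())
    ]

# Invented sample chunks.
chunks = [
    Chunk("Operating margin held steady.", "2024 revenue report",
          "Q3 performance", {"year": 2024, "category": "finance"}),
    Chunk("Badge access rules changed.", "office handbook",
          "security", {"year": 2023, "category": "hr"}),
]

# Filter first, then build the text that would be sent to the embedding model.
to_embed = [contextualize(c) for c in filter_by_metadata(chunks, category="finance")]
```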
Combines Dense (vector) search and Sparse (BM25 keyword) search. Captures both semantic similarity and keyword precision for the most reliable performance.
Further trains embedding models with domain-specific data. Reports show a 12 to 30% performance improvement over general-purpose models. Highly effective in specialized domains like medical, legal, and finance.
Layers summary indexes, full-text indexes, and metadata indexes. First narrows down relevant documents via summaries, then retrieves details from full text.
Stores per-token embeddings and computes similarity via MaxSim. Achieves a good balance between Bi-Encoder speed and Cross-Encoder accuracy. ColPali/ColQwen extend this approach to multimodal retrieval.
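The MaxSim scoring used in late interaction can be sketched as follows: each query token takes its best match among the document tokens, and the per-token maxima are summed. The two-dimensional token vectors are toy values, not model output.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the highest dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy per-token embeddings (assumed values for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # has an exact match for each query token
doc_b = [[0.7, 0.7], [0.6, 0.8]]               # only partial matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because document token vectors can be computed and stored ahead of time, only the query has to be encoded at search time, which is where the speed advantage over a Cross-Encoder comes from.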
Re-evaluates search results using a Cross-Encoder and reorders the rankings. Since it examines question-document pairs together, it enables more precise relevance judgments.
A 52% improvement in MAP (Mean Average Precision) has been reported. Representative rerankers include Cohere Rerank, bge-reranker, and FlashRank.
Extracts only relevant portions from retrieved documents to save context window space. Methods like LongLLMLingua retain only key sentences.
Removes duplicate documents and ensures diversity using similarity thresholds, MMR (Maximal Marginal Relevance), and other techniques.
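MMR, mentioned above, can be sketched as a greedy selection that trades relevance against redundancy. The similarity values below are invented, and `lambda_` (the relevance weight) is set to 0.5 purely for illustration.

```python
def mmr(query_sims, doc_sims, k, lambda_=0.5):
    """Greedy Maximal Marginal Relevance.

    query_sims[i]  : similarity of document i to the query
    doc_sims[i][j] : similarity between documents i and j
    Returns the indices of the selected documents, in selection order.
    """
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize closeness to anything already selected.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_ * query_sims[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Documents 0 and 1 are near-duplicates; document 2 is less relevant but different.
query_sims = [0.90, 0.88, 0.60]
doc_sims = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
picked = mmr(query_sims, doc_sims, k=2)
```

Here the near-duplicate is skipped in favor of the more diverse document, even though its raw relevance score is higher.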
A paradigm that breaks the pipeline into modular units instead of a fixed sequence, composing them dynamically to match the task.
Classifies the question type and decides whether retrieval is needed and which source to use. Simple questions go directly to LLM generation.
Evaluates whether search results are sufficient. If insufficient, triggers re-retrieval or switches to a different source.
The LLM itself judges "Is retrieval needed right now?" Reduces unnecessary retrieval to improve efficiency.
Dynamically selects from multiple sources: Vector DB, web search, SQL DB, APIs, and more.
Repeats the retrieve-generate cycle multiple times, not just once, to refine the answer.
Leverages previous conversations and search history to maintain context.
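A minimal sketch of how the routing, sufficiency check, and source switching described above could fit together. The keyword rules, the `is_sufficient` threshold, and the fake sources are all assumptions; a real router would more likely use a classifier or an LLM.

```python
def route(question):
    """Hypothetical keyword router: returns the sources to try, in order."""
    q = question.lower()
    if any(w in q for w in ("today", "latest", "news")):
        return ["web", "vector_db"]
    if any(w in q for w in ("how many", "total", "average")):
        return ["sql", "vector_db"]
    if len(q.split()) <= 3:
        return []  # simple question: skip retrieval entirely
    return ["vector_db", "web"]

def is_sufficient(results, min_hits=1):
    """Toy judge: results count as sufficient if at least min_hits came back."""
    return len(results) >= min_hits

def answer(question, sources):
    """Try each routed source in turn and stop at the first sufficient result."""
    for name in route(question):
        results = sources[name](question)
        if is_sufficient(results):
            return {"source": name, "context": results}
    return {"source": None, "context": []}

# Fake sources standing in for a vector DB, web search, and a SQL backend.
SOURCES = {
    "vector_db": lambda q: ["internal doc: SDK release notes"],
    "web": lambda q: [],   # pretend the web search found nothing
    "sql": lambda q: [],
}

routed = answer("What is the latest release of the SDK?", SOURCES)
skipped = answer("hi there", SOURCES)
```

The first question is routed to web search, which returns nothing, so the loop falls through to the vector DB. The second is judged simple and never triggers retrieval.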
The LLM uses Reflection Tokens to self-assess whether retrieval is needed, document relevance, and response quality.
Classifies search results as Correct / Incorrect / Ambiguous and selects different correction paths accordingly.
Refine documents, then generate
Fall back to web search
Combine documents + web search
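The three correction paths above can be sketched as a small dispatch function. The thresholds `UPPER` and `LOWER`, the `refine` rule, and the `web_search` callable are assumptions for illustration and are not taken from any particular implementation.

```python
UPPER, LOWER = 0.7, 0.3  # assumed confidence thresholds

def classify(best_score):
    """Map the evaluator's best relevance score to a verdict."""
    if best_score >= UPPER:
        return "correct"
    if best_score <= LOWER:
        return "incorrect"
    return "ambiguous"

def refine(scored_docs):
    """Keep only the documents the evaluator did not reject."""
    return [doc for doc, score in scored_docs if score > LOWER]

def corrective_context(query, scored_docs, web_search):
    """Choose the context according to the verdict on the retrieved documents."""
    best = max((score for _, score in scored_docs), default=0.0)
    verdict = classify(best)
    if verdict == "correct":
        return verdict, refine(scored_docs)            # refine, then generate
    if verdict == "incorrect":
        return verdict, web_search(query)              # fall back to web search
    return verdict, refine(scored_docs) + web_search(query)  # combine both

def fake_web_search(query):
    """Placeholder for a real web search call."""
    return [f"web: {query}"]

good = corrective_context("q", [("a", 0.9), ("b", 0.1)], fake_web_search)
bad = corrective_context("q", [("a", 0.2)], fake_web_search)
mixed = corrective_context("q", [("a", 0.5), ("b", 0.1)], fake_web_search)
```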
Builds a knowledge graph and performs retrieval based on entity relationships. Excels at "global questions" that connect information scattered across multiple documents.
Recursive tree summarization — Repeatedly clusters and summarizes document chunks to build a multi-level abstraction tree.
An AI Agent uses retrieval as a tool while running a plan-execute-reflect loop.
Highly effective for complex multi-step questions.
Specialized agents for retrieval, summarization, evaluation, and more collaborate by role. You can also deploy modality-specific expert agents (text, image, table, etc.).
Beyond vector search, the agent selects and uses various tools like web search, SQL queries, calculators, and code execution as needed.
Not every question triggers retrieval. The agent assesses question complexity: simple questions get immediate answers, while only complex questions go through multi-step retrieval.
A small specialist model generates multiple drafts in parallel, and a larger generalist model verifies them. The Draft-then-Verify pattern improves both accuracy and latency.
How do we measure the quality of a RAG system? We evaluate retrieval and generation separately.
Is the response faithful to the retrieved documents? Measures whether the answer is based solely on document content without fabrication.
Faithfulness = Claims supported by documents / Total claims
Is the response relevant to the question? Evaluates whether irrelevant content is included in the answer.
Relevancy = Average similarity between the original question and the questions generated from the response
Are the retrieved documents precise? Checks whether too many irrelevant documents are included.
Precision = Retrieved relevant documents / Total retrieved documents
Were all necessary documents retrieved without omission? Checks whether all information needed for the correct answer is included in the search results.
Recall = Retrieved relevant documents / Total relevant documents
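The ratio-based formulas above translate directly into code. This sketch follows the simple definitions given here; evaluation frameworks may define the same metric names differently, for example by weighting results by rank. Document IDs and claim labels are invented.

```python
def context_precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

def faithfulness(claim_supported):
    """claim_supported holds one boolean per claim in the response."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}

precision = context_precision(retrieved, relevant)   # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)         # 2 of 3 relevant were retrieved
faith = faithfulness([True, True, False, True])      # 3 of 4 claims are supported
```

In practice the hard part is producing the inputs: deciding which documents are relevant and which claims are supported usually requires human labels or an LLM judge.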
The most popular open-source evaluation framework. Automatically computes core metrics like Faithfulness, Relevancy, Context Precision/Recall.
Open Source
Unit-test style evaluation that integrates into CI/CD pipelines. Automates RAG evaluation just like pytest.
CI/CD Integration
Real-time monitoring for production environments. Enables continuous quality tracking with feedback functions.
Production Monitoring
A benchmark with 100K+ examples across 12 domains. Used for comparing RAG performance across industries.
Benchmark
Recommendations and a roadmap for applying RAG systems in production.
Build a basic pipeline. Fixed-size chunking + simple vector search + LLM generation. The goal is to quickly prove value.
Introduce BM25 + Dense hybrid search. Add a Reranker. This alone can deliver significant quality improvements.
Apply Recursive/Semantic chunking. Add context to chunks with Contextual Retrieval. Add metadata tagging.
Build an automated evaluation pipeline with RAGAS and similar tools. Measure improvement effects with quantitative metrics.
When you need to handle multi-step questions and multi-source scenarios, introduce routers, iterative retrieval, and agent patterns.
| Strategy | Method | Pros | Cons | Recommended For |
|---|---|---|---|---|
| Fixed-size | Split by fixed token count | Simple to implement, predictable | Context breakage | Quick prototypes |
| Recursive | Recursively split using separator hierarchy | Stable (69% win rate) | Requires separator configuration | General purpose (default recommendation) |
| Semantic | Split at semantic shift points | High recall | Chunks may be too small | Multi-topic documents |
| Parent-Child | Search with small chunks, return large chunks | Precise search + rich context | Increased index complexity | Long documents, reports |
| Sentence Window | Sentence-level + surrounding sentences | Sentence-level precision | Inefficient for short documents | FAQs, manuals |
Chunk sizes of 256 to 512 tokens are the most stable. Chunks that are too small lack context, while chunks that are too large add noise.
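To compare the first two strategies in the table, here is a minimal sketch of fixed-size and recursive splitting. It is a simplification under stated assumptions: the recursive splitter measures length in characters rather than tokens, the separator hierarchy is an assumed default, and separators that fall on a chunk boundary are dropped.

```python
def fixed_size_chunks(tokens, size, overlap=0):
    """Slide a window of `size` tokens, stepping by size - overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones if needed."""
    if len(text) <= max_len:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_len:
                current = candidate          # keep merging small parts
                continue
            if current:
                chunks.append(current)
            if len(part) > max_len:
                # This part alone is too long: retry with finer separators.
                chunks.extend(recursive_split(part, max_len, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

windows = fixed_size_chunks(list(range(10)), size=4, overlap=1)

sample = "Para one sentence. Another sentence here.\n\nSecond paragraph is short."
pieces = recursive_split(sample, max_len=40)
```

The recursive version keeps the short second paragraph intact and only breaks the first paragraph at a sentence boundary, whereas a fixed window would cut wherever the count runs out.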
Before switching LLMs, improve retrieval quality. Improving retrieval alone, with the model unchanged, can yield an accuracy improvement of 50% or more.
Build the evaluation pipeline first, and compare metrics with every change. Make decisions based on data, not intuition.
It's not a one-time build. Set KPIs and continuously improve. Update documents, swap models, tune pipelines.
A chronological overview of the essential papers in RAG research.
Combines DPR + BART. Proposed two variants: RAG-Sequence and RAG-Token. The starting point of RAG research.
The standard for dense vector-based document retrieval. Achieved semantic search surpassing traditional TF-IDF/BM25.
Generates a hypothetical answer instead of using the question directly for retrieval. Achieves fine-tuned model-level retrieval performance even in zero-shot settings.
Uses Reflection Tokens for the LLM to self-assess retrieval necessity, document relevance, and response quality.
Classifies retrieval results as Correct/Incorrect/Ambiguous and selects correction paths. Reports a 19 to 37% accuracy improvement.
Builds a multi-level abstraction tree through recursive clustering + summarization. +20% improvement on QuALITY.
Knowledge graph + hierarchical community summaries. Excels at global questions spanning multiple documents.
Adds document-level context as a prefix to chunks. Combined with BM25, reduces retrieval failure rate by 67%.
Dynamically selects No Retrieval / Single-step / Multi-step based on question complexity.
Embeds the full document first, then extracts chunks from token vectors. Preserves context without additional LLM cost. API support in jina-v3.
Combines RAG + Fine-tuning. Trains the model to ignore distractor documents. Outperforms pure RAG in specialized domains.
Generates self-training QA pairs from unlabeled corpora. Domain adaptation without labeling costs. Validated on 11 datasets.
An open standard protocol connecting AI models with external tools and data. Standardizes tool integration for Agentic RAG.
The paradigm expands from RAG to Context Engineering. The Knowledge Runtime concept emerges, integrating RAG (static knowledge) + Memory (dynamic history) + MCP (tool connectivity).