An engineering reference for designing production-grade RAG systems.
Covers architecture patterns, benchmarks, trade-offs, and operational best practices.
Overall architecture of a production RAG system and key design decision points.
Chunking is the foundation of RAG quality. The right strategy can improve retrieval accuracy by 20-40%.
| Strategy | How It Works | Optimal Size | Precision | Recall | Complexity | Recommended For |
|---|---|---|---|---|---|---|
| Fixed-size | Fixed token count + overlap | 256-512 tok | ●●●○ | ●●●○ | Low | MVP, rapid prototyping |
| Recursive | Recursive split by delimiter hierarchy | 512 tok | ●●●● | ●●●● | Low | General-purpose (default) |
| Semantic | Split at embedding similarity breakpoints | Variable | ●●●○ | ●●●● | Medium | Multi-topic docs, transcripts |
| Parent-Child | Search small chunks → Return large chunks | 128 / 512 | ●●●● | ●●●● | High | Long reports, technical docs |
| Document Summary | Search by doc summary → Return original | Summary 200 tok | ●●●● | ●●●○ | High | Global questions, summary answers |
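As an illustration of the Parent-Child row above, the following minimal sketch builds 512-token parents with 128-token children and maps matched children back to their parents. The function names and the list-of-tokens input are assumptions made for this example, not part of any specific library.

```python
def build_parent_child(tokens, parent_size=512, child_size=128):
    """Split tokens into parent chunks, then split each parent into children.

    Returns (parents, children); each child is (parent_id, child_tokens).
    """
    parents, children = [], []
    for p_start in range(0, len(tokens), parent_size):
        parent = tokens[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(parent)
        for c_start in range(0, len(parent), child_size):
            children.append((parent_id, parent[c_start:c_start + child_size]))
    return parents, children


def expand_to_parents(matched_child_ids, children, parents):
    """Return the parents of the matched children, deduplicated, in match order."""
    seen, result = set(), []
    for child_id in matched_child_ids:
        parent_id = children[child_id][0]
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result
```

In a real pipeline only the children would be embedded and searched; the parent text is what gets passed to the LLM.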
Prepend document-level context as a prefix to each chunk before embedding. An LLM generates the contextual sentence for each chunk automatically.
Original chunk: "Revenue increased 15% year-over-year."
Contextualized chunk: "[Q3 2024 Earnings Report, Operating Profit Section] Revenue increased 15% year-over-year."
Traditional: chunk first → embed each (context loss). Late Chunking: embed the full document first → extract chunks from token vectors. Long-context models (8K+ tokens) attend to the entire document, so each chunk naturally reflects its surrounding context.
Traditional: Document → [chunk1, chunk2, chunk3] → [embed(chunk1), embed(chunk2), embed(chunk3)]
Late Chunking: Document → embed(full document) → token vectors → [pool(tokens 1-50), pool(tokens 51-100), ...]
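The pooling step of Late Chunking can be sketched as follows. This assumes the per-token vectors have already been produced by a long-context embedding model run over the full document, and it uses mean pooling, which is one common choice rather than the only one. `mean_pool` and `late_chunk` are illustrative names.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, element by element."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk(token_vectors, spans):
    """Pool token vectors into one embedding per chunk.

    spans: list of (start, end) token index pairs, end exclusive.
    """
    return [mean_pool(token_vectors[start:end]) for start, end in spans]
```

Because every token vector was computed with attention over the whole document, each pooled chunk embedding carries some of the surrounding context.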
An LLM agent analyzes each document's characteristics and chooses the chunking strategy dynamically. Research papers → Semantic Chunking, financial reports → page-based, code files → function-level. This addresses the limitations of a single fixed strategy, but increases ingestion cost.
10-20% overlap between chunks is recommended. Prevents context breaks. Over 50% causes duplicate noise.
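A minimal sketch of fixed-size chunking with overlap, assuming the text has already been tokenized into a list. The function name and the 15% default are illustrative choices within the recommended 10-20% range.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

With `chunk_size=10` and `overlap_ratio=0.2`, consecutive chunks share their last and first two tokens.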
AST-based splitting is most effective. Split by function/class and prepend signatures as prefixes.
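For Python sources, function/class-level splitting can be sketched with the standard `ast` module. This is a simplified illustration: it handles only top-level functions and methods directly inside top-level classes, ignores decorators and module-level statements, and the `# in:` prefix format is an assumption of this example.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Split source into one chunk per function or method (Python 3.9+)."""
    lines = source.splitlines()

    def text_of(node):
        # lineno/end_lineno are 1-based and inclusive
        return "\n".join(lines[node.lineno - 1:node.end_lineno])

    def header_of(node):
        return lines[node.lineno - 1].strip()

    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            chunks.append({"name": node.name, "text": text_of(node)})
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, func_types):
                    # Prefix each method with its class header for context
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "text": f"# in: {header_of(node)}\n{text_of(item)}",
                    })
    return chunks
```

Other languages would need their own parser; the idea of splitting on syntax nodes and prefixing the enclosing signature stays the same.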
Convert tables to Markdown/HTML and keep as a single chunk. Row-level splitting destroys meaning.
Contextual Retrieval incurs LLM cost at ingestion but works with any embedding model. Late Chunking adds no LLM cost but requires a compatible long-context embedding model. If both are available, combining them is best.
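The contextual prefixing step (prepending document-level context to each chunk before embedding) might look like the sketch below. `generate_context` stands in for an LLM call and is a hypothetical callable; `stub_context` simply returns a fixed string so the example runs without any API.

```python
from typing import Callable


def contextualize(chunk: str, document: str,
                  generate_context: Callable[[str, str], str]) -> str:
    """Prefix a chunk with a short, document-aware context string."""
    context = generate_context(document, chunk).strip()
    return f"[{context}] {chunk}" if context else chunk


def stub_context(document: str, chunk: str) -> str:
    # Stand-in for an LLM call that situates the chunk within the document
    return "Q3 2024 Earnings Report, Operating Profit Section"
```

The contextualized string, not the raw chunk, is what gets embedded and indexed.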
Embedding model choice determines over 50% of retrieval quality. Domain benchmarks are key.
| Model | Dim | Max Tokens | MTEB Avg | Multilingual | Features |
|---|---|---|---|---|---|
| Cohere embed-v4 | 1024 | 128K | 65.2 | 100+ langs | MTEB #1, multimodal, compression |
| OpenAI text-embedding-3-large | 3072 | 8191 | 64.6 | Multilingual | Matryoshka (dimension reduction) |
| Voyage-3-large | 1024 | 32K | 64.8 | Multilingual | Separate code-specific model |
| BGE-M3 | 1024 | 8192 | 62.1 | 100+ langs | Dense+Sparse+ColBERT output |
| Jina-embeddings-v3 | 1024 | 8192 | 61.5 | 89 langs | Task-specific LoRA, open-source |
| E5-Mistral-7B | 4096 | 32K | 61.8 | English-centric | LLM-based, instruction support |
Multilingual performance matters. BGE-M3 is open-source and self-hostable. Cohere offers the best API quality.
Code-specific models are essential. General embedding models are weak at understanding code semantics.
For documents with PDF scans, charts, or diagrams, use a multimodal model that embeds pages directly, without text extraction.
Reducing dimensions from 3072 → 256 loses only 5% accuracy. Saves 12x storage.
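A minimal sketch of Matryoshka-style dimension reduction and the storage arithmetic behind the 12x figure. It assumes the embedding model was trained so that leading dimensions carry most of the signal; truncating an arbitrary embedding this way would not preserve quality. The function names are illustrative.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def storage_ratio(full_dim, reduced_dim):
    """Storage saving factor when vectors use the same numeric type."""
    return full_dim / reduced_dim
```

For 3072 → 256 dimensions, `storage_ratio(3072, 256)` gives 12.0, matching the figure above. The accuracy impact should be measured on your own data.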
Retrieval quality sets the upper bound of overall RAG performance. Improving retrieval alone can yield 50%+ gains with the same LLM.
Rewrite the original question from 3-5 perspectives, search each, and combine results via RRF.
LLM generates a hypothetical answer → search using the answer's embedding. Based on the insight that answers are more similar to documents than questions.
Generate a one-step-abstracted question from a specific query to search a broader scope.
Break complex questions into sub-questions. Search each independently, then synthesize.
Multi-Query + Reciprocal Rank Fusion. Combines multi-query results using rank-based scoring.
score(d) = Σ 1 / (k + rank_i(d)) // k = 60 (typical)
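The RRF formula above translates directly into a few lines of code. In this sketch each ranking is a list of document IDs ordered best-first, with ranks starting at 1; the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum of 1 / (k + rank_i(d))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Documents missing from a ranking simply contribute nothing for that ranking, so lists of different lengths can be fused without normalizing raw scores.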
score(d) = α · dense_score(d) + (1-α) · sparse_score(d) // α ∈ [0, 1]
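The weighted combination above only makes sense when dense and sparse scores are on comparable scales. The sketch below min-max normalizes each score set before mixing; that normalization step is an assumption of this example and not part of the formula itself.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no ranking signal, so map everything to 0.0
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(dense, sparse, alpha=0.5):
    """score(d) = alpha * dense(d) + (1 - alpha) * sparse(d)."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    docs = set(dense_n) | set(sparse_n)
    return {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in docs
    }
```

Setting `alpha=1.0` reduces this to dense-only ranking, and `alpha=0.0` to sparse-only.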
ColBERT-style late interaction stores per-token embeddings and computes similarity via MaxSim (Maximum Similarity): for each query token, take its highest similarity against any document token, then sum these maxima over all query tokens.
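MaxSim can be written in a few lines. This sketch uses plain Python lists and dot-product similarity, and assumes the token vectors are already normalized; a production system would use batched tensor operations instead.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best match among document tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```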
Post-retrieval re-ranking. The stage that delivers the biggest quality improvement with the least effort.
| Model | Type | Latency | NDCG@10 | Cost | Features |
|---|---|---|---|---|---|
| Cohere Rerank 3.5 | Cross-Encoder | ~80ms | ●●●● | $2/1K req | API, multilingual, production-proven |
| bge-reranker-v2-m3 | Cross-Encoder | ~60ms | ●●●● | Self-host | Open-source, multilingual, GPU required |
| Jina-reranker-v2 | Cross-Encoder | ~50ms | ●●●○ | API/$0.02 | Lightweight, multilingual |
| FlashRank | Cross-Encoder | ~15ms | ●●●○ | Free/Self | Ultra-lightweight, CPU inference, fast |
| RankGPT / LLM Reranker | LLM Listwise | ~500ms+ | ●●●● | LLM API cost | Highest accuracy, high cost/latency |
Top-100 (vector search) → Top-20 (BM25 filter) → Top-5 (Reranker). Search broadly, then progressively narrow down.
FlashRank for Top-100→Top-20, Cohere for Top-20→Top-5. Two-stage reranking optimizes cost/accuracy.
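The progressive narrowing described above can be sketched as a generic funnel over (scorer, keep) stages. The two scorers here are toy stand-ins for a cheap and a more precise reranker; they are assumptions for illustration and do not call any real reranking library.

```python
def keyword_overlap(query, doc):
    """Cheap stand-in scorer: number of shared words."""
    return len(set(query.split()) & set(doc.split()))


def overlap_density(query, doc):
    """Stand-in for a more precise scorer: shared words per document word."""
    return keyword_overlap(query, doc) / max(len(doc.split()), 1)


def funnel(query, candidates, stages):
    """Apply each (score_fn, keep) stage in order, keeping the top `keep` docs."""
    for score_fn, keep in stages:
        ranked = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
        candidates = ranked[:keep]
    return candidates
```

The expensive scorer only ever sees the survivors of the cheap stage, which is where the cost saving comes from.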
Strategies for effectively passing retrieved context to the LLM and generating faithful responses.
Force source attribution in [1], [2] format for each claim in the response. Reduces hallucination + enables verification.
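Citations are only useful if they can be checked. The sketch below is one hypothetical post-processing step: it flags citation numbers that do not correspond to a supplied source and sentences that carry no citation at all. The sentence splitter is deliberately naive and assumes the citation appears before the sentence-ending punctuation.

```python
import re


def check_citations(response, num_sources):
    """Return (invalid_citation_numbers, sentences_without_citation)."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", response)}
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    # Naive split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    uncited = [s for s in sentences if s and not re.search(r"\[\d+\]", s)]
    return invalid, uncited
```

Flagged sentences can be regenerated, removed, or surfaced to the user as unverified.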
LLMs remember the beginning and end of context well but struggle with the middle. Place the most relevant documents first.
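One way to act on this is to reorder documents so that the strongest ones sit at the edges of the context and the weakest in the middle. Note that this goes slightly beyond "most relevant first": it also uses the end of the context. The function name is illustrative.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place top documents at the start and end, weakest in the middle.

    Input must be sorted from most to least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: even positions go to the front, odd ones to the back
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

The most relevant document stays first, the second most relevant ends up last, and the least relevant lands near the middle.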
Don't feed retrieved documents as-is — extract only relevant sentences. Saves tokens + reduces noise.
For each retrieved document, note whether it's relevant to the question, then synthesize. Filters out noisy documents.
Beyond simple pipelines — agent patterns that autonomously retrieve, judge, and iterate.
Self-judgment via 4 Reflection Tokens:
[Retrieve]: Is retrieval needed? Yes/No/Continue
[IsRel]: Is the retrieved document relevant? Relevant/Irrelevant
[IsSup]: Is the response grounded in the document? Fully/Partially/No
[IsUse]: Is the final response useful? Rating 1-5
A lightweight evaluator classifies retrieval results as Correct / Incorrect / Ambiguous, then selects a correction path.
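The three-way classification can be sketched as thresholding an evaluator confidence score. The thresholds (0.7 and 0.3) and the score range are hypothetical values chosen for illustration, and the correction paths themselves are left as caller-supplied handlers rather than prescribed here.

```python
def classify_retrieval(score, upper=0.7, lower=0.3):
    """Map an evaluator score in [0, 1] to Correct / Incorrect / Ambiguous."""
    if score >= upper:
        return "Correct"
    if score <= lower:
        return "Incorrect"
    return "Ambiguous"


def route(score, handlers):
    """Run the caller-supplied handler registered for the resulting label."""
    return handlers[classify_retrieval(score)]()
```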
A lightweight classifier judges question complexity and selects the optimal strategy.
A training recipe that combines RAG with fine-tuning. During training, the model is given relevant documents + distractor documents together, and learns to ignore the distractors. Achieves higher accuracy than vanilla RAG in specialized domains like medical and legal. Consider when sufficient domain data is available.
The LLM self-generates QA pairs from an unlabeled corpus, applies quality filtering, then self-trains. Improves domain-specific RAG performance without labeling costs. Validated across 11 datasets and 3 domains.
Moving beyond the narrow "RAG" pattern to systematically engineering the entire context delivered to the LLM. Expanding into a "Knowledge Runtime" concept that integrates RAG (static knowledge retrieval) + Memory (dynamic conversation history) + MCP (tool/service connections).
Model Context Protocol (MCP) is an open standard for connecting AI models to external data/tools. Previously, each tool required custom code, but MCP provides a unified interface for vector DBs, SQL DBs, web search, and APIs. Makes Agentic RAG's tool selection dynamic and extensible.
No measurement, no improvement. Evaluate retrieval and generation separately, and continuously monitor in production.
Ratio of relevant documents among retrieved ones: relevant retrieved docs / total retrieved docs.
Ratio of needed information included in retrieval: retrieved relevant docs / total relevant docs.
Mean, across queries, of the reciprocal rank of the first relevant document. Measures retrieval ordering quality.
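The three retrieval metrics above can be computed as follows. The rank-based metric is implemented here as mean reciprocal rank (MRR), which is an assumption about which ordering metric is intended. Function names are illustrative.

```python
def precision(retrieved, relevant):
    """Relevant retrieved docs / total retrieved docs."""
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Retrieved relevant docs / total relevant docs."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_list, relevant_set), one entry per query."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)
```

A query with no relevant document in its results contributes 0 to the MRR average.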
Is the response faithful to the retrieved documents? Measures fabricated content ratio. The most important metric.
Is the response appropriate to the question? Checks for irrelevant content.
Factual accuracy compared to ground truth. Requires ground truth data.
Open-source automated evaluation. LLM-based evaluation without separate ground truth. CI/CD integration ready.
Pytest-style RAG unit tests. Prevent regressions with assert_test. Build pipeline integration.
A stronger LLM evaluates a weaker LLM's output. Pairwise comparison, pointwise scoring, reference-based.
Real-time tracing, feedback function-based quality tracking. Build user feedback loops.
A practical checklist for running RAG reliably in production.
Compare similar query embeddings → skip retrieval/LLM on cache hit. 20-40% cost reduction. Redis + cosine similarity > 0.95.
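A minimal in-memory sketch of the semantic cache. A plain Python list stands in for Redis here, and the linear scan would be replaced by a vector index in practice; the class and method names are assumptions for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_answer)

    def put(self, query_vec, answer):
        self.entries.append((query_vec, answer))

    def get(self, query_vec):
        """Return the cached answer of the most similar query, or None."""
        best_score, best_answer = -1.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None
```

On a hit, both retrieval and the LLM call are skipped; on a miss, the pipeline runs normally and the result is stored with `put`.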
Reduce dimensions from 3072d → 256d. 12x storage savings, 3x faster search. Only 5% accuracy loss.
Simple questions get direct LLM answers without retrieval. 30-50% retrieval cost savings. Implement with a lightweight classifier.
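The routing itself is small once a classifier exists. In this sketch `needs_retrieval`, `retrieve`, and `generate` are all hypothetical caller-supplied callables; the classifier could be a small fine-tuned model or a set of heuristics.

```python
def answer(query, needs_retrieval, retrieve, generate):
    """Retrieve context only when the classifier says the query needs it."""
    context = retrieve(query) if needs_retrieval(query) else []
    return generate(query, context)
```

Misrouted queries are the main risk, so the classifier's error rate is worth tracking alongside the cost savings.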
Compress context with LLMLingua. Save 50-70% tokens. Direct LLM API cost reduction.