Skip to content

Ch 12. Improving Retrieval Quality

Open in Colab

What you'll learn

  • Answer quality is 70% retrieval quality — why generation alone can't fix a bad search
  • Five types of retrieval failure (recall · precision · ranking · metadata · chunk boundaries)
  • BM25 + Dense hybrid search (RRF merging)
  • Cross-encoder reranker — boost precision by reordering your top candidates
  • MMR (diversity) and metadata filters
  • Tradeoffs between chunk size · top-k · rerank cost

Prerequisites

You've run mini_rag.py from Ch 11.


1. Concept — Retrieval sets the ceiling on answers

In Ch 11's query pipeline, if you pull back irrelevant documents, the LLM tries to answer from them anyway. You get hallucination or "I don't know."

Generation can't recover from retrieval failure.

So the biggest time sink in Part 3 is improving retrieval quality. This chapter is your toolbox.


2. Five types of retrieval failure

Type Symptom Root cause
Recall miss Relevant doc never in top-k Query–doc representation gap · chunk boundary issues
Precision miss Junk docs mixed into top-k ANN score alone · no diversity
Ranking error Relevant doc exists but ranks low Dense-only limits (semantic similarity ≠ exact match)
Metadata ignored Stale or private docs in top-k No updated_at or owner filters
Chunk boundary Answer spans two chunks Fixed-length chunking

Each type needs different tools. This chapter maps them.


3. Where it's used

The techniques here go into every production RAG:

  • FAQ bot using just top-1? → rerank to fix ranking
  • Long policy documents? → BM25 hybrid for exact clause matches
  • Multilingual docs? → metadata filter by language
  • Freshness matters? → updated_at filter

4. Minimal example — Dense vs BM25 vs Hybrid

pip install rank-bm25
dense_vs_bm25.py
from openai import OpenAI
from rank_bm25 import BM25Okapi
import numpy as np

openai = OpenAI()

docs = [
    "Refunds available within 7 days of purchase, manager approval required.",
    "Shipping: 2–3 business days, +2 days for remote areas.",
    "Points: 1% of purchase amount, valid for 3 months.",
    "Warranty: 1 year from purchase date.",
    "Non-refundable items: custom-made products, opened software.",
]

# Dense
def embed(texts):
    r = openai.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in r.data])

doc_vecs = embed(docs)

# BM25
tokenized = [d.split() for d in docs]           # (1)!
bm25 = BM25Okapi(tokenized)

query = "I want my money back"
q_vec = embed([query])[0]

# Dense top-3
dense_scores = doc_vecs @ q_vec
dense_top = np.argsort(-dense_scores)[:3]

# BM25 top-3
bm25_scores = bm25.get_scores(query.split())    # (2)!
bm25_top = np.argsort(-bm25_scores)[:3]

print("Dense :", [docs[i] for i in dense_top])
print("BM25  :", [docs[i] for i in bm25_top])
  1. Whitespace splitting is rough for most languages. Production code needs a proper tokenizer (kiwi, KoNLPy for Korean, spaCy for others).
  2. BM25 only scores high if "refund" or "money" appears verbatim. It has no semantic knowledge.

Key observations:

  • Dense finds meaning without the exact word ("I want my money back" → refund doc)
  • BM25 matches only documents with the word present
  • Both miss cases the other catches → combine them (§5.2)

5. Hands-on

5.1 Hybrid search pipeline

Hybrid search + Reranker Hybrid search + Reranker

  • Dense misses exact terms and numbers that BM25 catches
  • BM25 misses synonyms and paraphrases that Dense catches
  • Merge their top-N results using RRF (Reciprocal Rank Fusion):
rrf.py
def rrf_merge(ranked_lists: list[list[int]], k: int = 60) -> dict:
    """Merge ranked lists by converting each rank to 1/(k+rank) and summing."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    return scores

# Usage
dense_top20 = list(np.argsort(-dense_scores)[:20])
bm25_top20  = list(np.argsort(-bm25_scores)[:20])
merged = rrf_merge([dense_top20, bm25_top20])
final_top5 = sorted(merged, key=merged.get, reverse=True)[:5]

RRF is the standard way to merge two rankers with different score scales. Parameter k=60 works well in most cases.

5.2 Reranker — re-sort your candidates

Dense + BM25 might return top-10, but relevant docs could rank low. A cross-encoder reranker fixes that.

Reranker effect Reranker effect

Dense vs cross-encoder:

Dense (Bi-encoder) Cross-encoder (Reranker)
Method Embed query and doc separately, then score with dot product Feed query + doc together, output relevance
Accuracy Medium High
Speed Very fast (millions of docs) Slower (practically top-50 max)
Role Recall (cast wide) Precision (rank tight)

In practice: Dense/BM25 gets you top-20–50, then cross-encoder re-sorts just the top 5–10.

rerank.py
import voyageai  # or cohere

vo = voyageai.Client()

def rerank(query: str, docs: list[str], top_k: int = 5):
    r = vo.rerank(
        query=query,
        documents=docs,
        model="rerank-2",
        top_k=top_k,
    )
    return [(item.document, item.relevance_score) for item in r.results]

# Usage
candidates = [docs[i] for i in merged_top20]
final = rerank("I want my money back", candidates, top_k=5)
for doc, score in final:
    print(f"{score:.3f}  {doc}")

Alternatives:

  • Voyage rerank-2 — multilingual · higher accuracy
  • Cohere rerank-v3 — established standard
  • BGE Reranker (open-source) — runs on your GPU
  • Custom cross-encoder (HuggingFace sentence-transformers/ms-marco-*)

5.3 MMR — diversity

If top-5 is all near-duplicates, information density plummets. MMR (Maximal Marginal Relevance) balances relevance and diversity:

score(doc) = λ · relevance(doc) − (1−λ) · max_sim(doc, already_selected)

Built into Chroma and LangChain:

results = col.query(
    query_embeddings=[q_emb],
    n_results=10,
    # LangChain: search_type="mmr", search_kwargs={"lambda_mult": 0.5}
)

With λ=0.5, you balance relevance and diversity equally. Lower it to emphasize diversity when you're getting too many similar chunks.

5.4 Metadata filters

Filter before search using where conditions:

filter_examples.py
# Only recent documents
col.query(
    query_embeddings=[q_emb],
    n_results=10,
    where={"updated_at": {"$gte": "2026-01-01"}},
)

# Only from one team (access control)
col.query(
    query_embeddings=[q_emb],
    n_results=10,
    where={"$and": [{"owner": "finance"}, {"lang": "en"}]},
)

Filter first, then search — you get better latency. Your first line of defense for access control · freshness · language separation.

5.5 Parameter tuning — chunk size · top-k · overlap

Set once and forget? No. You'll cycle through failure analysis → adjust repeatedly:

Parameter Increase it Decrease it
Chunk size Preserve context · lower precision · more tokens Finer detail · lose context · fewer tokens
Chunk overlap Protect boundaries · redundancy grows Better index efficiency · boundary cutoff risk
Top-k Higher recall · more noise · more prompt tokens Tight results · miss relevant docs

Starting point (English and mixed-language):

  • chunk_size=512, overlap=64, top-k=10 → rerank down to top-5

5.6 Collect and classify retrieval failures

retrieval_log.py
def log_retrieval(query, retrieved, used, user_feedback=None):
    """Log search results and user feedback."""
    record = {
        "query": query,
        "retrieved_ids": [r["id"] for r in retrieved],
        "retrieved_scores": [r["score"] for r in retrieved],
        "used_id": used["id"],           # what the LLM actually cited
        "feedback": user_feedback,       # 👍/👎
    }
    # Append to DB or file

Weekly review: Take 20 cases with 👎 feedback and bucket them into the 5 failure types from §2. Fix the most common first (that feeds into Ch 13 Advanced RAG).


6. Common breaking points

Mistake 1: Shrink chunks blindly

Recall goes up, but context gets cut and answer quality can drop. Tiny chunks in your prompt lose their relationship to each other.
Fix: Start with chunk_size=512 · overlap=64. Adjust only after failure analysis.

Mistake 2: Rerank top-500

Cross-encoders are slow. Reranking 500 docs pushes p95 latency to 5–10 seconds.
Fix: Dense/BM25 gets you top-20–50 first, then rerank only that.

Mistake 3: Expect fresh docs without filters

A question like "What's today's announcement?" returns a 3-year-old doc at rank 1. Semantic similarity doesn't capture recency.
Fix: Use updated_at filter + recency boost (multiply score by time decay).

Mistake 4: Tokenize Korean with just split()

"환불해주세요" becomes one token; BM25 fails to match. Semantically it's there, but no word overlap.
Fix: Use a morphological analyzer (kiwi, KoNLPy) before BM25.

Mistake 5: Don't track reranker cost

Voyage and Cohere rerank APIs cost money and add latency. Costs grow fast with top-k and call volume.
Fix: Log cost and latency per call → monthly budget. Consider switching to local BGE reranker.


7. Production checklist

  • Weekly review of 5 failure types (at least 20 samples)
  • Parameter log — chunk_size, overlap, top-k, rerank top-n
  • Dense vs BM25 vs Hybrid A/B test → pick your baseline
  • Reranker cost/latency dashboard (price per Voyage/Cohere call)
  • Metadata filter required — access control · freshness · language
  • Recency boost formula documented (e.g., score * exp(-days/30))
  • MMR lambda (λ) tuned per use case
  • Retrieval failure logs (separate bucket for top-1 score below threshold)

8. Practice problems

  • Run §4's dense_vs_bm25.py on 10 English and 10 non-English sentences, list strengths per type
  • Compare RRF merge output to each ranker's top-5 — quantify the merge benefit
  • Attach Voyage rerank-2 to §5.2, measure precision before/after top-10 → top-5
  • Vary chunk_size to 256 · 512 · 1024 on the same query, log precision/recall deltas
  • Artificially make old docs rank first, then solve with updated_at filter

9. Further reading


NextCh 13. Advanced RAG
Foundations set. Now HyDE · Self-RAG · GraphRAG · Agentic RAG — production-grade techniques from the papers.