Ch 10. Embeddings and Vector Search Basics¶
What you'll learn
- Embeddings — turning sentences into high-dimensional vectors and why that's useful
- Cosine similarity to measure "meanings are close"
- What vector databases (Chroma · Qdrant · Pinecone) actually do
- ANN (Approximate Nearest Neighbor) and why you don't brute-force search
- Real-world traps: dimension mismatch · mixed languages · normalization
Prerequisites
Read Ch 9 "Why RAG?" and understand RAG's role in the pipeline.
1. Concept — sentences as numbers¶
You know intuitively that "dog" and "cat" are similar, while "dog" and "pizza" are far apart. A computer sees none of that. By themselves, the text strings are just different characters.
Embeddings turn that intuition into numbers:
"dog" → [0.12, -0.03, 0.87, ..., 0.41] (1536 numbers)
"cat" → [0.14, -0.01, 0.82, ..., 0.39] (points almost the same direction)
"pizza" → [-0.40, 0.78, -0.12, ..., 0.05] (points completely different)
Similar meaning becomes nearby vectors. That's the whole idea.
In reality that's 1536 dimensions (OpenAI text-embedding-3-small). The picture above shrinks it to 2D for intuition.
2. Why you need this — the limits of keyword search¶
Old-school keyword search (BM25 etc.) only finds exact word matches:
| Query | Traditional search result |
|---|---|
| "refund" | Documents with "refund" in them |
| "get my money back" | No refund docs found (keyword mismatch) |
| "refund policy" | Documents in English only |
Semantic search based on embeddings finds meaning, not keywords:
- "get my money back" ≈ "refund" → same cluster
- "refund policy" ≈ "환불 정책" (multilingual models)
This is what powers RAG.
3. Where embeddings are used¶
- RAG search — the main job of Part 3
- Semantic search — internal wikis, code search
- Recommendations — finding similar products or content
- Deduplication — clustering similar documents
- Classification (zero-shot) — compare text to category embeddings
4. Minimal example — similarity of three sentences¶
- OpenAI's embedding model. Anthropic doesn't have a dedicated embedding API; Voyage AI is recommended instead.
- Cosine similarity: the angle between two vectors. Range: -1 to 1. Closer to 1 = more similar.
If "dog-dog" is near 0.9 and "dog-pizza" is near 0.3, your embedding model works well. If "dog-pizza" is above 0.5, the model is either too coarse or weak on English.
5. Hands-on¶
5.1 Choosing an embedding model¶
| Model | Dimension | Multilingual | Cost | Notes |
|---|---|---|---|---|
OpenAI text-embedding-3-small |
1536 | △ | $ | General purpose, budget-friendly |
OpenAI text-embedding-3-large |
3072 | ○ | $$$ | Better quality |
Voyage voyage-3 |
1024 | ○ | $$ | Anthropic-recommended |
| BGE-M3 (open source) | 1024 | ○○ | Free (local GPU) | Strong multilingual |
| sentence-transformers (open) | 384–768 | △ | Free | Lightweight, fast |
How to choose: - Heavy Korean content → BGE-M3 or Voyage - Prototyping → OpenAI small - GPU available → BGE-M3 locally - All-in Anthropic → Voyage (official recommendation)
5.2 Cosine similarity and normalization¶
- For millions of documents, this optimization matters.
np.dotgets GPU/BLAS acceleration.
5.3 Vector DB — hands-on with Chroma¶
Finding top-k from millions of documents goes to a vector database. Chroma is the easiest starting point:
- Persists to disk (
./chroma_db/). For memory-only, usechromadb.Client(). - Store documents, vectors, and metadata together. Metadata is used for filtering and citations.
- Query by embedding, get top-k. Distance is the metric (lower = closer). For cosine distance,
0 = exact match.
5.4 ANN — why not brute-force?¶
Finding top-10 from 1M documents means 1M cosine calculations. Too slow.
ANN (Approximate Nearest Neighbor) returns a near-perfect result fast:
- HNSW (Hierarchical Navigable Small World) — hierarchical graph traversal. Chroma and Qdrant use this by default.
- IVF (Inverted File) — narrow down clusters first
- PQ (Product Quantization) — compress vectors to save memory
At 99% accuracy, you get 100× speedup. For Part 3, just know "ANN exists" — your vector DB handles it.
5.5 Query vs. document embeddings — asymmetric¶
Embedding queries and documents with the same model usually works. But long documents vs. short queries benefit from newer models that support role hints:
Voyage example:
# For storage (documents)
client.embed(texts=docs, input_type="document")
# For search (queries)
client.embed(texts=[query], input_type="query")
This is called symmetric vs. asymmetric embedding. If accuracy gaps are large, use different roles for queries and documents.
6. Common pitfalls¶
Pitfall 1. Dimension mismatch
You embed with OpenAI small (1536 dims), store vectors, then switch to Voyage (1024 dims). Explosion. Or upgrade your model and dimensions change.
Fix: Record embedding_model and dimension in collection metadata. On model change, re-embed everything. Happens more often than you'd think.
Pitfall 2. Normalization confusion
Some models return L2 normalized (length=1), others don't. Using dot product on non-normalized vectors gives wrong values.
Fix: Force normalization post-embedding: vec / np.linalg.norm(vec). If your DB uses cosine distance, it handles this internally.
Pitfall 3. Mixed languages
Korean docs + English query (or vice versa) tanks quality unless you use a multilingual model.
Fix: Check multilingual benchmarks for Voyage, BGE-M3, or OpenAI's large model. For Korean-only, consider KURE or KoE5.
Pitfall 4. Over-interpreting short queries
A one-word query like "refund" has high variance in the embedding space — lots of noise. Unrelated docs creep into top-k.
Fix: (a) Use HyDE to generate hypothetical answers first (Ch 13), (b) Hybrid keyword+semantic search (Ch 12), (c) Prompt-based query expansion.
Pitfall 5. No metadata on save
Later you forget which doc a result came from, which section, who owns it. No citations. No permission filtering.
Fix: Always store metadata at save time: source, page, section, updated_at, owner.
7. Operations checklist¶
- Log embedding model · dimension · version in collection metadata
- Normalization strategy consistent across documents and queries
- Language distribution monitoring (query/doc language ratios) → guides model replacement
- Re-embedding script ready (batch operation for model upgrades)
- Metadata filtering active (
updated_at >= X,owner == Y) - Vector DB backups periodic (re-embedding at scale is expensive)
- Search latency monitored (p95 > 300ms? tune ANN parameters)
- PII in embeddings — be aware: personal info bakes into vectors (reconstruction attacks studied)
8. Exercises¶
- Take §4's
similarity.py, run it on 10 pairs (synonyms, antonyms, unrelated), and make a table of similarity scores - Add 5–10 FAQs to the Chroma example in §5.3 and test top-k quality with varied queries
- Run the same docs on both OpenAI small and Voyage-3 side by side. Compare top-3 result differences.
- Intentionally skip normalization on a vector. Calculate cosine similarity. See how much values drift.
- Mix English questions with Korean documents. Measure top-k quality. Feel the need for multilingual models.
9. Sources and further reading¶
- OpenAI Embeddings: platform.openai.com/docs/guides/embeddings
- Voyage AI: docs.voyageai.com — Anthropic's official recommendation
- BGE-M3 (multilingual open source): huggingface.co/BAAI/bge-m3
- Chroma: docs.trychroma.com
- HNSW paper — Malkov & Yashunin (2018), "Efficient and Robust ANN Search Using Hierarchical NSW Graphs"
- Stanford CME 295 Lec 7 — research summary in
_research/stanford-cme295.md
Next → Ch 11. The RAG Pipeline End-to-End
Embeddings plus vector DB are ingredients. Next: retrieval, ranking, citation — the full flow from document ingestion to grounded answers.