Embedding Model Evaluation
Key Points
- 💡 The senior question: "which embedding model do I pick for my domain?" The answer is never "the top of the MTEB leaderboard" — it is "whichever model wins on a small golden set drawn from my corpus."
- The 2026 landscape: OpenAI text-embedding-3-large/-small, Cohere embed-v4 (multimodal), Voyage AI voyage-3-large and voyage-code (best-in-class general / code), Microsoft / open-weights E5, MxBai, BGE, Nomic.
- Dimensions matter: 256-d vs 1024-d vs 3072-d trade off quality, memory, and search latency. OpenAI supports Matryoshka truncation via the dimensions parameter: same model, smaller vectors.
- MTEB (Massive Text Embedding Benchmark) is the canonical public benchmark. Read the leaderboard with skepticism: top average ≠ best on your retrieval task.
- Domain eval is non-negotiable. Build a 50-200 entry golden set of (query, gold_doc_id) pairs from your corpus; measure recall@K, MRR, NDCG@10.
- Re-embedding cost when switching models can be the dominant migration line item: 100M docs × 500 tokens × $0.0001 / 1k tokens ≈ $5k just for OpenAI tier; weeks for self-hosted.
- ⚠️ Multilingual quality varies wildly per language; always evaluate per-locale.
Concepts (deep dive)
The 2026 model landscape
| Family | Notable models | Strengths | Hosted / Open |
|---|---|---|---|
| OpenAI | text-embedding-3-large (3072-d), -small (1536-d) | Strong general baseline; Matryoshka truncation; cheap | Hosted |
| Cohere | embed-v4 | Multimodal (text + image); strong multilingual | Hosted |
| Voyage AI | voyage-3-large, voyage-code-3, voyage-finance-2, voyage-law-2 | Frequently top-of-MTEB; strong domain variants | Hosted |
| Microsoft / E5 | multilingual-e5-large, e5-mistral-7b-instruct | Open-weights; good multilingual; strong instruction-tuned | Both |
| BGE (BAAI) | bge-large-en-v1.5, bge-m3 | Open-weights; competitive English & multilingual | Open |
| Mixedbread | mxbai-embed-large-v1 | Open-weights; small + capable | Open |
| Nomic | nomic-embed-text-v1.5 | Open-weights; long context (8192 tokens) | Both |
| Jina AI | jina-embeddings-v3 | Strong multilingual; long context | Both |
Default safe pick for most enterprise .NET shops in 2026: text-embedding-3-large truncated to 1024-d via the dimensions parameter. It is cheap, well-supported, fast, and good enough until your eval says otherwise.
Dimensions and Matryoshka truncation
Full embedding (3072-d): [v_0, v_1, v_2, ..., v_3071]
\__________________________/
quality
Truncated (1024-d): [v_0, v_1, v_2, ..., v_1023] <-- still meaningful!
Truncated (512-d): [v_0, v_1, v_2, ..., v_511]
Truncated (256-d): [v_0, v_1, v_2, ..., v_255]
Matryoshka representation learning trains the model so that the first N dimensions are themselves a usable embedding. OpenAI's text-embedding-3-* and Nomic v1.5 support this; you pass dimensions=1024 and get a meaningful 1024-d vector with ~95% of the full-dim quality at a third of the storage and search cost.
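If you already store full-length vectors, the same effect can be approximated client-side: keep the first N components and rescale to unit length, because a truncated prefix is no longer unit-norm. The sketch below is a minimal illustration under that assumption; TruncateAndNormalize is a hypothetical helper, the 4-d vector is toy data, and the approach only makes sense for models trained with a Matryoshka objective.

```csharp
using System;
using System.Linq;

// Keep the first `dims` components, then rescale to unit length so that
// cosine and dot-product scores stay comparable after truncation.
float[] TruncateAndNormalize(ReadOnlySpan<float> vector, int dims)
{
    if (dims <= 0 || dims > vector.Length)
        throw new ArgumentOutOfRangeException(nameof(dims));

    float[] head = vector[..dims].ToArray();
    double norm = Math.Sqrt(head.Sum(x => (double)x * x));
    if (norm == 0) return head;              // all-zero prefix: nothing to rescale

    for (int i = 0; i < head.Length; i++)
        head[i] = (float)(head[i] / norm);
    return head;
}

// Toy 4-d vector standing in for a real 3072-d embedding.
float[] fullVec = { 3f, 4f, 12f, 84f };
float[] small = TruncateAndNormalize(fullVec, 2);
Console.WriteLine($"kept {small.Length} of {fullVec.Length} dims");
```

Asking the API for the smaller size directly avoids storing the full vector at all; client-side truncation is mainly useful when you want to test several sizes against the golden set without re-embedding.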
| Dim | Memory (1M docs, float32) | Recall@10 (typical) | Use |
|---|---|---|---|
| 256 | 1 GB | 0.80-0.85 | High-volume / cheap retrieval |
| 512 | 2 GB | 0.88-0.92 | Default for "many tenants, modest corpora" |
| 1024 | 4 GB | 0.93-0.95 | Default sweet spot |
| 1536 | 6 GB | 0.95-0.97 | High-quality |
| 3072 | 12 GB | 0.96-0.97 | Marginal gain over 1536 — usually not worth it |
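The memory column follows from simple arithmetic: documents × dimensions × bytes per component. A small sketch, assuming float32 storage (4 bytes per component) and decimal gigabytes, and ignoring index overhead such as HNSW graph links or metadata; RawVectorGB is a hypothetical helper name.

```csharp
using System;

// Raw vector storage only: docs × dims × bytes per component, in decimal GB.
double RawVectorGB(long docs, int dims, int bytesPerComponent = 4) =>
    docs * (double)dims * bytesPerComponent / 1e9;

foreach (int d in new[] { 256, 512, 1024, 1536, 3072 })
    Console.WriteLine($"{d,5}-d  {RawVectorGB(1_000_000, d):F2} GB per 1M docs");
```

Quantized storage (int8 or binary) changes bytesPerComponent and shrinks these figures accordingly, at some recall cost you would need to measure on your own golden set.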
Reading the MTEB leaderboard correctly
MTEB scores models on roughly 60 datasets grouped into task types such as retrieval, classification, clustering, semantic textual similarity (STS), reranking, and bitext mining. The headline number is the average across all of them, which is misleading because:
- Your task is retrieval, not the average. Filter the leaderboard to retrieval tasks only.
- Your domain is your domain. A model that wins on Wikipedia QA may lose on legal contracts.
- Synthetic vs real queries. Many MTEB tasks are synthetic; your users are not.
- Rapid SOTA churn. The leader changes monthly; the capable set changes yearly.
Use MTEB as a candidate list, not a verdict.
Domain evaluation: build the golden set
+--------------------------------+
| Golden set: 100 entries |
| query → list of relevant docs |
+--------------------------------+
Sources:
- real user queries (logs, support tickets) ← best
- SME-curated examples
- hard cases (ambiguous, multi-hop, rare terms)
- adversarial (typos, code-switching, jargon)
For each (query, relevant_doc_ids) you measure:
| Metric | Definition | Use |
|---|---|---|
| Hit rate @ K | "Did any relevant doc appear in top-K?" | Sanity-check; binary success |
| Recall @ K | "What fraction of relevant docs are in top-K?" | Most important for RAG quality |
| MRR (Mean Reciprocal Rank) | "1 / rank of first relevant doc, averaged" | Rewards ranking the truth high |
| NDCG @ K | Normalized Discounted Cumulative Gain | Gold standard; rewards graded relevance |
| Precision @ K | "Fraction of top-K that are relevant" | Less useful when K is small and you have multiple gold docs |
For RAG, recall@10 is usually the headline metric — your LLM downstream needs the relevant chunk in its context window.
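The metrics in the table can be computed per query from a ranked list of doc IDs and a set of gold IDs. The sketch below assumes binary relevance (a doc is either gold or not); the function names and the sample IDs are illustrative, and MRR is the mean of ReciprocalRank over all queries in the golden set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// ranked: doc IDs in retrieval order (best first). relevant: gold doc IDs.

double HitRateAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k) =>
    ranked.Take(k).Any(id => relevant.Contains(id)) ? 1.0 : 0.0;

double RecallAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k) =>
    relevant.Count == 0
        ? 0.0
        : ranked.Take(k).Count(id => relevant.Contains(id)) / (double)relevant.Count;

// 1 / rank of the first relevant doc; 0 if none was retrieved.
double ReciprocalRank(IReadOnlyList<string> ranked, ISet<string> relevant)
{
    for (int i = 0; i < ranked.Count; i++)
        if (relevant.Contains(ranked[i])) return 1.0 / (i + 1);
    return 0.0;
}

// Binary-gain NDCG: each relevant hit at 0-based position i contributes 1 / log2(i + 2).
double NdcgAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k)
{
    double dcg = 0;
    for (int i = 0; i < Math.Min(k, ranked.Count); i++)
        if (relevant.Contains(ranked[i])) dcg += 1.0 / Math.Log2(i + 2);

    double ideal = 0;   // best case: all relevant docs packed at the top
    for (int i = 0; i < Math.Min(k, relevant.Count); i++)
        ideal += 1.0 / Math.Log2(i + 2);

    return ideal == 0 ? 0.0 : dcg / ideal;
}

var rankedIds = new List<string> { "d3", "d1", "d7", "d2" };
var goldIds = new HashSet<string> { "d1", "d2" };

Console.WriteLine($"hit@3    {HitRateAtK(rankedIds, goldIds, 3)}");
Console.WriteLine($"recall@3 {RecallAtK(rankedIds, goldIds, 3)}");
Console.WriteLine($"RR       {ReciprocalRank(rankedIds, goldIds)}");
Console.WriteLine($"NDCG@3   {NdcgAtK(rankedIds, goldIds, 3):F3}");
```

In this sample only one of the two gold docs lands in the top 3, so hit rate reports full success while recall@3 reports half, which is the gap the table describes.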
Cost dimension
- Hosted: OpenAI text-embedding-3-small ~ $0.00002 / 1k tokens; -large ~ $0.00013 / 1k tokens. Cohere embed-v4 and Voyage roughly comparable.
- Self-hosted: amortized cost is GPU-hours. A bge-large-en-v1.5 on an A10 does ~5k embeddings/sec. Break-even vs hosted at ~50M-100M embeddings / month.
- Re-embedding cost: irreversible decision. 100M docs × ~500 tokens × $0.00013/1k = $6.5k. Plan for it before model swaps.
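The re-embedding arithmetic above can be kept as a tiny calculator so the number is recomputed whenever corpus size or price changes. This is a sketch: the helper names are hypothetical, and the 1M tokens-per-minute rate limit is an assumed figure for illustration, not a quoted vendor limit.

```csharp
using System;

// One-off API spend to embed (or re-embed) a whole corpus.
decimal ReembeddingCostUsd(long docs, int avgTokensPerDoc, decimal usdPer1kTokens) =>
    docs * (decimal)avgTokensPerDoc / 1000m * usdPer1kTokens;

// Wall-clock days if throughput is capped by a tokens-per-minute limit.
double DaysAtRateLimit(long docs, int avgTokensPerDoc, long tokensPerMinute) =>
    (double)docs * avgTokensPerDoc / tokensPerMinute / (60 * 24);

decimal cost = ReembeddingCostUsd(100_000_000, 500, 0.00013m);
double days = DaysAtRateLimit(100_000_000, 500, tokensPerMinute: 1_000_000); // assumed limit

Console.WriteLine($"cost: ${cost:N0}, days at 1M TPM: {days:F1}");
```

Under the assumed limit the wall-clock time, not the dollar amount, is the bigger planning constraint, which is why the migration needs scheduling as well as budget.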
Multilingual
- multilingual-e5-large and bge-m3 are competitive open-weights for multilingual.
- Cohere embed-multilingual-v3.0 is the hosted leader for many language pairs.
- ⚠️ Per-language quality varies. A model that scores 0.90 on English may score 0.65 on Tagalog. Build a per-language eval set if you serve >2 languages.
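Per-locale evaluation mostly means tagging each golden entry with a locale and reporting recall per group instead of one global number. A minimal sketch, assuming retrieval has already run and you hold the top-K IDs per query; the tuple shape, RecallByLocale, and the sample rows are all illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Micro-averaged recall per locale: gold docs found in top-K / total gold docs.
Dictionary<string, double> RecallByLocale(
    IEnumerable<(string Locale, string[] Gold, string[] TopK)> rows) =>
    rows.GroupBy(r => r.Locale)
        .ToDictionary(
            g => g.Key,
            g => g.Sum(r => r.Gold.Count(id => r.TopK.Contains(id)))
                 / (double)g.Sum(r => r.Gold.Length));

// Each row: locale tag, gold doc IDs, and the top-K doc IDs the retriever returned.
var results = new (string Locale, string[] Gold, string[] TopK)[]
{
    ("en", new[] { "a" }, new[] { "a", "b" }),
    ("en", new[] { "c" }, new[] { "c", "d" }),
    ("tl", new[] { "e" }, new[] { "e", "f" }),
    ("tl", new[] { "g" }, new[] { "h", "i" }),   // miss
};

foreach (var (locale, recall) in RecallByLocale(results))
    Console.WriteLine($"{locale}: recall {recall:P0}");
```

A single blended score over these rows would read 75% and hide that the second locale is at half the recall of the first.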
How it works under the hood
Embedding models map text → fixed-length dense vectors such that semantically similar text → nearby vectors (cosine or dot-product close).
"How do I cancel my subscription?"
│
v
+----------------+
| Tokenizer | → token ids
+----------------+
│
v
+----------------+
| Transformer | → contextual hidden states
| (encoder-only) | [seq_len × hidden_dim]
+----------------+
│
v
+----------------+
| Pooling | → single vector
| (mean / CLS) | [hidden_dim]
+----------------+
│
v
+----------------+
| Normalize | → unit vector for cosine
| (optional | [dim]
| L2) |
+----------------+
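Most modern embedding models are encoder-only transformers fine-tuned with contrastive learning: pairs of (query, positive doc) are pulled close in vector space; pairs of (query, hard negative) are pushed apart. Some, such as e5-mistral-7b-instruct, adapt a decoder-only LLM instead. Many models also distinguish queries from documents at encode time. The E5 family prepends a literal text prefix, "query: ..." for queries and "passage: ..." for documents; Voyage exposes the same distinction through an input_type setting in its API rather than a prefix you type yourself. Either way the marker changes the vector and must be applied consistently at index and query time.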
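The last two boxes of the pipeline, pooling and normalization, are small enough to sketch directly. The example below assumes mean pooling with an attention mask (CLS pooling would simply take the first row) and uses toy 2-d hidden states; MeanPool, L2Normalize, and Dot are hypothetical helper names, not part of any library.

```csharp
using System;

// states: [seqLen][hiddenDim] contextual vectors; attention[t] is false for padding tokens.
float[] MeanPool(float[][] states, bool[] attention)
{
    int dim = states[0].Length;
    var sum = new float[dim];
    int count = 0;
    for (int t = 0; t < states.Length; t++)
    {
        if (!attention[t]) continue;          // padding must not dilute the average
        for (int d = 0; d < dim; d++) sum[d] += states[t][d];
        count++;
    }
    if (count == 0) throw new ArgumentException("mask excludes every token");
    for (int d = 0; d < dim; d++) sum[d] /= count;
    return sum;
}

// Scale to unit length so that a plain dot product equals cosine similarity.
float[] L2Normalize(float[] v)
{
    double norm = 0;
    foreach (float x in v) norm += (double)x * x;
    norm = Math.Sqrt(norm);
    var result = new float[v.Length];
    if (norm == 0) return result;
    for (int i = 0; i < v.Length; i++) result[i] = (float)(v[i] / norm);
    return result;
}

float Dot(float[] a, float[] b)
{
    float s = 0;
    for (int i = 0; i < a.Length; i++) s += a[i] * b[i];
    return s;
}

float[][] tokenStates = { new[] { 1f, 0f }, new[] { 3f, 4f }, new[] { 9f, 9f } };
bool[] tokenMask = { true, true, false };     // third token is padding

float[] pooled = MeanPool(tokenStates, tokenMask);
float[] unit = L2Normalize(pooled);
Console.WriteLine($"self-similarity: {Dot(unit, unit):F3}");
```

Real models run these steps inside the serving stack; the point of the sketch is that the pooling strategy and the normalization step are part of the model's contract, so a mismatch between indexing and querying code changes the vectors.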
Code: correct vs wrong
✅ Correct: A/B eval harness in .NET
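using Microsoft.Extensions.AI;
using OpenAI; // OpenAIClient; AsIEmbeddingGenerator ships in the Microsoft.Extensions.AI.OpenAI package

// Assumed to be in scope: `key` (API key), `corpus` (DocId/Text pairs), `golden` (labelled queries).

// Micro-averaged recall@K: relevant docs found in the top K, summed over all queries,
// divided by the total number of relevant docs.
async Task<double> RecallAtK(
    IEmbeddingGenerator<string, Embedding<float>> emb,
    IReadOnlyList<(string DocId, string Text)> corpus,
    IReadOnlyList<GoldenEntry> golden,
    int k = 10)
{
    // Pre-embed the corpus once per model. Batch this call for large corpora.
    var corpusEmb = (await emb.GenerateAsync(corpus.Select(c => c.Text)))
        .Zip(corpus, (e, c) => (c.DocId, Vec: e.Vector))
        .ToList();
    int totalRelevant = 0, foundRelevant = 0;
    foreach (var entry in golden)
    {
        var q = (await emb.GenerateAsync([entry.Query]))[0].Vector;
        // Brute-force scan: fine for a small eval corpus, not for production search.
        var top = corpusEmb
            .Select(c => (c.DocId, Sim: Cosine(q.Span, c.Vec.Span)))
            .OrderByDescending(x => x.Sim)
            .Take(k)
            .Select(x => x.DocId)
            .ToHashSet();
        totalRelevant += entry.RelevantDocIds.Length;
        foundRelevant += entry.RelevantDocIds.Count(id => top.Contains(id));
    }
    return totalRelevant == 0 ? 0.0 : (double)foundRelevant / totalRelevant;
}

// Cosine similarity between two equal-length vectors.
static float Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}

// Compare two candidates head-to-head.
var openAi = new OpenAIClient(key).GetEmbeddingClient("text-embedding-3-small")
    .AsIEmbeddingGenerator();
var voyage = /* Voyage SDK adapter */;
var rOpenAi = await RecallAtK(openAi, corpus, golden);
var rVoyage = await RecallAtK(voyage, corpus, golden);
Console.WriteLine($"OpenAI-3-small recall@10: {rOpenAi:P2}");
Console.WriteLine($"Voyage-3-large recall@10: {rVoyage:P2}");

// Type declarations must follow top-level statements.
record GoldenEntry(string Query, string[] RelevantDocIds);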
❌ Wrong: comparing models on a synthetic eval set you generated from one of them
// You asked GPT-4o to write 100 queries about your corpus.
// You're now using those queries to grade embedding models. Bias incoming.
var synthetic = await llm.GenerateQueriesAsync(corpus);
var recall = await RecallAtK(openAi, corpus, synthetic);
Synthetic queries inherit the style and vocabulary of the generator. They drift from real users and inflate the score for embedding models trained on a similar distribution.
✅ Correct: query / passage prefixes for E5-style models
// E5-style models expect different prefixes for queries vs passages.
// (Voyage selects the same behaviour through an input_type setting; check each model card.)
string FormatForIndex(string text) => $"passage: {text}";
string FormatForQuery(string text) => $"query: {text}";
var docVecs = await emb.GenerateAsync(corpus.Select(c => FormatForIndex(c.Text)));
var qVec = (await emb.GenerateAsync([FormatForQuery(userQuery)]))[0].Vector;
❌ Wrong: no prefix (or the same prefix) on both sides
// Silent recall drop of 10-20%: the inputs no longer match the format the model was trained on.
var qVec = (await emb.GenerateAsync([userQuery]))[0].Vector;
var docVecs = await emb.GenerateAsync(corpus.Select(c => c.Text));
✅ Correct: Matryoshka truncation via OpenAI dimensions
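var emb = new OpenAIClient(key)
    .GetEmbeddingClient("text-embedding-3-large")
    // Request 1024-d vectors instead of the full 3072-d, with minimal quality loss.
    // The adapter takes the default size as an int; the parameter name reflects recent
    // Microsoft.Extensions.AI.OpenAI releases, so verify it against the version you pin.
    .AsIEmbeddingGenerator(defaultModelDimensions: 1024);
// Per-call alternative: pass new EmbeddingGenerationOptions { Dimensions = 1024 } to GenerateAsync.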
Design patterns for this topic
Pattern 1 — "Golden-set bake-off"
- Intent: pick the model from data, not vibes.
- Tactics: 100-entry golden set; measure recall@10, MRR, NDCG@10 across 3-5 candidates; pick winner; re-run when models drop.
Pattern 2 — "Truncate first, scale later"
- Intent: spend storage where it matters.
- Tactics: start with text-embedding-3-large truncated to 1024-d. If recall drops below target on your eval, raise to 1536 or 3072 before swapping models.
Pattern 3 — "Versioned indexes for migration"
- Intent: swap embedding model with zero downtime.
- Tactics: see Vector DB Tuning. Build docs_v2 next to docs_v1; backfill; flip reads atomically.
Pattern 4 — "Per-domain models for mixed corpora"
- Intent: code search and contract search shouldn't share a model.
- Tactics: separate index per domain; route at query time. voyage-code-3 for code, voyage-3-large for prose.
Pattern 5 — "Self-host break-even calculator"
- Intent: decide hosted vs self-hosted with math.
- Tactics: monthly token volume × hosted price vs amortized GPU cost (A10 / L4 instance + ops). Switch when self-hosted is <60% of hosted line item including ops headcount.
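Pattern 5 reduces to a few lines of arithmetic. In the sketch below every price is a placeholder: the $1.00/hour GPU rate and the $1,500/month ops allocation are invented for illustration, the hosted price is the -large figure quoted earlier, and the helper names are hypothetical. The 60% threshold is the one from the tactic above.

```csharp
using System;

decimal HostedMonthlyUsd(long embeddingsPerMonth, int avgTokens, decimal usdPer1kTokens) =>
    embeddingsPerMonth * (decimal)avgTokens / 1000m * usdPer1kTokens;

// GPU rental for a 30-day month plus a flat ops allocation (monitoring, patching, on-call).
decimal SelfHostedMonthlyUsd(decimal gpuUsdPerHour, int gpuCount, decimal opsUsdPerMonth) =>
    gpuUsdPerHour * 24m * 30m * gpuCount + opsUsdPerMonth;

// Switch only when self-hosting costs less than `threshold` of the hosted bill.
bool ShouldSelfHost(decimal hostedUsd, decimal selfHostedUsd, decimal threshold = 0.60m) =>
    selfHostedUsd < hostedUsd * threshold;

// Placeholder prices: substitute your own quotes.
decimal selfHosted = SelfHostedMonthlyUsd(gpuUsdPerHour: 1.00m, gpuCount: 1, opsUsdPerMonth: 1500m);

foreach (long volume in new long[] { 50_000_000, 75_000_000, 100_000_000 })
{
    decimal hosted = HostedMonthlyUsd(volume, 500, 0.00013m);
    Console.WriteLine(
        $"{volume:N0}/mo: hosted ${hosted:N0}, self-hosted ${selfHosted:N0}, switch: {ShouldSelfHost(hosted, selfHosted)}");
}
```

With these placeholder inputs the decision flips somewhere between 50M and 75M embeddings per month; the crossover moves substantially with the ops allocation, which is usually the least certain input.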
Pros & cons / trade-offs
| Aspect | Hosted (OpenAI / Cohere / Voyage) | Open-weights / self-hosted (E5, BGE, MxBai) |
|---|---|---|
| Quality ceiling | Top-of-MTEB on average | Competitive on domain after fine-tune |
| Latency | Network RTT + provider | Co-located GPU; <10 ms |
| Cost at scale | Linear with tokens | Amortized GPU; cheap at >50M embeddings/mo |
| Data residency | Sent to vendor | On-prem / VPC |
| Maintenance | None | Container, GPU pool, monitoring |
| Customization | None | Domain fine-tuning possible |
When to use / when to avoid
Re-evaluate / swap models when:
- Your eval set's recall@10 has plateaued or regressed and a newer model has appeared.
- Your domain shifted (added languages, added code, added images).
- The hosted vendor deprecated your current model.
Don't re-evaluate when:
- You don't have a golden set yet; build that first.
- The retrieval bottleneck is upstream (chunking, query rewriting); an embedding swap won't fix it.
- Re-embedding cost > expected quality gain × user value. Do the math.
Interview Q&A
1. Walk me through how you'd pick an embedding model for a new RAG project. Build a 50-100 entry golden set from real user queries, falling back to SME-written ones where logs are thin, with SME-labelled gold doc IDs. Pick a candidate list (OpenAI 3-large, Voyage 3-large, Cohere embed-v4, plus one open-weights like bge-m3 if data residency matters). For each candidate, embed the corpus and the queries, compute recall@10 and NDCG@10. Pick the winner. Document cost-per-query and latency next to quality so the trade-off is explicit. Re-run quarterly or whenever a major model drops.
2. Why is "top of MTEB" not the right answer? MTEB averages across ~60 tasks of varying relevance to your problem. Your task is retrieval on your corpus, in your language, on your style of query. A model that wins overall MTEB may lose on legal contracts or rare-language support. MTEB is a candidate-list filter, not a verdict — always validate on a domain golden set.
3. What is Matryoshka representation learning and why does it matter? Models trained with Matryoshka loss produce embeddings where the first N dimensions are themselves a usable embedding. OpenAI's text-embedding-3-* family supports this: ask for dimensions=1024 instead of the full 3072 and you get a meaningful 1024-d vector with ~95% of full-dim quality at a third of the storage and search cost. It is the cheapest quality knob in RAG.
4. The instruction prefix trap: what is it? Many embedding models expect queries and documents to be marked differently. E5 uses literal prefixes, "query: ..." and "passage: ..."; Voyage uses an input_type setting in its API. If you use the same marker (or none) on both sides, the query and document vectors live in slightly misaligned subspaces and recall silently drops 10-20%. Always check the model card and apply the markers consistently at index and query time.
5. Recall@K vs MRR vs NDCG — when each? Recall@K is the bluntest and most useful for RAG: "did I get the relevant doc into the LLM's context?". MRR rewards ranking the truth high — useful when you only show a top-1 result. NDCG@K is the gold standard when you have graded relevance ("highly relevant" vs "marginally relevant") because it weights position and grade together. For RAG with binary relevance, recall@10 and MRR are sufficient.
6. How big should a golden set be? 50 to start, 100-200 for confidence, 500+ for stable rankings of close-quality models. Stratified across feature areas, languages, and difficulty. Quality matters more than size — a 50-entry set hand-curated by SMEs beats a 500-entry synthetic set every time.
7. What is the cost of switching embedding models on a 50M-doc corpus? Two costs. (1) Re-embedding: 50M docs × ~500 tokens × $0.0001/1k tokens = ~$2.5k for OpenAI tier; days of API time at sane rate limits. (2) Re-indexing: storage cost for v2 alongside v1 during migration; engineering time for the dual-write / dual-read pattern. Plan for both before signing off on the swap.
8. When does fine-tuning an embedding model pay off? When you have (a) a domain whose vocabulary the base models under-fit (legal, biomedical, deep technical), (b) labelled positive/negative pairs from real usage, and (c) volume to amortize the engineering cost. For most enterprise .NET shops the answer is don't fine-tune: pick a stronger base model or a domain-pretrained variant (voyage-finance-2, voyage-law-2) instead.
9. What is the "hit rate" metric and why is it weak? Hit rate @ K is binary: did any gold doc appear in top-K? It's easy to compute and gives a quick sanity check. It is weak because it ignores rank position (a doc at rank 1 vs rank 10 looks the same) and ignores the case where you have multiple gold docs and only got one. Use it for smoke tests; promote to recall / MRR / NDCG for real comparisons.
10. Why does multilingual matter even for English-first products? Real users code-switch, paste foreign-language quotes, and submit names in non-Latin scripts. An English-only model degrades on these queries. If even 5% of your traffic is non-English, run a multilingual model (or detect language and route) and evaluate per-locale. The cost difference between text-embedding-3-small and multilingual-e5-large is negligible vs the support burden of "search doesn't work for our European tenants".
11. How do you detect that your embedding model has degraded after deployment? Three signals. (1) Eval drift: re-run the golden set monthly; alert on recall@10 regression. (2) User signals: thumbs-down rate, regenerate-rate, time-to-accept. (3) Distribution shift: track the cosine-similarity distribution of top-1 results — if it shifts, your embedding space changed (silent vendor model upgrade, or input distribution shift).
12. Open-weights vs hosted: what is the senior take? Hosted is the default in 2026 for time-to-market. Move to open-weights when (a) data residency forbids egress, (b) per-token cost dominates your bill, (c) you need fine-tuning, or (d) you want predictable behaviour across vendor model upgrades. Don't underestimate ops cost: a self-hosted A10 GPU embedding service is one more thing to monitor, scale, and patch.
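The distribution-shift signal from question 11 can be approximated with a very simple check: compare the mean top-1 cosine similarity of recent queries against a stored baseline. The sketch below is one possible heuristic, not a standard method; the z-score threshold of 3, the helper names, and the sample values are all assumptions.

```csharp
using System;
using System.Linq;

double Mean(double[] xs) => xs.Average();

double SampleStdDev(double[] xs)
{
    double m = Mean(xs);
    return Math.Sqrt(xs.Sum(x => (x - m) * (x - m)) / (xs.Length - 1));
}

// Flags drift when the current mean sits more than `zThreshold` standard errors
// (estimated from the baseline spread) away from the baseline mean.
bool HasDrifted(double[] baseline, double[] current, double zThreshold = 3.0)
{
    double stdErr = SampleStdDev(baseline) / Math.Sqrt(current.Length);
    double gap = Math.Abs(Mean(current) - Mean(baseline));
    if (stdErr == 0) return gap > 0;
    return gap / stdErr > zThreshold;
}

// Top-1 cosine similarities sampled from query logs (toy values).
double[] baselineSims = { 0.80, 0.82, 0.78, 0.81, 0.79 };
double[] steadySims   = { 0.80, 0.81, 0.79, 0.80 };
double[] shiftedSims  = { 0.70, 0.72, 0.69, 0.71 };

Console.WriteLine($"steady drifted:  {HasDrifted(baselineSims, steadySims)}");
Console.WriteLine($"shifted drifted: {HasDrifted(baselineSims, shiftedSims)}");
```

A flag here says only that the similarity distribution moved; the golden-set re-run is still what tells you whether retrieval quality actually regressed.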
Gotchas / common mistakes
- ⚠️ Trusting MTEB blindly. Top average ≠ winner on your retrieval task.
- ⚠️ Evaluating on synthetic queries the same model wrote. Confirmation bias.
- ⚠️ Forgetting instruction prefixes. Silent 10-20% recall drop on E5 / Voyage.
- ⚠️ Using text-embedding-ada-002 in 2026. Deprecated; replace with text-embedding-3-small or -large.
- ⚠️ Different models at index and query time. Vectors live in different spaces; recall collapses.
- ⚠️ Not re-evaluating after a vendor model upgrade. Hosted models change behind moving aliases. Pin to dated snapshots when available.
- ⚠️ Picking 3072-d "because more is better". Marginal recall gain over 1024-d, 3× storage and search cost.
- ⚠️ Skipping per-locale eval. A model that scores 0.90 in English may score 0.65 in Polish.
- ⚠️ No re-embedding budget. Model swap on a 100M-doc corpus is a multi-thousand-dollar line item; plan it.
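- ⚠️ Mixing distance functions. On unit-normalized vectors cosine, dot product, and L2 produce the same ranking, so the trap is unnormalized vectors: a model trained for cosine but indexed with L2 or raw dot product → degraded recall, no clear failure signal.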
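The distance-function gotcha is easy to reproduce with three toy vectors. The sketch below shows cosine and L2 disagreeing on raw vectors and agreeing once everything is unit-normalized, which follows from the identity ‖a − b‖² = 2 − 2·cos(a, b) for unit vectors. All names and values are illustrative.

```csharp
using System;
using System.Linq;

double Dot(double[] a, double[] b) => a.Zip(b, (x, y) => x * y).Sum();
double Cosine(double[] a, double[] b) => Dot(a, b) / Math.Sqrt(Dot(a, a) * Dot(b, b));
double L2(double[] a, double[] b) => Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
double[] Normalize(double[] v)
{
    double n = Math.Sqrt(Dot(v, v));
    return v.Select(x => x / n).ToArray();
}

double[] query = { 1, 0 };
double[] near = { 10, 1 };   // almost the same direction, large magnitude
double[] far = { 1, 1 };     // 45 degrees away, small magnitude

// Raw vectors: cosine ranks `near` first, L2 ranks `far` first.
bool cosinePrefersNear = Cosine(query, near) > Cosine(query, far);
bool l2PrefersNearRaw = L2(query, near) < L2(query, far);

// Unit vectors: L2 now agrees with cosine.
bool l2PrefersNearUnit =
    L2(Normalize(query), Normalize(near)) < L2(Normalize(query), Normalize(far));

Console.WriteLine($"cosine prefers near: {cosinePrefersNear}");
Console.WriteLine($"L2 (raw) prefers near: {l2PrefersNearRaw}");
Console.WriteLine($"L2 (unit) prefers near: {l2PrefersNearUnit}");
```

Normalizing at write time and at query time is therefore a cheap guard when you are unsure which metric an index was configured with.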
Further reading
- MTEB leaderboard (HuggingFace)
- OpenAI: New embedding models and API updates (Matryoshka)
- Cohere Embed v4
- Voyage AI documentation
- BGE M3-Embedding (paper)
- E5: Text Embeddings by Weakly-Supervised Contrastive Pre-training
- Microsoft.Extensions.AI: IEmbeddingGenerator
- Evaluate retrieval quality with Microsoft.Extensions.AI.Evaluation
- Matryoshka Representation Learning (paper)
API note: OpenAI's dimensions parameter only works on text-embedding-3-small and text-embedding-3-large; older models (ada-002) do not support truncation. The Voyage and Cohere SDKs are community / REST in .NET; check Microsoft Learn for first-party IEmbeddingGenerator adapters before pinning a version.