Embedding Model Evaluation

Key Points

💡 The senior question: "which embedding model do I pick for my domain?" The answer is never "the top of the MTEB leaderboard" — it is "whichever model wins on a small golden set drawn from my corpus."
The 2026 landscape: OpenAI text-embedding-3-large/-small, Cohere embed-v4 (multimodal), Voyage AI voyage-3-large and voyage-code-3 (frequently top-ranked for general text / code), plus open-weights E5 (Microsoft), MxBai, BGE, Nomic.
Dimensions matter: 256-d vs 1024-d vs 3072-d trade off quality, memory, and search latency. OpenAI supports Matryoshka truncation via the dimensions parameter — same model, smaller vectors.
MTEB (Massive Text Embedding Benchmark) is the canonical public benchmark. Read the leaderboard with skepticism: top average ≠ best on your retrieval task.
Domain eval is non-negotiable. Build a 50-200 entry golden set of (query, gold_doc_id) pairs from your corpus; measure recall@K, MRR, NDCG@10.
Re-embedding cost when switching models can be the dominant migration line item: 100M docs × 500 tokens × $0.0001 / 1k tokens ≈ $5k at OpenAI-tier pricing, and potentially weeks of GPU time if self-hosted.
⚠️ Multilingual quality varies wildly per language; always evaluate per-locale.

Concepts (deep dive)

The 2026 model landscape

Family	Notable models	Strengths	Hosted / Open
OpenAI	`text-embedding-3-large` (3072-d), `-small` (1536-d)	Strong general baseline; Matryoshka truncation; cheap	Hosted
Cohere	`embed-v4`	Multimodal (text + image); strong multilingual	Hosted
Voyage AI	`voyage-3-large`, `voyage-code-3`, `voyage-finance-2`, `voyage-law-2`	Frequently top-of-MTEB; strong domain variants	Hosted
Microsoft / E5	`multilingual-e5-large`, `e5-mistral-7b-instruct`	Open-weights; good multilingual; strong instruction-tuned	Both
BGE (BAAI)	`bge-large-en-v1.5`, `bge-m3`	Open-weights; competitive English & multilingual	Open
Mixedbread	`mxbai-embed-large-v1`	Open-weights; small + capable	Open
Nomic	`nomic-embed-text-v1.5`	Open-weights; long context (8192 tokens)	Both
Jina AI	`jina-embeddings-v3`	Strong multilingual; long context	Both

Default safe pick for most enterprise .NET shops in 2026: text-embedding-3-large truncated to 1536-d via the dimensions parameter. It is cheap, well-supported, fast, and good enough until your eval says otherwise.

Dimensions and Matryoshka truncation

Full embedding (3072-d):  [v_0, v_1, v_2, ..., v_3071]
                          \__________________________/
                                    quality

Truncated (1024-d):       [v_0, v_1, v_2, ..., v_1023]   <-- still meaningful!
Truncated (512-d):        [v_0, v_1, v_2, ..., v_511]
Truncated (256-d):        [v_0, v_1, v_2, ..., v_255]

Matryoshka representation learning trains the model so that the first N dimensions are themselves a usable embedding. OpenAI's text-embedding-3-* and Nomic v1.5 support this; you pass dimensions=1024 and get a meaningful 1024-d vector with ~95% of the full-dim quality at a third of the storage and search cost.
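When a provider or local model only returns full-size vectors, the same truncation can be applied client-side: keep the first N components and rescale to unit length. The sketch below assumes the model was trained with a Matryoshka objective (truncating other models this way usually costs far more quality); `TruncateAndNormalize` is a hypothetical helper name, not part of any SDK.

```csharp
using System;

// Keep the first `dims` components and rescale to unit L2 length so that
// cosine and dot-product scores stay comparable after truncation.
float[] TruncateAndNormalize(ReadOnlySpan<float> full, int dims)
{
    if (dims <= 0 || dims > full.Length)
        throw new ArgumentOutOfRangeException(nameof(dims));

    var head = full[..dims].ToArray();

    double sumSquares = 0;
    foreach (var x in head) sumSquares += (double)x * x;
    var norm = Math.Sqrt(sumSquares);
    if (norm == 0) return head;            // all-zero prefix: nothing to rescale

    for (int i = 0; i < head.Length; i++)
        head[i] = (float)(head[i] / norm);
    return head;
}

// Usage: a toy 4-d vector cut down to 2-d; the result is roughly (0.6, 0.8).
var small = TruncateAndNormalize(new float[] { 3f, 4f, 12f, 0f }, 2);
Console.WriteLine(string.Join(" ", small));
```

Re-normalizing matters because a truncated prefix of a unit vector is no longer unit length, which would skew dot-product scores.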

Dim	Memory (1M docs)	Recall@10 (typical)	Use
256	1 GB	0.80-0.85	High-volume / cheap retrieval
512	2 GB	0.88-0.92	Default for "many tenants, modest corpora"
1024	4 GB	0.93-0.95	Default sweet spot
1536	6 GB	0.95-0.97	High-quality
3072	12 GB	0.96-0.97	Marginal gain over 1536 — usually not worth it
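The memory column follows from raw float32 storage at 4 bytes per dimension. As a rough check, assuming no index overhead (graph links, metadata) and no quantization:

```latex
\[
\text{bytes} = N_{\text{docs}} \times d \times 4
\qquad\Rightarrow\qquad
10^{6} \times 1024 \times 4 \approx 4.1 \times 10^{9}\ \text{bytes} \approx 4\ \text{GB}
\]
```

Real indexes will be larger than this floor; the formula is only useful for comparing dimensions against each other.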

Reading the MTEB leaderboard correctly

MTEB scores models across ~60 tasks: retrieval, classification, clustering, semantic textual similarity (STS), reranking, bitext mining. The headline number is the average across tasks, which is misleading because:

Your task is retrieval, not the average. Filter the leaderboard to retrieval tasks only.
Your domain is your domain. A model that wins on Wikipedia QA may lose on legal contracts.
Synthetic vs real queries. Many MTEB tasks are synthetic; your users are not.
Rapid SOTA churn. The leader changes monthly; the capable set changes yearly.

Use MTEB as a candidate list, not a verdict.

Domain evaluation: build the golden set

+--------------------------------+
| Golden set: 100 entries        |
|  query → list of relevant docs |
+--------------------------------+

Sources:
  - real user queries (logs, support tickets) ← best
  - SME-curated examples
  - hard cases (ambiguous, multi-hop, rare terms)
  - adversarial (typos, code-switching, jargon)

For each (query, relevant_doc_ids) you measure:

Metric	Definition	Use
Hit rate @ K	"Did any relevant doc appear in top-K?"	Sanity-check; binary success
Recall @ K	"What fraction of relevant docs are in top-K?"	Most important for RAG quality
MRR (Mean Reciprocal Rank)	"1 / rank of first relevant doc, averaged"	Rewards ranking the truth high
NDCG @ K	Normalized Discounted Cumulative Gain	Gold standard; rewards graded relevance
Precision @ K	"Fraction of top-K that are relevant"	Less useful when K is small and you have multiple gold docs

For RAG, recall@10 is usually the headline metric — your LLM downstream needs the relevant chunk in its context window.
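The per-query versions of the metrics in the table can be sketched in a few lines. This is a minimal illustration assuming binary relevance (a document is either gold or not); the function names are hypothetical, and graded NDCG would replace the gain of 1 with a relevance grade.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hit rate: did any relevant document appear in the top K?
bool HitAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k) =>
    ranked.Take(k).Any(relevant.Contains);

// Recall: fraction of relevant documents that appear in the top K.
double QueryRecallAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k) =>
    relevant.Count == 0 ? 0 : (double)ranked.Take(k).Count(relevant.Contains) / relevant.Count;

// Reciprocal rank of the first relevant document (0 if none was retrieved).
double ReciprocalRank(IReadOnlyList<string> ranked, ISet<string> relevant)
{
    for (int i = 0; i < ranked.Count; i++)
        if (relevant.Contains(ranked[i])) return 1.0 / (i + 1);
    return 0.0;
}

// NDCG@K with binary gains: DCG of the actual ranking divided by the DCG
// of an ideal ranking that places every relevant document first.
double NdcgAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k)
{
    double dcg = 0;
    for (int i = 0; i < Math.Min(k, ranked.Count); i++)
        if (relevant.Contains(ranked[i])) dcg += 1.0 / Math.Log2(i + 2);

    double ideal = 0;
    for (int i = 0; i < Math.Min(k, relevant.Count); i++)
        ideal += 1.0 / Math.Log2(i + 2);

    return ideal == 0 ? 0 : dcg / ideal;
}

// Usage: gold docs d2 and d4 are retrieved at ranks 2 and 4.
var rankedIds = new[] { "d7", "d2", "d9", "d4" };
var gold = new HashSet<string> { "d2", "d4" };
Console.WriteLine(ReciprocalRank(rankedIds, gold));        // 1/2
Console.WriteLine(QueryRecallAtK(rankedIds, gold, 3));     // 1 of 2 gold docs in top 3
Console.WriteLine($"{NdcgAtK(rankedIds, gold, 4):F3}");    // about 0.651
```

MRR is the mean of `ReciprocalRank` over all golden entries; the other metrics are averaged over queries the same way.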

Cost dimension

Hosted: OpenAI text-embedding-3-small ~ $0.00002 / 1k tokens; -large ~ $0.00013 / 1k tokens. Cohere embed-v4 and Voyage roughly comparable.
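Self-hosted: amortized cost is GPU-hours. A bge-large-en-v1.5 on an A10 does ~5k embeddings/sec; throughput depends heavily on sequence length and batch size, so measure it on your own chunks. Break-even vs hosted at ~50M-100M embeddings / month.
Re-embedding cost: switching models means re-embedding the whole corpus, because vectors from different models are not comparable. 100M docs × ~500 tokens × $0.00013/1k = $6.5k. Plan for it before model swaps.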
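The cost arithmetic above is easy to turn into a small estimator. This is a sketch under stated assumptions: the function names are hypothetical, and the 5M tokens-per-minute rate limit in the usage lines is an illustrative figure, not a quoted vendor limit.

```csharp
using System;

// Hosted re-embedding cost in USD: total tokens / 1000 * price per 1k tokens.
decimal ReembedCostUsd(long docs, int avgTokensPerDoc, decimal usdPer1kTokens) =>
    (decimal)docs * avgTokensPerDoc / 1000m * usdPer1kTokens;

// Wall-clock days needed at a sustained tokens-per-minute rate limit.
double ReembedDays(long docs, int avgTokensPerDoc, double tokensPerMinute) =>
    (double)docs * avgTokensPerDoc / tokensPerMinute / (60 * 24);

// 100M docs at ~500 tokens each, priced like text-embedding-3-large.
Console.WriteLine(ReembedCostUsd(100_000_000, 500, 0.00013m));        // 6,500 USD
Console.WriteLine($"{ReembedDays(100_000_000, 500, 5_000_000):F1}");  // about 6.9 days
```

Estimating days as well as dollars is useful because rate limits, not price, often decide how long a migration takes.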

Multilingual

multilingual-e5-large and bge-m3 are competitive open-weights for multilingual.
Cohere embed-multilingual-v3.0 is the hosted leader for many language pairs.
⚠️ Per-language quality varies. A model that scores 0.90 on English may score 0.65 on Tagalog. Build a per-language eval set if you serve >2 languages.
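Per-locale evaluation mostly comes down to tagging each golden entry with a locale and refusing to look only at the global mean. A minimal sketch, assuming per-query recall has already been computed; `MeanByLocale` and the sample numbers are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Average a per-query score by locale so one weak language cannot hide
// behind a strong global mean.
Dictionary<string, double> MeanByLocale(IEnumerable<(string Locale, double Recall)> perQuery) =>
    perQuery.GroupBy(r => r.Locale)
            .ToDictionary(g => g.Key, g => g.Average(r => r.Recall));

// Usage with made-up scores: English looks fine, Tagalog does not.
var scores = new (string Locale, double Recall)[]
{
    ("en", 1.0), ("en", 0.8), ("tl", 0.5), ("tl", 0.7),
};
foreach (var (locale, mean) in MeanByLocale(scores))
    Console.WriteLine($"{locale}: {mean:F2}");
```

The global mean of these four queries is 0.75, which would hide the gap between the 0.90 and 0.60 per-locale averages.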

How it works under the hood

Embedding models map text → fixed-length dense vectors such that semantically similar text → nearby vectors (cosine or dot-product close).

"How do I cancel my subscription?"
        │
        v
+----------------+
| Tokenizer      |  → token ids
+----------------+
        │
        v
+----------------+
| Transformer    |  → contextual hidden states
| (encoder-only) |    [seq_len × hidden_dim]
+----------------+
        │
        v
+----------------+
| Pooling        |  → single vector
| (mean / CLS)   |    [hidden_dim]
+----------------+
        │
        v
+----------------+
| Normalize      |  → unit vector for cosine
| (optional      |    [dim]
|  L2)           |
+----------------+

Most modern embedding models are encoder-only transformers (a few, such as e5-mistral-7b-instruct, adapt a decoder-only LLM) fine-tuned with contrastive learning: pairs of (query, positive doc) are pulled close in vector space; pairs of (query, hard negative) are pushed apart. Many models also distinguish queries from documents. E5 expects a literal text prefix ("query: ..." for queries, "passage: ..." for documents), while Voyage exposes the same idea through an input_type parameter in its API. Either way the setting changes the vector and must be applied consistently at index and query time.
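The pooling and normalization stages of the diagram can be sketched directly. This is an illustration of mean pooling over non-padding tokens followed by L2 normalization, assuming the transformer's hidden states are already available as a jagged array; the function name and input shapes are hypothetical, and real models may use CLS or last-token pooling instead.

```csharp
using System;

// Mean-pool token states into one vector, skipping padding positions,
// then L2-normalize so that dot product equals cosine similarity.
float[] MeanPoolAndNormalize(float[][] hiddenStates, bool[] attentionMask)
{
    int hiddenDim = hiddenStates[0].Length;
    var pooled = new float[hiddenDim];
    int kept = 0;

    for (int t = 0; t < hiddenStates.Length; t++)
    {
        if (!attentionMask[t]) continue;              // padding token: ignore
        for (int d = 0; d < hiddenDim; d++) pooled[d] += hiddenStates[t][d];
        kept++;
    }
    if (kept == 0) return pooled;                     // nothing to pool

    double sumSquares = 0;
    for (int d = 0; d < hiddenDim; d++)
    {
        pooled[d] /= kept;                            // mean over real tokens
        sumSquares += (double)pooled[d] * pooled[d];
    }

    var norm = (float)Math.Sqrt(sumSquares);
    if (norm > 0)
        for (int d = 0; d < hiddenDim; d++) pooled[d] /= norm;
    return pooled;
}

// Usage: two real tokens and one padding token; the mean (2, 0) normalizes to (1, 0).
var states = new[] { new[] { 1f, 0f }, new[] { 3f, 0f }, new[] { 9f, 9f } };
var vec = MeanPoolAndNormalize(states, new[] { true, true, false });
Console.WriteLine(string.Join(" ", vec));
```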

Code: correct vs wrong

✅ Correct: A/B eval harness in .NET

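using System.Numerics.Tensors;   // TensorPrimitives (System.Numerics.Tensors package)
using Microsoft.Extensions.AI;
using OpenAI;

// Compare two candidates head-to-head on the same corpus and golden set.
// `key`, `corpus`, and `golden` are assumed to be loaded elsewhere.
var openAi  = new OpenAIClient(key).GetEmbeddingClient("text-embedding-3-small")
                                   .AsIEmbeddingGenerator();
var voyage  = /* Voyage SDK adapter */;

var rOpenAi = await RecallAtK(openAi, corpus, golden);
var rVoyage = await RecallAtK(voyage, corpus, golden);

Console.WriteLine($"OpenAI-3-small recall@10: {rOpenAi:P2}");
Console.WriteLine($"Voyage-3-large  recall@10: {rVoyage:P2}");

// Cosine similarity between two embedding vectors.
float Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b) =>
    TensorPrimitives.CosineSimilarity(a, b);

// Micro-averaged recall@K: relevant docs found in the top K, pooled over all queries.
async Task<double> RecallAtK(
    IEmbeddingGenerator<string, Embedding<float>> emb,
    IReadOnlyList<(string DocId, string Text)> corpus,
    IReadOnlyList<GoldenEntry> golden,
    int k = 10)
{
    // Pre-embed the corpus once per model (batch this call for large corpora).
    var corpusEmb = (await emb.GenerateAsync(corpus.Select(c => c.Text)))
        .Zip(corpus, (e, c) => (c.DocId, Vec: e.Vector))
        .ToList();

    int totalRelevant = 0, foundRelevant = 0;
    foreach (var entry in golden)
    {
        var q = (await emb.GenerateAsync([entry.Query]))[0].Vector;

        // Brute-force nearest neighbours: fine for an eval-sized corpus.
        var top = corpusEmb
            .Select(c => (c.DocId, Sim: Cosine(q.Span, c.Vec.Span)))
            .OrderByDescending(x => x.Sim)
            .Take(k)
            .Select(x => x.DocId)
            .ToHashSet();

        totalRelevant  += entry.RelevantDocIds.Length;
        foundRelevant  += entry.RelevantDocIds.Count(top.Contains);
    }
    // Guard against an empty golden set.
    return totalRelevant == 0 ? 0 : (double)foundRelevant / totalRelevant;
}

// Type declarations must come after top-level statements in a C# program.
record GoldenEntry(string Query, string[] RelevantDocIds);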

❌ Wrong: comparing models on a synthetic eval set you generated from one of them

// You asked GPT-4o to write 100 queries about your corpus.
// You're now using those queries to grade embedding models. Bias incoming.
var synthetic = await llm.GenerateQueriesAsync(corpus);
var recall = await RecallAtK(openAi, corpus, synthetic);

Synthetic queries inherit the style and vocabulary of the generator. They drift from real users and inflate the score of embedding models trained on a similar distribution.

✅ Correct: instruction prefixes for E5 / Voyage

// E5-style models REQUIRE different text prefixes for queries vs passages.
// (Voyage takes the same hint through its API's input_type parameter instead.)
string FormatForIndex(string text) => $"passage: {text}";
string FormatForQuery(string text) => $"query: {text}";

var docVecs = await emb.GenerateAsync(corpus.Select(c => FormatForIndex(c.Text)));
var qVec    = (await emb.GenerateAsync([FormatForQuery(userQuery)]))[0].Vector;

❌ Wrong: no prefix (or the same prefix) on queries and passages

// Silent recall drop of 10-20%: the model never sees the query/passage
// distinction it was trained on.
var qVec = (await emb.GenerateAsync([userQuery]))[0].Vector;
var docVecs = await emb.GenerateAsync(corpus.Select(c => c.Text));

✅ Correct: Matryoshka truncation via OpenAI `dimensions`

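var emb = new OpenAIClient(key)
    .GetEmbeddingClient("text-embedding-3-large")
    .AsIEmbeddingGenerator();

// Request 1024-d vectors instead of the full 3072-d, with minimal quality loss.
// `texts` is the IEnumerable<string> you want to embed.
var options = new EmbeddingGenerationOptions { Dimensions = 1024 };
var vectors = await emb.GenerateAsync(texts, options);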

Design patterns for this topic

Pattern 1 — "Golden-set bake-off"

Intent: pick the model from data, not vibes.
Tactics: 100-entry golden set; measure recall@10, MRR, NDCG@10 across 3-5 candidates; pick winner; re-run when models drop.

Pattern 2 — "Truncate first, scale later"

Intent: spend storage where it matters.
Tactics: start with text-embedding-3-large truncated to 1024-d. If recall drops below target on your eval, raise to 1536 or 3072 before swapping models.

Pattern 3 — "Versioned indexes for migration"

Intent: swap embedding model with zero downtime.
Tactics: see Vector DB Tuning. Build docs_v2 next to docs_v1; backfill; flip reads atomically.

Pattern 4 — "Per-domain models for mixed corpora"

Intent: code search and contract search shouldn't share a model.
Tactics: separate index per domain; route at query time. voyage-code-3 for code, voyage-3-large for prose.

Pattern 5 — "Self-host break-even calculator"

Intent: decide hosted vs self-hosted with math.
Tactics: monthly token volume × hosted price vs amortized GPU cost (A10 / L4 instance + ops). Switch when self-hosted is <60% of hosted line item including ops headcount.
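The break-even rule can be written down as a tiny calculator. This is a sketch only: the function names are hypothetical, and the GPU and ops figures in the usage lines are placeholder numbers to be replaced with your own quotes.

```csharp
using System;

// Monthly hosted bill in USD for a given embedding volume.
decimal HostedMonthlyUsd(long embeddingsPerMonth, int avgTokens, decimal usdPer1kTokens) =>
    (decimal)embeddingsPerMonth * avgTokens / 1000m * usdPer1kTokens;

// The 60% rule: self-host only when GPU plus ops cost is below
// 60% of the hosted line item.
bool ShouldSelfHost(decimal hostedUsd, decimal gpuUsd, decimal opsUsd) =>
    gpuUsd + opsUsd < 0.60m * hostedUsd;

// 100M embeddings/month at ~500 tokens, text-embedding-3-large pricing: 6,500 USD hosted.
var hostedHigh = HostedMonthlyUsd(100_000_000, 500, 0.00013m);
Console.WriteLine(ShouldSelfHost(hostedHigh, gpuUsd: 800m, opsUsd: 2_000m));   // True

// At 50M embeddings/month the hosted bill halves and the same GPU setup no longer wins.
var hostedLow = HostedMonthlyUsd(50_000_000, 500, 0.00013m);
Console.WriteLine(ShouldSelfHost(hostedLow, gpuUsd: 800m, opsUsd: 2_000m));    // False
```

Including ops cost in the comparison is the point of the pattern: GPU rental alone almost always looks cheaper than it is.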

Pros & cons / trade-offs

Aspect	Hosted (OpenAI / Cohere / Voyage)	Open-weights / self-hosted (E5, BGE, MxBai)
Quality ceiling	Top-of-MTEB on average	Competitive on domain after fine-tune
Latency	Network RTT + provider	Co-located GPU; <10 ms
Cost at scale	Linear with tokens	Amortized GPU; cheap at >50M embeddings/mo
Data residency	Sent to vendor	On-prem / VPC
Maintenance	None	Container, GPU pool, monitoring
Customization	None	Domain fine-tuning possible

When to use / when to avoid

Re-evaluate / swap models when:
- Your eval set's recall@10 has plateaued or regressed and a newer model has appeared.
- Your domain shifted (added languages, added code, added images).
- The hosted vendor deprecated your current model.

Don't re-evaluate when:
- You don't have a golden set yet; build that first.
- The retrieval bottleneck is upstream (chunking, query rewriting); an embedding swap won't fix it.
- Re-embedding cost > expected quality gain × user value. Do the math.

Interview Q&A

1. Walk me through how you'd pick an embedding model for a new RAG project. Build a 50-100 entry golden set from real user queries where possible (synthesized ones only as a supplement), with SME-labelled gold doc IDs. Pick a candidate list (OpenAI 3-large, Voyage 3-large, Cohere embed-v4, plus one open-weights model like bge-m3 if data residency matters). For each candidate, embed the corpus and the queries, then compute recall@10 and NDCG@10. Pick the winner. Document cost-per-query and latency next to quality so the trade-off is explicit. Re-run quarterly or whenever a major model drops.

2. Why is "top of MTEB" not the right answer? MTEB averages across ~60 tasks of varying relevance to your problem. Your task is retrieval on your corpus, in your language, on your style of query. A model that wins overall MTEB may lose on legal contracts or rare-language support. MTEB is a candidate-list filter, not a verdict — always validate on a domain golden set.

3. What is Matryoshka representation learning and why does it matter? Models trained with Matryoshka loss produce embeddings where the first N dimensions are themselves a usable embedding. OpenAI's text-embedding-3-* family supports this: ask for dimensions=1024 instead of the full 3072 and you get a meaningful 1024-d vector with ~95% of full-dim quality at a third of the storage and search cost. It is the cheapest quality knob in RAG.

4. The instruction prefix trap: what is it? Many retrieval-tuned embedding models expect queries and documents to be marked differently. E5 uses literal "query: ..." and "passage: ..." prefixes; Voyage takes an input_type parameter in its API. If you use the same marking (or none) on both sides, the query and document vectors live in slightly misaligned subspaces and recall silently drops 10-20%. Always check the model card and apply the markers consistently at index and query time.

5. Recall@K vs MRR vs NDCG — when each? Recall@K is the bluntest and most useful for RAG: "did I get the relevant doc into the LLM's context?". MRR rewards ranking the truth high — useful when you only show a top-1 result. NDCG@K is the gold standard when you have graded relevance ("highly relevant" vs "marginally relevant") because it weights position and grade together. For RAG with binary relevance, recall@10 and MRR are sufficient.

6. How big should a golden set be? 50 to start, 100-200 for confidence, 500+ for stable rankings of close-quality models. Stratified across feature areas, languages, and difficulty. Quality matters more than size — a 50-entry set hand-curated by SMEs beats a 500-entry synthetic set every time.

7. What is the cost of switching embedding models on a 50M-doc corpus? Two costs. (1) Re-embedding: 50M docs × ~500 tokens × $0.0001/1k tokens = ~$2.5k for OpenAI tier; days of API time at sane rate limits. (2) Re-indexing: storage cost for v2 alongside v1 during migration; engineering time for the dual-write / dual-read pattern. Plan for both before signing off on the swap.

8. When does fine-tuning an embedding model pay off? When you have (a) a domain whose vocabulary the base models under-fit (legal, biomedical, deep technical), (b) labelled positive/negative pairs from real usage, and (c) volume to amortize the engineering cost. For most enterprise .NET shops the answer is: don't fine-tune; pick a stronger base model or a domain-pretrained variant (voyage-finance-2, voyage-law-2) instead.

9. What is the "hit rate" metric and why is it weak? Hit rate @ K is binary: did any gold doc appear in top-K? It's easy to compute and gives a quick sanity check. It is weak because it ignores rank position (a doc at rank 1 vs rank 10 looks the same) and ignores the case where you have multiple gold docs and only got one. Use it for smoke tests; promote to recall / MRR / NDCG for real comparisons.

10. Why does multilingual matter even for English-first products? Real users code-switch, paste foreign-language quotes, and submit names in non-Latin scripts. An English-only model degrades on these queries. If even 5% of your traffic is non-English, run a multilingual model (or detect language and route) and evaluate per-locale. The cost difference between text-embedding-3-small and multilingual-e5-large is negligible vs the support burden of "search doesn't work for our European tenants".

11. How do you detect that your embedding model has degraded after deployment? Three signals. (1) Eval drift: re-run the golden set monthly; alert on recall@10 regression. (2) User signals: thumbs-down rate, regenerate-rate, time-to-accept. (3) Distribution shift: track the cosine-similarity distribution of top-1 results — if it shifts, your embedding space changed (silent vendor model upgrade, or input distribution shift).

12. Open-weights vs hosted: what is the senior take? Hosted is the default in 2026 for time-to-market. Move to open-weights when (a) data residency forbids egress, (b) per-token cost dominates your bill, (c) you need fine-tuning, or (d) you want predictable behaviour across vendor model upgrades. Don't underestimate ops cost: a self-hosted A10 GPU embedding service is one more thing to monitor, scale, and patch.
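The golden-set sizes quoted in question 6 can be motivated by a rough sampling-error estimate. The block below treats per-query hit@K as a Bernoulli outcome with true rate p, which is a simplifying assumption (recall with several gold docs per query is not strictly binomial), and evaluates the 95% interval half-width at p = 0.90.

```latex
\[
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}},
\qquad
\text{95\% interval} \approx \hat{p} \pm 1.96\,\mathrm{SE}(\hat{p})
\]
\[
p = 0.90:\quad
n = 50 \Rightarrow \pm 0.083,\qquad
n = 200 \Rightarrow \pm 0.042,\qquad
n = 500 \Rightarrow \pm 0.026
\]
```

Under these assumptions, two models that differ by 2-3 points of recall cannot be reliably separated with 50 entries, which is consistent with needing 500+ entries to rank close-quality models.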

Gotchas / common mistakes

⚠️ Trusting MTEB blindly. Top average ≠ winner on your retrieval task.
⚠️ Evaluating on synthetic queries the same model wrote. Confirmation bias.
⚠️ Forgetting instruction prefixes. Silent 10-20% recall drop on E5 / Voyage.
⚠️ Using text-embedding-ada-002 in 2026. It is a legacy model superseded by the text-embedding-3 family; replace with text-embedding-3-small or -large.
⚠️ Different models at index and query time. Vectors live in different spaces; recall collapses.
⚠️ Not re-evaluating after a vendor model upgrade. Hosted models change behind moving aliases. Pin to dated snapshots when available.
⚠️ Picking 3072-d "because more is better". Marginal recall gain over 1024-d, 3× storage and search cost.
⚠️ Skipping per-locale eval. A model that scores 0.90 in English may score 0.65 in Polish.
⚠️ No re-embedding budget. Model swap on a 100M-doc corpus is a multi-thousand-dollar line item; plan it.
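⚠️ Mixing distance functions. Index with the metric the model was trained for. Cosine and L2 rank results identically only when vectors are unit-normalized; on unnormalized vectors the mismatch degrades recall with no clear failure signal.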
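Why normalization decides whether the distance-function mismatch matters can be seen from a short identity. For unit vectors, squared L2 distance is a monotone function of cosine similarity, so both produce the same ranking:

```latex
\[
\lVert a - b \rVert^{2}
= \lVert a \rVert^{2} + \lVert b \rVert^{2} - 2\,a \cdot b
= 2 - 2\cos(a, b)
\qquad \text{when } \lVert a \rVert = \lVert b \rVert = 1
\]
```

When the norms differ from 1, the first two terms vary per document and the rankings can diverge.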
