Hybrid Search & Reranking

Key Points

Hybrid search = vector + keyword (BM25). Better than either alone for most queries.
Reciprocal Rank Fusion (RRF) is the standard fusion algorithm.
Reranking: take top-K from initial search; rerank with cross-encoder. Higher quality top-3.
Cohere Rerank / Jina Reranker / open-source cross-encoders.
Hybrid + reranker is a strong default for RAG retrieval quality.

Why hybrid

Vector search excels at:
- Semantic similarity ("dog" matches "puppy").
- Multi-language.

Keyword search excels at:
- Exact terms ("OAuth2", "PCI-DSS").
- Acronyms.
- Numbers / IDs.

Together: complementary.

Reciprocal Rank Fusion

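RRF score(d) = Σ_i 1 / (k + rank_i(d))

For each ranking system i, take 1/(k + rank), where rank_i(d) is the 1-based position of document d in that system's results. A document missing from a list contributes nothing for that list. Sum across systems. Higher = better.

k is a constant (60 typical). Because only ranks are used, the result is insensitive to each system's absolute score scale.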
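The formula can be sketched as a small standalone program. This is a minimal illustration rather than any library's implementation: `RrfFuse` and the two ID lists are made up for the example, and ties are broken alphabetically only to keep the output deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Fuse any number of ranked ID lists with Reciprocal Rank Fusion.
// Ranks are 1-based, so the first item in a list contributes 1 / (k + 1).
static List<(string Id, double Score)> RrfFuse(
    IEnumerable<IReadOnlyList<string>> rankings, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var ranking in rankings)
    {
        for (var i = 0; i < ranking.Count; i++)
        {
            var rank = i + 1; // convert 0-based index to 1-based rank
            scores[ranking[i]] = scores.GetValueOrDefault(ranking[i]) + 1.0 / (k + rank);
        }
    }
    return scores
        .OrderByDescending(kv => kv.Value)
        .ThenBy(kv => kv.Key, StringComparer.Ordinal) // deterministic tie-break (an assumption)
        .Select(kv => (Id: kv.Key, Score: kv.Value))
        .ToList();
}

// Hypothetical result lists: "a" tops the vector list but is absent from the keyword list.
var vectorIds = new[] { "a", "b", "c" };
var keywordIds = new[] { "c", "b", "d" };

foreach (var (id, score) in RrfFuse(new[] { vectorIds, keywordIds }))
    Console.WriteLine($"{id}: {score:F5}");
```

In this data, "b" and "c" appear in both lists and therefore outrank "a", even though "a" is first in the vector list: agreement across lists outweighs a single top rank.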

Code (Azure AI Search hybrid)

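// Assumes `collection` implements IKeywordHybridSearchable<MyDoc> (Microsoft.Extensions.VectorData).
// Signatures have changed between package versions, so check the one you target.
var keywords = userQuery.Split(' ', StringSplitOptions.RemoveEmptyEntries);
await foreach (var r in collection.HybridSearchAsync(queryEmb, keywords, top: 50))
{
    /* r.Record, r.Score */
}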

Fusion runs inside the service, so there is no client-side merge to maintain. Prefer this when the store supports it.

Manual hybrid

For stores without native hybrid search:

// ToListAsync() comes from System.Linq.Async; keywordSearch is whatever BM25 index you run.
var vectorResults = await collection.SearchAsync(queryEmb, top: 50).ToListAsync();
var keywordResults = await keywordSearch.QueryAsync(userQuery, top: 50);

// RRF fusion. Ranks are 1-based, so add 1 to the 0-based index.
const int k = 60;
var fused = new Dictionary<string, double>();
foreach (var (r, i) in vectorResults.Select((r, i) => (r, i)))
    fused[r.Record.Id] = fused.GetValueOrDefault(r.Record.Id) + 1.0 / (k + i + 1);
foreach (var (r, i) in keywordResults.Select((r, i) => (r, i)))
    fused[r.Id] = fused.GetValueOrDefault(r.Id) + 1.0 / (k + i + 1);

// Keep the 10 highest fused scores.
var top = fused.OrderByDescending(kv => kv.Value).Take(10);

Reranking

After initial retrieval:

var topK = await collection.SearchAsync(queryEmb, top: 50).ToListAsync();
// _reranker is a placeholder for your reranker client; it returns the 5 best passages.
var reranked = await _reranker.RerankAsync(userQuery, topK.Select(r => r.Record.Text).ToArray(), top: 5);

Reranker is a cross-encoder model: takes (query, document) pair and outputs a relevance score. More accurate than cosine similarity but slower.
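To make the (query, document) scoring shape concrete, here is a minimal sketch. `Rerank` and `TermOverlap` are hypothetical names, and the term-overlap scorer is a toy stand-in, not a real cross-encoder; a real model would replace `scorePair` with a transformer forward pass per pair, which is where the extra latency comes from.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Score every (query, document) pair with the supplied scorer, then keep the best topN.
// `scorePair` stands in for a cross-encoder model.
static List<(int Index, double Score)> Rerank(
    string query, IReadOnlyList<string> documents,
    Func<string, string, double> scorePair, int topN)
{
    return documents
        .Select((doc, index) => (Index: index, Score: scorePair(query, doc)))
        .OrderByDescending(r => r.Score) // stable sort: ties keep first-stage order
        .Take(topN)
        .ToList();
}

// Toy scorer (NOT a real model): fraction of distinct query terms found in the document.
static double TermOverlap(string query, string document)
{
    var separators = new[] { ' ', ',', '.', '?' };
    var queryTerms = query.ToLowerInvariant()
        .Split(separators, StringSplitOptions.RemoveEmptyEntries).Distinct().ToArray();
    var docTerms = document.ToLowerInvariant()
        .Split(separators, StringSplitOptions.RemoveEmptyEntries).ToHashSet();
    if (queryTerms.Length == 0) return 0;
    return (double)queryTerms.Count(t => docTerms.Contains(t)) / queryTerms.Length;
}

var docs = new[]
{
    "Rotate API keys every 90 days.",
    "OAuth2 token refresh uses the refresh_token grant.",
    "Our office is closed on public holidays.",
};
foreach (var (index, score) in Rerank("how does OAuth2 token refresh work", docs, TermOverlap, topN: 2))
    Console.WriteLine($"{score:F2}  {docs[index]}");
```

Returning indexes rather than texts mirrors how hosted rerankers usually respond, so the caller can map results back to the original candidate records.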

Reranker providers

Provider	Notes
Cohere Rerank (v3, e.g. rerank-multilingual-v3.0)	Best quality; commercial
Jina Reranker	Strong; commercial
BGE-Reranker (open)	Self-host; HuggingFace
Sentence-Transformers cross-encoders	Self-host

Cohere Rerank example

var cohere = new CohereClient(apiKey);
var resp = await cohere.RerankAsync(new RerankRequest
{
    Query = userQuery,
    Documents = candidates.Select(c => c.Text).ToList(),
    TopN = 5,
    Model = "rerank-multilingual-v3.0"
});

var topReranked = resp.Results.Select(r => candidates[r.Index]);
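If you call the REST endpoint directly rather than through a client library, the request and response handling might look like the sketch below. The JSON field names (`model`, `query`, `documents`, `top_n`, `results`, `index`, `relevance_score`) follow Cohere's rerank API as I understand it; verify them against the current API reference. `BuildRerankBody`, `ParseRerankResults`, and the sample response are hypothetical and hand-written for illustration, and the HTTP call itself is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Build the JSON body for a rerank request.
static string BuildRerankBody(string model, string query, IEnumerable<string> documents, int topN) =>
    JsonSerializer.Serialize(new Dictionary<string, object>
    {
        ["model"] = model,
        ["query"] = query,
        ["documents"] = documents.ToArray(),
        ["top_n"] = topN,
    });

// Read (index, relevance_score) pairs out of a rerank response.
static List<(int Index, double Score)> ParseRerankResults(string json)
{
    using var doc = JsonDocument.Parse(json);
    return doc.RootElement.GetProperty("results")
        .EnumerateArray()
        .Select(r => (Index: r.GetProperty("index").GetInt32(),
                      Score: r.GetProperty("relevance_score").GetDouble()))
        .ToList(); // materialize before the JsonDocument is disposed
}

// Hand-written sample response for illustration only; not captured from a real call.
const string sampleResponse =
    "{\"results\":[{\"index\":2,\"relevance_score\":0.91},{\"index\":0,\"relevance_score\":0.12}]}";

var candidates = new[] { "chunk A", "chunk B", "chunk C" };
var body = BuildRerankBody("rerank-multilingual-v3.0", "query text", candidates, topN: 2);
var ordered = ParseRerankResults(sampleResponse).Select(r => candidates[r.Index]).ToList();
Console.WriteLine(body);
Console.WriteLine(string.Join(" | ", ordered));
```

Each result's `index` points back into the `documents` array that was sent, which is why the candidate list must be kept in the same order until the response is mapped.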

Pipeline

User query
   │
   ├─→ Embed → vector search (top 50)
   ├─→ Keyword search (top 50)
   │     ↓
   │   RRF fuse
   │     ↓
   │  Top 50 candidates
   │     ↓
   │   Reranker → top 5
   │     ↓
   │  Build prompt; call LLM
   ▼
Answer
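The two retrieval arms in the diagram do not depend on each other, so they can run concurrently before fusion. A minimal sketch, assuming stub retrievers with made-up document IDs and delays (`VectorSearchStub`, `KeywordSearchStub`, and `RetrieveAsync` are hypothetical names); the reranker and LLM steps are omitted:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Stub retrievers standing in for real vector and keyword searches (hypothetical data).
static async Task<List<string>> VectorSearchStub(string query)
{
    await Task.Delay(15);
    return new List<string> { "doc-3", "doc-1", "doc-7" };
}

static async Task<List<string>> KeywordSearchStub(string query)
{
    await Task.Delay(10);
    return new List<string> { "doc-1", "doc-9", "doc-3" };
}

static async Task<List<string>> RetrieveAsync(string query, int candidates, int k = 60)
{
    // Start both arms before awaiting either so they run concurrently.
    var vectorTask = VectorSearchStub(query);
    var keywordTask = KeywordSearchStub(query);
    var lists = await Task.WhenAll(vectorTask, keywordTask);

    // RRF with 1-based ranks.
    var fused = new Dictionary<string, double>();
    foreach (var list in lists)
        for (var i = 0; i < list.Count; i++)
            fused[list[i]] = fused.GetValueOrDefault(list[i]) + 1.0 / (k + i + 1);

    return fused
        .OrderByDescending(kv => kv.Value)
        .ThenBy(kv => kv.Key, StringComparer.Ordinal)
        .Take(candidates)
        .Select(kv => kv.Key)
        .ToList();
}

var fusedIds = await RetrieveAsync("example query", candidates: 3);
Console.WriteLine(string.Join(", ", fusedIds));
```

The fused candidate list is what would be handed to the reranker in the next stage.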

Cost

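Query embedding: cheap (one short text per query; documents were embedded at index time).
Vector search: cheap.
Keyword search: cheap.
Reranker: one call per query; roughly $0.001-0.002/query at Cohere's per-search list pricing (check current rates).
LLM: usually the dominant cost.

A reranker typically buys a large quality gain for a small added cost per query.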

Quality measurement

Use Ragas / your eval set:
- Recall@K (with vs without reranker).
- Faithfulness.
- Answer relevance.

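Typical improvement: +10-20% accuracy with a reranker. Treat this as a rule of thumb; the lift depends on the corpus and the eval set, so measure it on your own data.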
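The retrieval-side metrics are simple enough to compute yourself before reaching for a framework. A minimal sketch with one hypothetical eval row (`RecallAtK` and `ReciprocalRank` are illustrative helpers; faithfulness and answer relevance need an LLM-based evaluator such as Ragas and are not covered here):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Recall@K: share of the relevant IDs that appear in the first k results.
static double RecallAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k) =>
    relevant.Count == 0 ? 0 : (double)ranked.Take(k).Count(id => relevant.Contains(id)) / relevant.Count;

// Reciprocal rank: 1 / position of the first relevant result, or 0 if none is found.
static double ReciprocalRank(IReadOnlyList<string> ranked, ISet<string> relevant)
{
    for (var i = 0; i < ranked.Count; i++)
        if (relevant.Contains(ranked[i])) return 1.0 / (i + 1);
    return 0;
}

// Hypothetical eval row: the same five candidates before and after reranking.
var relevant = new HashSet<string> { "d2", "d5" };
var beforeRerank = new[] { "d1", "d3", "d2", "d4", "d5" };
var afterRerank = new[] { "d2", "d5", "d1", "d3", "d4" };

Console.WriteLine($"recall@3 before: {RecallAtK(beforeRerank, relevant, 3):F2}, after: {RecallAtK(afterRerank, relevant, 3):F2}");
Console.WriteLine($"recall@5 before: {RecallAtK(beforeRerank, relevant, 5):F2}, after: {RecallAtK(afterRerank, relevant, 5):F2}");
Console.WriteLine($"RR before: {ReciprocalRank(beforeRerank, relevant):F2}, after: {ReciprocalRank(afterRerank, relevant):F2}");
```

Recall at the full candidate depth is identical before and after, because a reranker only reorders. Compare at the final K that actually reaches the prompt, and average the reciprocal rank across queries to get MRR.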

Sparse + dense

Beyond BM25, learned sparse retrieval (SPLADE, etc.) can improve the keyword arm.

For most apps, BM25 + dense + reranker is enough.

Tuning

Top-50 → top-5: typical ratio. Adjust based on quality.
k in RRF: 60 default; results are rarely sensitive to it.
Reranker model: try multiple.

Latency budget breakdown

A useful mental model for a sub-second RAG response time:

+----------------------------+--------+------------------------------+
| Stage                      | Budget | Notes                        |
+----------------------------+--------+------------------------------+
| Embed user query           |  20 ms | text-embedding-3-small       |
| Vector search (HNSW, k=50) |  15 ms | warm index, replicas=2       |
| BM25 search (top 50)       |  10 ms | parallel with vector         |
| RRF fuse                   |   1 ms | in-memory                    |
| Reranker (top-50 -> top-5) | 120 ms | Cohere v3 hosted             |
| Build prompt               |  10 ms | template + token counting    |
| LLM first-token            | 400 ms | gpt-4o-mini, streamed        |
+----------------------------+--------+------------------------------+
| Retrieval subtotal         | 166 ms | serial sum; ~156 if parallel |
| End-to-end first token     | ~580ms |                              |
+----------------------------+--------+------------------------------+

If you blow this budget, the reranker is almost always the long pole within retrieval. Shrink the candidate set, send all candidates in a single batched reranker call, or move to a self-hosted BGE reranker on a co-located GPU.
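The arithmetic behind the subtotals is worth writing down, because the table adds every stage serially even though BM25 can overlap with the embed and vector-search steps. A small sketch using the stage budgets copied from the table above (the variable names are illustrative):

```csharp
using System;

// Stage budgets in milliseconds, copied from the table above.
const int embed = 20, vector = 15, bm25 = 10, fuse = 1, rerank = 120, prompt = 10, llmFirstToken = 400;

// Serial sum: every stage waits for the previous one.
var serialRetrieval = embed + vector + bm25 + fuse + rerank;

// Critical path when BM25 starts at t=0 alongside embed -> vector search.
var parallelRetrieval = Math.Max(embed + vector, bm25) + fuse + rerank;

Console.WriteLine($"retrieval: serial {serialRetrieval} ms, parallel {parallelRetrieval} ms");
Console.WriteLine($"first token: serial {serialRetrieval + prompt + llmFirstToken} ms, " +
                  $"parallel {parallelRetrieval + prompt + llmFirstToken} ms");
```

Either way the reranker accounts for most of the retrieval time, which is why it is the first stage to trim.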

Pros & cons

Aspect	Verdict
Recall@K vs vector-only	+10-20% on most enterprise corpora
Top-3 precision after reranker	+15-30% — biggest LLM-quality lever
Implementation complexity	Low if the store implements `IKeywordHybridSearchable<T>`
Latency	+30-150 ms (reranker dominates); can blow a sub-second SLA
Cost	Reranker is on the order of $0.001-0.002/query for a hosted API (check current pricing); the LLM usually still dominates
Predictability	Behaviour shifts when you change embedding or rerank model — pin versions

When to use / when to avoid

Use when:
- Corpus mixes prose and exact tokens (IDs, codes, names, statutes).
- The LLM downstream is expensive and benefits from tighter top-K.
- You have a labeled eval set to measure the lift.
- Latency budget allows ~200 ms of retrieval headroom.

Avoid when:
- Hard real-time SLA (< 300 ms end-to-end) and reranker latency eats it.
- Corpus is tiny (< 5k chunks): pure vector or even brute force is fine.
- You have no offline eval: you cannot tune what you cannot measure.
- Queries are ID lookups or structured filters: keyword/filter alone wins.

Interview Q&A

1. Why RRF instead of a convex combination of normalized scores? RRF (1/(k+rank)) is rank-based, so it ignores the absolute score distributions of BM25 (unbounded) and of cosine similarity as Azure AI Search reports it (0.333-1.00). Convex combinations need per-corpus calibration and break the moment one arm's score distribution shifts (model swap, index growth). RRF is parameter-light (k=60 is robust), well-studied, and what Azure AI Search and Elastic both ship. The tradeoff is that you lose the magnitude signal: a top-1 with score 0.99 looks the same to RRF as a top-1 with score 0.51.
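For contrast, this is roughly what the score-based alternative looks like. It is a sketch under stated assumptions: min-max normalization is one of several possible choices, `alpha` is the weight you would have to calibrate per corpus, and `MinMax`, `ConvexFuse`, and the raw scores are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Min-max normalize raw scores to [0, 1]; a constant list maps to all 1.0.
// Throws on an empty dictionary, so guard for that in real code.
static Dictionary<string, double> MinMax(IReadOnlyDictionary<string, double> scores)
{
    var min = scores.Values.Min();
    var max = scores.Values.Max();
    return scores.ToDictionary(kv => kv.Key, kv => max > min ? (kv.Value - min) / (max - min) : 1.0);
}

// Convex combination: alpha * vector + (1 - alpha) * keyword; a missing document counts as 0.
static List<(string Id, double Score)> ConvexFuse(
    IReadOnlyDictionary<string, double> vectorScores,
    IReadOnlyDictionary<string, double> keywordScores, double alpha)
{
    var v = MinMax(vectorScores);
    var kw = MinMax(keywordScores);
    return v.Keys.Union(kw.Keys)
        .Select(id => (Id: id, Score: alpha * v.GetValueOrDefault(id) + (1 - alpha) * kw.GetValueOrDefault(id)))
        .OrderByDescending(r => r.Score)
        .ToList();
}

// Hypothetical raw scores: cosine-style on one side, unbounded BM25 on the other.
var vectorScores = new Dictionary<string, double> { ["a"] = 0.92, ["b"] = 0.90, ["c"] = 0.60 };
var bm25Scores = new Dictionary<string, double> { ["b"] = 14.0, ["c"] = 9.0, ["d"] = 4.0 };

foreach (var (id, score) in ConvexFuse(vectorScores, bm25Scores, alpha: 0.5))
    Console.WriteLine($"{id}: {score:F3}");
```

Note how much depends on the minimum and maximum of each list: one outlier score in either arm rescales every other document, which is the calibration fragility that RRF avoids.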

2. Cohere rerank-multilingual-v3.0 vs BGE-Reranker: when each? Cohere v3 is the quality leader and the easiest to integrate (REST, per-search pricing on the order of $0.002/query at list price, no infra). BGE rerankers (bge-reranker-v2-m3, bge-reranker-large) are open-weights and self-hostable on a single GPU, so they win when data residency forbids egress, when query volume makes per-call pricing painful, or when you want to fine-tune on domain pairs. Quality gap on English is small; on long-tail languages Cohere usually still leads.

3. What latency budget should the reranker get? Rerankers are cross-encoders: they run the full transformer per (query, doc) pair. Cohere's hosted endpoint is ~80-150 ms for top-50; self-hosted BGE is 30-300 ms depending on GPU and batch size. Budget ~150 ms for it and overlap it with whatever prompt preparation does not depend on the final ranking. If your end-to-end SLA is sub-300 ms, drop to top-10 candidates or skip the reranker.

4. When does reranking actively hurt quality? Three cases I have seen ship bugs. (a) When first-stage (L1) recall is already poor: the reranker only reorders what L1 found; garbage in, garbage out. (b) When documents are short or templated and the cross-encoder latches on to boilerplate. (c) When your eval set is biased toward the same model family the reranker was trained on, you measure a fake lift that disappears in production.

5. How do you eval-tune a hybrid + rerank pipeline? Build a "golden" set of ~500 query/document-id pairs from real usage plus expert curation. Track recall@50 (L1 quality), recall@K-final, MRR, and nDCG. Sweep three knobs in this order: chunk size, hybrid mix (vector-only / BM25-only / hybrid), reranker on/off and model. Lock the embedding model first — changing it invalidates everything else. Use Ragas for end-to-end faithfulness once retrieval is stable.

6. Why is my RRF top result strictly worse than my vector top result? Because RRF rewards documents that appear in both lists with reasonable ranks. A doc that is rank-1 on vector and absent from BM25 may lose to a doc ranked 3 on vector and 2 on BM25. That is usually correct — agreement across modalities is signal. If it is not, your BM25 analyzer is wrong (no stemming, wrong language) or your vector index has too few neighbors.

7. Where should filters run: before or after retrieval? Push filters as early as possible. In Azure AI Search, put the filter in the hybrid request so tenant/category/date constraints apply before the vector query runs (vectorFilterMode: preFilter). Post-filter is appropriate only when the filter is not very selective and pre-filter latency on a large index is the problem; with a selective filter, post-filtering can return fewer than k results. Pre-filter cuts cost and avoids cross-tenant leakage.

8. The k constant in RRF: does it matter? Almost never. k=60 is the default from the original RRF paper and the one most implementations ship with. Lower k (20-40) sharpens emphasis on top-ranked items; higher k (80-120) flattens it. I have never seen a production lift from tuning it, so spend that time on chunking and reranker selection instead.
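A quick calculation shows why k rarely matters: it only changes how much more a rank-1 hit contributes than a rank-10 hit, and that ratio moves little across the usual range. `TopVsTenthRatio` is an illustrative helper, not part of any library.

```csharp
using System;

// Ratio between what rank 1 and rank 10 contribute to an RRF score for a given k.
static double TopVsTenthRatio(int k) => (1.0 / (k + 1)) / (1.0 / (k + 10));

foreach (var k in new[] { 20, 60, 100 })
    Console.WriteLine($"k={k}: rank 1 counts {TopVsTenthRatio(k):F2}x as much as rank 10");
```

The ratio simplifies to (k + 10) / (k + 1), so it goes from about 1.43 at k=20 to about 1.09 at k=100: a modest reweighting, not a change in which documents are eligible.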

9. Sparse-dense retrieval: is BM25 enough or do I need SPLADE? For 95% of enterprise RAG, BM25 + dense + reranker hits the ceiling. SPLADE / learned sparse retrieval shines on long-tail domain vocabularies the embedding model under-fits: legal, biomedical, deep technical. The cost is extra model inference at index time (and, for most variants, at query time) plus a sparse index in your store. Try it only after you have measured BM25 hitting its ceiling on your eval set.

10. How do you cache reranker calls without poisoning quality? Key the cache on (query_hash, doc_id_set_hash, reranker_model_version). Hash the set of candidate doc IDs, not the order — same candidates in a different L1 order produce the same reranker output. Bound TTL by your content's freshness window (24h is typical for RAG). Invalidate on model upgrade by bumping the version segment.
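One way to build such a key is sketched below. The helper names are hypothetical, SHA-256 is just one reasonable hash, and trimming plus lower-casing the query is an assumption that only holds if your reranker's output does not depend on casing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static string Sha256Hex(string text) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

// Cache key = hash(normalized query) + hash(sorted candidate IDs) + model version.
static string RerankCacheKey(string query, IEnumerable<string> candidateIds, string modelVersion)
{
    // Assumption: casing and surrounding whitespace do not change reranker output.
    var normalizedQuery = query.Trim().ToLowerInvariant();

    // Sort so the key depends on the set of candidates, not on first-stage order.
    var sortedIds = candidateIds.Distinct().OrderBy(id => id, StringComparer.Ordinal);
    var joinedIds = string.Join("\n", sortedIds);

    return Sha256Hex(normalizedQuery) + ":" + Sha256Hex(joinedIds) + ":" + modelVersion;
}

var keyA = RerankCacheKey("Reset MFA", new[] { "d7", "d2", "d9" }, "rerank-multilingual-v3.0");
var keyB = RerankCacheKey("reset mfa ", new[] { "d9", "d7", "d2" }, "rerank-multilingual-v3.0");
Console.WriteLine(keyA == keyB); // same query and candidate set, different order
```

Keeping the model version as a plain suffix makes invalidation on upgrade a matter of changing one string, and makes stale entries easy to spot when debugging.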

11. Streaming results: can you start the LLM before reranking finishes? Sort of: you can speculatively prompt with the L1 top-3 while the reranker resolves the full top-K, then re-prompt if the reranker swaps the top doc. In practice the engineering cost outweighs the latency win unless you are at very high QPS. Simpler: overlap the reranker call with prompt template construction; next to LLM first-token latency, the reranker is rarely the largest contributor end to end.

12. How would you know it is time to drop hybrid in favor of vector-only? Run a holdout test where you remove the BM25 arm and remeasure. If recall@10 and final answer faithfulness move within noise, BM25 is not buying you anything on this corpus — usually because the content is prose-heavy and the embedding model already handles synonyms well. Drop the keyword index, save the storage and the index-time cost.

Gotchas / common mistakes

Sending vector and BM25 to two different stores without a stable join key. RRF needs the same document IDs across both result lists; mismatched IDs silently corrupt fusion.
Forgetting IsFullTextIndexed = true on the field you fuse against — your "hybrid" call quietly returns vector-only.
Tuning k (RRF constant) before tuning chunk size and embedding model. You are polishing the wrong knob.
Sending the entire chunk to the reranker. Most rerankers cap at 512-2048 tokens per doc; longer inputs are truncated, often before the relevant span.
Caching reranker results without versioning. A model upgrade silently serves stale orderings. Always include the model version in the cache key.
Reranking after a poor L1. If recall@50 is already low, no reranker can save you — fix retrieval first.
Trusting offline lift without an online check. Eval sets drift from real query distributions; ship behind a flag and watch faithfulness in production.

Senior considerations

Always evaluate: hybrid + rerank on your data, not blog posts.
Filter before retrieve: tenant, category — saves cost.
Cache reranker results for repeated queries.
Monitor: track per-stage latency / cost.