Vector DB Tuning
Key Points
- 💡 The senior question is never "which vector DB is fastest?" — it is "which knobs do I turn for my corpus, latency budget, and recall target?"
- HNSW is the default index everywhere (Pgvector 0.5+, Qdrant, Azure AI Search). Knobs: M (graph density / memory), efConstruction (build accuracy), ef / efSearch (query accuracy / latency).
- IVF partitions vectors into clusters. Knobs: nlist (cluster count), nprobe (clusters scanned). Cheaper memory, lower recall ceiling than HNSW.
- DiskANN (Microsoft, used by Azure AI Search and Cosmos DB) keeps the index on SSD: low memory cost, great for >100M vectors.
- Quantization is the second-biggest cost lever. PQ compresses 8-32×, SQ drops float32 → int8 / int4. Lossy: int8 SQ usually costs 1-2% recall, and the more aggressive schemes cost more unless you rerank.
- Memory budget math first: 1M × 1536-d × 4 bytes = 6 GB raw; with int8 SQ ~1.5 GB; with PQ-32 ~190 MB.
- Filtered search without a payload/secondary index forces a full scan. Always index the fields you filter on (tenant_id, status, dates).
- Reindex when the embedding model or schema changes; upsert incrementally otherwise.
- ⚠️ p99 < 100 ms is aggressive for vector search; <500 ms is reasonable. Above that, your corpus is too big for the index strategy you picked.
Concepts (deep dive)
The index-type cheat sheet
| Corpus size | Memory cheap? | Latency target | Pick |
|---|---|---|---|
| < 100k | doesn't matter | any | Brute force / flat |
| 100k - 1M | yes | <50 ms | HNSW in-memory (float32) |
| 1M - 100M | sometimes | <100 ms | HNSW + SQ-int8 / PQ |
| > 100M | no | <500 ms | DiskANN, or sharded HNSW |
Brute force is criminally underused. For < 100k vectors, cosine over a float[][] in memory is ~5 ms and 0% recall loss. Most "we need a vector DB" projects do not actually need one.
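To make the brute-force option concrete, here is a minimal sketch of exact top-K cosine search over an in-memory corpus. The function names and the toy corpus are illustrative assumptions, not part of any library. Pure Python will not reach the ~5 ms figure at 100k vectors, so a real service would use a vectorized or SIMD implementation of the same loop.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def brute_force_top_k(query, corpus, k=10):
    """corpus: dict of doc_id -> vector. Returns [(doc_id, similarity)], best first."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in corpus.items())
    return [(doc_id, sim) for sim, doc_id in heapq.nlargest(k, scored)]

# Toy corpus (hypothetical data): every vector is scored, so recall is exact.
corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
hits = brute_force_top_k([1.0, 0.1], corpus, k=2)
```

Because every vector is scored, there is nothing to tune: no index parameters and no recall loss.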
HNSW — the workhorse
Hierarchical Navigable Small World: a multi-layer graph where each node connects to up to M of its nearest neighbours. Search starts at the sparse top layer and walks down, refining its entry point at each level until it reaches the full graph at layer 0.
Layer 2: A ----------------- F ----------------- K   (sparse, long jumps)
         |                   |                   |
Layer 1: A - B - C --------- F - G - H --------- K   (medium)
         |   |   |           |   |   |           |
Layer 0: A - B - C - D - E - F - G - H - I - J - K   (full graph)
| Knob | Build / Query | Effect |
|---|---|---|
| M | Build | Edges per node. Higher = better recall, more memory. Typical range 16-64. |
| efConstruction | Build | Candidates considered while building. Higher = slower build, better graph. Typical range 100-400. |
| ef / efSearch | Query | Candidates considered per query. Higher = better recall, slower query. The runtime knob. |
Tuning recipe: fix M=32, efConstruction=200, then sweep ef from 32 → 512 against your eval set. Stop when recall@10 plateaus. Typical sweet spot: ef=64-128 for in-memory, ef=200-400 if you need >0.95 recall.
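The sweep above can be sketched as a small harness. This is an illustration under assumptions: `search(query, ef, k)` stands in for whatever call your store exposes, `ground_truth` holds exact top-K ids from a brute-force pass, and `min_gain` is a hypothetical plateau threshold. The fake backend exists only so the sketch runs on its own.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the true top-k ids that appear in the retrieved top-k."""
    truth = set(relevant[:k])
    return len(truth & set(retrieved[:k])) / len(truth) if truth else 1.0

def sweep_ef(search, queries, ground_truth, ef_values, k=10, min_gain=0.005):
    """Try ef values in order; stop once mean recall@k improves by less than min_gain.

    Returns (chosen_ef, history) where history is a list of (ef, mean_recall).
    """
    history = []
    for ef in ef_values:
        recalls = [recall_at_k(search(q, ef, k), ground_truth[i], k)
                   for i, q in enumerate(queries)]
        mean = sum(recalls) / len(recalls)
        plateaued = bool(history) and mean - history[-1][1] < min_gain
        history.append((ef, mean))
        if plateaued:
            return history[-2][0], history  # the previous ef was already enough
    return history[-1][0], history

# Fake backend (assumption): larger ef finds more of the true neighbours.
def fake_search(query, ef, k):
    found = min(k, ef // 16)
    return list(range(found)) + [-1 - i for i in range(k - found)]

chosen_ef, history = sweep_ef(fake_search, ["q"], [list(range(10))],
                              [32, 64, 128, 256, 512])
```

In practice you would also record latency next to each recall value, since the plateau only matters relative to your latency budget.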
IVF — the budget option
Inverted File index: cluster vectors into nlist Voronoi cells using k-means; at query time scan only the nprobe nearest cells.
| Knob | Effect |
|---|---|
| nlist | Number of clusters. Heuristic: ~sqrt(N) for N vectors. |
| nprobe | Clusters scanned per query. Higher = recall up, latency up. |
Pgvector's ivfflat was the only ANN index type before 0.5.0; in 0.5+ prefer HNSW unless memory is binding.
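A toy sketch of how nlist and nprobe interact, assuming the centroids have already been produced by k-means (the clustering step is omitted). All names and data are illustrative. The example is built so the true nearest neighbour sits just across a cell boundary, which is exactly the case that caps IVF recall at low nprobe.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid: one inverted list per cell."""
    lists = [[] for _ in centroids]
    for vid, vec in vectors.items():
        cell = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[cell].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cells = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in cells for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]

# nlist = 2 cells; "a" is the true nearest neighbour but lives in the other cell.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {"a": (4.9, 0.0), "b": (5.2, 0.0), "c": (0.0, 0.0), "d": (10.0, 0.0)}
lists = build_ivf(vectors, centroids)
query = (5.04, 0.0)

narrow = ivf_search(query, vectors, centroids, lists, nprobe=1, k=1)  # misses "a"
wide = ivf_search(query, vectors, centroids, lists, nprobe=2, k=1)    # finds "a"
```

Raising nprobe to nlist degenerates into a full scan, so the useful range is the small values where recall climbs fastest.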
DiskANN — for >100M vectors
DiskANN (Microsoft Research, used in Azure AI Search vector mode and Cosmos DB) keeps the graph on SSD with a small in-memory cache layer. Trade ~2× latency for ~10× memory savings vs HNSW. Perfect for corpus sizes where loading the full HNSW index into RAM is infeasible (e.g., a 500M-document enterprise search).
Quantization — the second-biggest lever
| Method | Compression | Recall hit | Notes |
|---|---|---|---|
| Float32 (raw) | 1× | 0% | Baseline. |
| Float16 | 2× | <0.5% | Cheap win, broadly supported. |
| Int8 SQ (Scalar Quantization) | 4× | 1-2% | Native in Qdrant and Azure AI Search. Pgvector offers halfvec (float16) and binary quantization rather than an int8 type. |
| Int4 SQ | 8× | 2-5% | Newer; Qdrant supports. |
| PQ-8 (Product Quantization) | 16-32× | 2-7% | Heavily lossy; usually combined with rerank-via-raw. |
| Binary (bit-packed) | 32× | 5-15% | Aggressive; only with reranking. |
The classic high-recall pipeline: search PQ-compressed vectors for top-200 candidates, then rerank by recomputing exact float32 distance for those 200. Best of both worlds — small index, near-raw recall.
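A minimal sketch of the two-stage pipeline. To stay short it swaps in int8 scalar quantization for the compressed stage instead of PQ; the shortlist-then-rerank structure is the same. It assumes unit-normalized vectors with components in [-1, 1] and dot-product similarity, and every name here is illustrative.

```python
def quantize_int8(vec, lo=-1.0, hi=1.0):
    """Map each float in [lo, hi] linearly to an integer in [-127, 127]."""
    scale = 254.0 / (hi - lo)
    return [max(-127, min(127, round((x - lo) * scale) - 127)) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_then_rerank(query, raw, codes, shortlist=200, k=10):
    """raw: id -> float vector, codes: id -> int8 code. Returns the top-k ids."""
    q_code = quantize_int8(query)
    # Stage 1: cheap approximate scoring against the compressed codes.
    coarse = sorted(codes, key=lambda i: dot(q_code, codes[i]), reverse=True)[:shortlist]
    # Stage 2: exact float scoring, but only for the shortlist.
    return sorted(coarse, key=lambda i: dot(query, raw[i]), reverse=True)[:k]

raw = {0: [0.6, 0.8], 1: [0.8, 0.6], 2: [-1.0, 0.0], 3: [0.0, 1.0]}
codes = {i: quantize_int8(v) for i, v in raw.items()}
top = search_then_rerank([0.6, 0.8], raw, codes, shortlist=3, k=2)
```

The shortlist size is the knob: it has to be large enough that the true top-K survive stage 1, which is why 200 candidates for a top-10 is a common starting ratio.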
Memory budget math
| Setup | Per-vector | 1M vectors | 100M vectors |
|---|---|---|---|
| float32, dim=1536 | 6.0 KB | 6.0 GB | 600 GB |
| float16, dim=1536 | 3.0 KB | 3.0 GB | 300 GB |
| int8 SQ, dim=1536 | 1.5 KB | 1.5 GB | 150 GB |
| PQ-32, dim=1536 | ~190 B | ~190 MB | 19 GB |
Add ~30-50% for HNSW graph edges. Multiply by 2-3 if you replicate.
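The table and the overhead rule can be folded into one back-of-the-envelope helper. This is a rough sketch: it applies the graph overhead as a fraction of the stored vector bytes (an assumption, since edge lists do not shrink when vectors are quantized), uses decimal GB, and the 0.4 default is simply the midpoint of the 30-50% range above.

```python
def index_memory_gb(n_vectors, dim, bytes_per_component=4.0,
                    graph_overhead=0.4, replicas=1):
    """Rough RAM estimate in decimal GB: vectors plus HNSW edges, times replicas."""
    vector_bytes = n_vectors * dim * bytes_per_component
    return vector_bytes * (1.0 + graph_overhead) * replicas / 1e9

raw_gb = index_memory_gb(1_000_000, 1536, graph_overhead=0.0)      # float32 vectors only
hnsw_gb = index_memory_gb(1_000_000, 1536)                          # float32 plus edges
int8_x2_gb = index_memory_gb(100_000_000, 1536, 1.0, replicas=2)    # int8, two replicas

for label, gb in [("1M float32, no graph", raw_gb),
                  ("1M float32 + HNSW", hnsw_gb),
                  ("100M int8 + HNSW, x2 replicas", int8_x2_gb)]:
    print(f"{label}: {gb:,.1f} GB")
```

Treat the result as a sizing floor; allocator overhead, payload storage, and the OS page cache all sit on top of it.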
Pgvector specifics
- vector extension; CREATE EXTENSION vector;.
- Two index types: ivfflat (legacy) and hnsw (since 0.5.0; pick this).
- HNSW DDL: CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
- Query-time ef: SET hnsw.ef_search = 100; per session.
- IVF sizing: lists ≈ rows / 1000 for up to 1M rows, sqrt(rows) for larger tables.
- ⚠️ ivfflat requires training data: build the index after the table is populated, not before.
Qdrant specifics
- HNSW per collection; m, ef_construct, ef configurable.
- Payload indexes are critical: filtering on un-indexed fields forces a full scan. Always create_payload_index for tenant_id, status, etc.
- Built-in scalar / product / binary quantization, toggled via collection config.
- Filterable HNSW (Qdrant's "filterable index") prunes during graph traversal, which is much faster than post-filter.
Azure AI Search specifics
- vector field with algorithm: "hnsw" or "exhaustiveKnn".
- Vector + BM25 hybrid via search + vectorQueries in one call; RRF fusion built in.
- Quantization: scalar (int8) and binary supported as compression.
- There is no separate DiskANN algorithm value to select in the index definition; the documented options are "hnsw" and "exhaustiveKnn". (Verify on Microsoft Learn.)
- Per-replica HNSW; scale-out by partition + replica.
Filter-then-search vs search-then-filter
Pre-filter (preferred):        Post-filter (fragile):
+--------------------+         +--------------------+
| filter:            |         | vector search:     |
|   tenant=42        |         |   top 200          |
|   active=true      |         +--------------------+
+--------------------+                   |
          |                              v
          v                    +--------------------+
+--------------------+         | filter             |
| vector search      |         |   tenant=42        |
| over subset        |         |   active=true      |
+--------------------+         +--------------------+
          |                              |
          v                              v
        Top-K                   Maybe < K results!
Always pre-filter. Post-filtering can return fewer than K results, or zero, when the filter is selective.
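The failure mode is easy to reproduce. The sketch below uses exact distance in place of an ANN index so it stays self-contained; the document layout (`id`, `vec`, `tenant`) and both function names are assumptions made for illustration.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def post_filter(query, docs, predicate, k, fetch):
    """Vector search first (top `fetch`), then drop rows that fail the filter."""
    nearest = sorted(docs, key=lambda d: l2(query, d["vec"]))[:fetch]
    return [d["id"] for d in nearest if predicate(d)][:k]

def pre_filter(query, docs, predicate, k):
    """Filter first, then rank only the surviving rows."""
    subset = [d for d in docs if predicate(d)]
    return [d["id"] for d in sorted(subset, key=lambda d: l2(query, d["vec"]))[:k]]

# 100 docs on a line; only 4 of them (ids 24, 49, 74, 99) belong to tenant 42.
docs = [{"id": i, "vec": (float(i),), "tenant": 42 if i % 25 == 24 else 7}
        for i in range(100)]

def is_tenant_42(d):
    return d["tenant"] == 42

post = post_filter((0.0,), docs, is_tenant_42, k=3, fetch=30)  # only one survivor
pre = pre_filter((0.0,), docs, is_tenant_42, k=3)              # a full top-3
```

Raising `fetch` hides the problem for a while, but the required over-fetch grows as the filter gets more selective, which is why filtering before or during traversal is the robust option.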
How it works under the hood
HNSW build
for each new vector v:
    assign random level L (geometric distribution)
    enter at top layer, greedy-walk to nearest neighbour at each layer
    at each layer ≤ L: connect v to up to M nearest neighbours
    prune over-connected nodes to keep degree bounded
Build cost is O(N · log N · efConstruction). Memory is O(N · M) edges plus the vectors.
HNSW search
start at entry point in top layer
at each layer above 0:
    greedy-walk: move to the neighbour closest to the query until none is closer
    descend one layer, keeping that node as the entry point
at layer 0: run beam search with width=ef
return top-K from the beam
Search cost is O(log N · ef · D) for D-dimensional vectors.
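The layer-0 beam search is the step where ef does its work. Below is a minimal single-layer sketch, assuming the graph is a plain adjacency dict and the entry point is given; upper layers, neighbour selection, and pruning are omitted. It follows the usual two-heap formulation, but it is an illustration rather than any particular library's implementation.

```python
import heapq
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def beam_search(graph, vectors, query, entry, ef, k):
    """Best-first search over one graph layer, keeping at most `ef` results."""
    d0 = l2(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negation: worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(results) >= ef and dist > -results[0][0]:
            break                # nothing left can improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(vectors[nb], query)
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # evict the current worst
    ranked = sorted((-neg_d, n) for neg_d, n in results)
    return [n for _, n in ranked][:k]

# Toy graph (hypothetical): ten points on a line, each linked to its neighbours.
vectors = {i: (float(i),) for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
nearest = beam_search(graph, vectors, (7.2,), entry=0, ef=4, k=3)
```

A larger ef keeps more candidates alive, so the search explores further before the stop condition fires: better recall, more distance computations.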
Code: correct vs wrong
✅ Correct: Pgvector HNSW with ef_search per query
CREATE INDEX ON docs
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Per query, raise ef for higher recall on a hard query.
-- SET LOCAL only takes effect inside a transaction block.
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT id, 1 - (embedding <=> $1) AS sim
FROM docs
WHERE tenant_id = $2 AND active  -- filter; keep a B-tree index on tenant_id
ORDER BY embedding <=> $1
LIMIT 10;
COMMIT;
❌ Wrong: building HNSW on an empty table
CREATE TABLE docs (id bigserial, embedding vector(1536));
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops); -- index exists before any rows
INSERT INTO docs ... -- every insert now pays incremental graph-build cost
Build the index after bulk-loading, or accept slow inserts.
✅ Correct: Qdrant with payload index + scalar quantization
await client.CreateCollectionAsync("docs", new CreateCollection
{
    VectorsConfig = new VectorParams
    {
        Size = 1536,
        Distance = Distance.Cosine,
        QuantizationConfig = new ScalarQuantization
        {
            Type = ScalarQuantizationType.Int8,
            AlwaysRam = true // keep quantized vectors in RAM
        },
        HnswConfig = new HnswConfigDiff { M = 32, EfConstruct = 256 }
    }
});
// Index the field we filter on, or filtered queries collapse to scan.
await client.CreatePayloadIndexAsync("docs", "tenant_id", PayloadSchemaType.Keyword);
❌ Wrong: filter without payload index
// "tenant_id" has no payload index; this forces a full collection scan.
await client.SearchAsync("docs", queryEmb, filter: Match("tenant_id", "acme"), limit: 10);
✅ Correct: Microsoft.Extensions.VectorData with provider hints
public sealed class Doc
{
    [VectorStoreKey] public Guid Id { get; set; }
    [VectorStoreData(IsIndexed = true)] public string TenantId { get; set; } = "";
    [VectorStoreData] public string Text { get; set; } = "";
    [VectorStoreVector(Dimensions: 1536, DistanceFunction = DistanceFunction.CosineSimilarity,
        IndexKind = IndexKind.Hnsw)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}
IsIndexed = true flags fields the provider should secondary-index for filtering — the provider-agnostic equivalent of "create payload index".
Design patterns for this topic
Pattern 1 — "Quantize + rerank exact"
- Intent: small index, raw-quality top-K.
- Tactics: PQ-compressed search for top-200, then re-score the 200 with full float32 distance, keep top-10.
Pattern 2 — "Hot tier / cold tier"
- Intent: keep low-latency budget on the 5% of vectors users actually hit.
- Tactics: HNSW in-memory for "recent / important" partition; DiskANN for the long tail.
Pattern 3 — "Pre-filter via partition key"
- Intent: multi-tenant isolation without hurting recall.
- Tactics: one collection (or shard) per tenant, or use built-in tenant-aware filtering on tenant_id with a payload index.
Pattern 4 — "Versioned indexes for embedding-model upgrade"
- Intent: swap embedding models without downtime.
- Tactics: build docs_v2 alongside docs_v1; backfill; flip read traffic; drop v1 after a soak period.
Pattern 5 — "Raise ef per query class"
- Intent: spend latency only where it matters.
- Tactics: ef=64 for autocomplete, ef=256 for "show me everything we know about X". Per-call, not globally.
Pros & cons / trade-offs
| Aspect | HNSW | IVF | DiskANN |
|---|---|---|---|
| Memory | High | Medium | Low |
| Build time | Slow | Fast | Slow |
| Insert/update cost | Cheap | Re-cluster expensive | Expensive |
| Recall ceiling (tuned) | ~0.99 | ~0.95 | ~0.97 |
| Best for | <100M vectors, latency-critical | Frequent rebuilds OK | >100M vectors, memory-bound |
When to use / when to avoid
Tune (don't just raise hardware) when:
- p99 search latency exceeds your budget by >2×: you are paying for a misconfigured index.
- Recall@10 against your eval set is <0.85: you are missing relevant docs the embedding model knows about.
- Memory cost is dominating the line item: quantize before scaling out.

Skip the tuning rabbit hole when:
- Corpus is < 100k vectors: brute force in process and move on.
- Your retrieval bottleneck is upstream (chunking, embedding model, query phrasing); see Embedding Model Evaluation.
- You have no offline eval: you cannot measure what tuning bought you.
Interview Q&A
1. M vs ef in HNSW — when do you tune which? M is a build-time decision: it sets graph density and therefore memory and the recall ceiling. You change it rarely, and changing it means rebuilding the index. ef (or efSearch) is the runtime knob: it sets how many candidates are considered per query. You tune it per workload — ef=64 for cheap exploratory queries, ef=256 when you need >0.95 recall on the hard ones. Start with M=16-32 and sweep ef against your eval set first; only revisit M if ef cannot reach the recall you need.
2. When is IVF preferable to HNSW? Three cases. (a) Memory is binding and you cannot quantize for some reason: IVF's flat storage is leaner than HNSW's edge graph. (b) You re-cluster the corpus often (frequent bulk reload): IVF rebuilds faster than HNSW. (c) Legacy Pgvector pre-0.5: ivfflat was the only choice. In 2026 on modern stores, HNSW with int8 quantization beats IVF on almost every axis.
3. What is DiskANN and when do you reach for it? DiskANN is Microsoft Research's SSD-resident graph index, used by Azure AI Search and Cosmos DB at large scale. It keeps the graph on disk with a small in-memory cache, trading ~2× query latency for ~10× memory savings vs HNSW. Reach for it above ~100M vectors when loading the full HNSW into RAM is impractical or expensive.
4. PQ vs SQ vs binary quantization — what is the trade-off? Scalar quantization (int8) is the safe default — 4× smaller, <2% recall hit, supported everywhere. Product quantization splits vectors into sub-spaces and codebooks them — 16-32× smaller, 2-7% recall hit, usually combined with a re-rank-by-exact-distance step. Binary quantization (1 bit per dim) is 32× smaller and only works combined with reranking. Pick SQ as default; PQ when memory is binding; binary only for huge corpora with rerank.
5. Why does my filtered query become 100× slower? You filtered on a field with no secondary / payload index. The store cannot intersect the filter with the graph traversal, so it falls back to either post-filtering (scan top-N then drop most) or a full scan. Always create a payload/secondary index on the columns you filter on (tenant_id, status, created_at). In Pgvector this is a separate B-tree; in Qdrant it is create_payload_index; in Azure AI Search it is filterable: true on the field.
6. How do you size memory for a 10M-vector / 1536-dim corpus? Raw float32: 10M × 1536 × 4 = 60 GB. HNSW edges with M=32: ~5-10 GB more. Replicate ×2: 130-140 GB total. With int8 SQ: cut by 4× to ~35 GB total — fits on a single mid-size box. With PQ-32: ~2-4 GB — fits on a laptop, but you need re-rank for quality. Always run the math before picking the SKU.
7. When do you reindex vs upsert? Upsert (incremental) when content changes but the embedding model and schema are stable — Pgvector, Qdrant, Azure AI Search all handle in-place updates fine. Reindex (rebuild from scratch) when you change the embedding model, the dimension, the distance function, or the index parameters that need a rebuild (M, efConstruction). The right pattern is versioned indexes — docs_v2 alongside docs_v1, backfill, swap reads atomically, retire v1 after a soak.
8. Pgvector lists for ivfflat: how do you size it? Heuristic: lists ≈ rows / 1000 for tables up to 1M rows, sqrt(rows) for larger tables. But this is the legacy path. In Pgvector 0.5.0+ you should use HNSW unless memory really is binding. The HNSW knobs (m, ef_construction, hnsw.ef_search) are simpler to reason about and the recall ceiling is higher.
9. p99 100 ms vector search — what would you blame first? In order: (1) ef/efSearch too high — drop it and re-measure recall, you may be over-paying. (2) No payload/secondary index on filter fields — most "slow vector search" is actually slow filtering. (3) Cold-cache misses — HNSW likes hot RAM; warm it on startup. (4) Embedding the user query at request time without caching identical queries. (5) Replicating across regions and the index lives on the wrong one. Vector ANN itself is rarely the actual bottleneck below 10M vectors.
10. How do you migrate from one embedding model to another without downtime? Versioned indexes. Stand up docs_v2 next to docs_v1, backfill v2 in batches (cost: re-embedding the corpus — non-trivial above 100M docs). Run shadow reads against both, compare recall on your eval set. When v2 wins, flip writes to v2 only, keep dual reads for a soak window, then retire v1. Plan for the embedding cost up front — re-embedding 100M docs at $0.0001 / 1k tokens with ~500 tokens per doc is ~$5k.
11. Filterable HNSW vs post-filter — does it matter? A lot, especially for selective filters. Naive post-filter takes top-N from the graph then drops most of them — if your filter selects 1% of docs and you take top-100, you get ~1 result. Filterable HNSW (Qdrant supports it; Azure AI Search has vectorFilterMode: preFilter) prunes during traversal so the top-K after filtering is actually K. Always check what your store does by default and configure pre-filter explicitly when available.
12. Why might the same query return different top docs across two shards? HNSW is approximate. Different shards have different graph structures, so the K nearest in their shard may not be the global K nearest. The fix is to over-fetch per shard and merge: ask each shard for top K * over_fetch_factor, merge, take global top-K. Most production stores do this for you with a tunable oversample parameter.
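The over-fetch-and-merge fix from question 12 can be sketched in a few lines. Each shard is modelled as a callable returning `(distance, doc_id)` pairs sorted ascending; that interface, the `over_fetch` parameter name, and the canned shard results are assumptions for illustration only.

```python
import heapq

def merge_shard_results(shard_hits, k):
    """shard_hits: one [(distance, doc_id)] list per shard. Returns the global top-k."""
    return heapq.nsmallest(k, (hit for hits in shard_hits for hit in hits))

def sharded_search(shards, query, k, over_fetch=3):
    """Ask every shard for k * over_fetch hits, then merge down to k."""
    per_shard = [search(query, k * over_fetch) for search in shards]
    return merge_shard_results(per_shard, k)

# Two stand-in shards with canned results (hypothetical data).
def shard_a(query, n):
    return [(0.1, "a1"), (0.4, "a2"), (0.9, "a3")][:n]

def shard_b(query, n):
    return [(0.2, "b1"), (0.3, "b2"), (0.5, "b3")][:n]

merged = sharded_search([shard_a, shard_b], query=None, k=3, over_fetch=1)
```

The over-fetch factor trades network and merge cost for a lower chance that an approximate shard-local search drops a document that belongs in the global top-K.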
Gotchas / common mistakes
- ⚠️ Building HNSW before bulk-loading. The graph is empty; every insert pays full graph cost. Bulk-load first, index second.
- ⚠️ Filtering on un-indexed fields. Vector search collapses to a full scan; latency 100×.
- ⚠️ Tuning ef globally. Different query classes need different ef; set per-call.
- ⚠️ Quantizing without measuring recall on your eval set. Default int8 is usually safe; PQ-32 may quietly drop recall 5-10% on your domain.
- ⚠️ Forgetting graph overhead in memory math. Add ~30-50% on top of raw vector size for HNSW edges.
- ⚠️ Mismatched distance functions across components. Embedding model trained for cosine, but you indexed with L2 → silently degraded recall.
- ⚠️ No version segment in your index name. When the embedding model changes, you cannot run two indexes side-by-side, so you cannot migrate without downtime.
- ⚠️ Over-replicating an in-memory index. 3 replicas × 60 GB index = 180 GB committed. Use DiskANN or quantize.
- ⚠️ Not warming the index after deploy. First queries on a cold container hit disk; p99 spikes for ~5 minutes.
- ⚠️ Trusting blog-post benchmarks. Always measure on your data, your filters, your hardware.
Further reading
- Pgvector: HNSW indexes
- Qdrant HNSW configuration
- Qdrant quantization
- Azure AI Search: vector index algorithms
- Azure AI Search: vector quantization and storage
- DiskANN: Fast Accurate Billion-point Nearest Neighbor Search (Microsoft Research)
- HNSW paper (Malkov & Yashunin)
- Microsoft.Extensions.VectorData abstractions
API note: Microsoft.Extensions.VectorData exposes IndexKind.Hnsw, IndexKind.IvfFlat, IndexKind.DiskAnn, and IndexKind.Flat as portable hints; the underlying provider picks its best match. (Verify on Microsoft Learn before pinning a version.)