Vector DB Tuning

Key Points

💡 The senior question is never "which vector DB is fastest?" — it is "which knobs do I turn for my corpus, latency budget, and recall target?"
HNSW is the default index everywhere (Pgvector 0.5+, Qdrant, Azure AI Search). Knobs: M (graph density / memory), efConstruction (build accuracy), ef/efSearch (query accuracy/latency).
IVF partitions vectors into clusters. Knobs: nlist (cluster count), nprobe (clusters scanned). Cheaper memory, lower recall ceiling than HNSW.
DiskANN (Microsoft, used by Azure AI Search and Cosmos DB) keeps the index on SSD — low memory cost, great for >100M vectors.
Quantization is the second-biggest cost lever. PQ compresses 16-32×, SQ drops float32 → int8 / int4. Lossy: int8 SQ typically costs 1-2% recall; PQ costs more and is usually paired with an exact rerank.
Memory budget math first: 1M × 1536-d × 4 bytes = 6 GB raw; with int8 SQ ~1.5 GB; with PQ-32 ~190 MB.
Filtered search without a payload/secondary index forces a full scan. Always index the fields you filter on (tenant_id, status, dates).
Reindex when the embedding model or schema changes; upsert incrementally otherwise.
⚠️ p99 < 100 ms is aggressive for vector search; <500 ms is reasonable. Above that, your corpus is too big for the index strategy you picked.

Concepts (deep dive)

The index-type cheat sheet

Corpus size    Memory cheap?    Latency target   Pick
-----------    -------------    --------------   ----
< 100k         doesn't matter   any              Brute force / flat
100k - 1M      yes              <50 ms           HNSW in-memory (float32)
1M - 100M      sometimes        <100 ms          HNSW + SQ-int8 / PQ
> 100M         no               <500 ms          DiskANN, or sharded HNSW

Brute force is criminally underused. For < 100k vectors, cosine over a float[][] in memory is ~5 ms and 0% recall loss. Most "we need a vector DB" projects do not actually need one.
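To make the brute-force option concrete, here is a minimal sketch of exact top-k by cosine similarity in plain Python. The function name `cosine_top_k` and the toy corpus are illustrative assumptions; a real implementation would normally use a vectorized library, and the timing claim above is not reproduced here.

```python
import heapq
import math

def cosine_top_k(query, vectors, k=10):
    """Exact top-k by cosine similarity; returns [(index, similarity), ...]."""
    q_norm = math.sqrt(sum(x * x for x in query))
    scored = []
    for idx, vec in enumerate(vectors):
        dot = sum(a * b for a, b in zip(query, vec))
        v_norm = math.sqrt(sum(x * x for x in vec))
        # Treat zero-length vectors as similarity 0 instead of dividing by zero.
        sim = dot / (q_norm * v_norm) if q_norm and v_norm else 0.0
        scored.append((idx, sim))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-d corpus (hypothetical data, for illustration only).
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = cosine_top_k([1.0, 0.1], corpus, k=2)
print(hits)
```

Because every vector is scored, there is no index to build and no recall to tune; the only cost is the linear scan.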

HNSW — the workhorse

Hierarchical Navigable Small World: a multi-layer graph where each node connects to roughly its M nearest neighbours. Search starts at the sparse top layer and walks down to the dense bottom layer.

Layer 2:    A --------- F --------- K           (sparse, long jumps)
            |           |           |
Layer 1:    A - B - C - F - G - H - K           (medium)
            |   |   |   |   |   |   |
Layer 0:    A - B - C - D - E - F - G - H - I - J - K   (full graph)

Knob	Build / Query	Effect
`M`	Build	Edges per node. Higher = better recall, more memory. Default 16-64.
`efConstruction`	Build	Candidates considered while building. Higher = slower build, better graph. Default 100-400.
`ef` / `efSearch`	Query	Candidates considered per query. Higher = better recall, slower query. The runtime knob.

Tuning recipe: fix M=32, efConstruction=200, then sweep ef from 32 → 512 against your eval set. Stop when recall@10 plateaus. Typical sweet spot: ef=64-128 for in-memory, ef=200-400 if you need >0.95 recall.
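The sweep in this recipe could be automated roughly as below. This is a sketch under assumptions: `search(query, ef, k)` is a hypothetical callable wrapping whatever query API your store exposes, `ground_truth` holds exact top-k ids computed offline (for example by brute force), and the `min_gain` plateau threshold is an arbitrary choice.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the true top-k ids that appear in the retrieved top-k."""
    truth = set(relevant[:k])
    return len(truth & set(retrieved[:k])) / len(truth) if truth else 0.0

def sweep_ef(search, queries, ground_truth, ef_values, k=10, min_gain=0.005):
    """Raise ef until mean recall@k stops improving by at least min_gain."""
    history = []
    for ef in ef_values:
        mean_recall = sum(
            recall_at_k(search(q, ef, k), truth, k)
            for q, truth in zip(queries, ground_truth)
        ) / len(queries)
        history.append((ef, mean_recall))
        if len(history) > 1 and mean_recall - history[-2][1] < min_gain:
            # Plateau: the previous, cheaper ef is the better pick.
            return history[-2][0], history
    return history[-1][0], history

# Fake search function standing in for a real index (hypothetical behaviour):
# it finds more of the true neighbours as ef grows.
def fake_search(query, ef, k):
    found = min(k, ef // 16 + 5)
    return list(range(found)) + [100 + i for i in range(k - found)]

best_ef, history = sweep_ef(fake_search, ["q1", "q2"], [list(range(10))] * 2,
                            ef_values=[32, 64, 128, 256])
print(best_ef, history)
```

Returning the full history as well as the chosen value makes it easy to plot recall against ef and sanity-check where the curve flattens.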

IVF — the budget option

Inverted File index: cluster vectors into nlist Voronoi cells using k-means; at query time scan only the nprobe nearest cells.

Knob	Effect
`nlist`	Number of clusters. Heuristic: `~sqrt(N)` for N vectors.
`nprobe`	Clusters scanned per query. Higher = recall up, latency up.

Pgvector's ivfflat was the only ANN index before 0.5.0; in 0.5+ prefer HNSW unless memory is binding.
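A toy version of the idea helps show what nlist and nprobe actually control. The sketch below is not how any particular store implements IVF: it uses a few Lloyd iterations with a naive "first nlist vectors" initialisation, and all function names are made up for illustration.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(vectors, centroids):
    """Put each vector id into the cell of its nearest centroid."""
    cells = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        cells[nearest].append(idx)
    return cells

def build_ivf(vectors, nlist, iters=5):
    """Tiny k-means; returns (centroids, cells)."""
    centroids = [list(v) for v in vectors[:nlist]]  # naive init (assumption)
    dim = len(vectors[0])
    for _ in range(iters):
        cells = assign(vectors, centroids)
        for c, members in enumerate(cells):
            if members:  # keep the old centroid if a cell went empty
                centroids[c] = [sum(vectors[i][d] for i in members) / len(members)
                                for d in range(dim)]
    return centroids, assign(vectors, centroids)

def search_ivf(query, vectors, centroids, cells, nprobe, k):
    """Scan only the nprobe cells whose centroids are nearest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in cells[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]]
centroids, cells = build_ivf(vectors, nlist=2)
print(search_ivf([9.0, 9.0], vectors, centroids, cells, nprobe=1, k=2))
```

With nprobe=1 only half of this corpus is scanned; a query that sits near a cell boundary can miss true neighbours in the unscanned cell, which is where the lower recall ceiling comes from.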

DiskANN — for >100M vectors

DiskANN (Microsoft Research, used in Azure AI Search vector mode and Cosmos DB) keeps the graph on SSD with a small in-memory cache layer. Trade ~2× latency for ~10× memory savings vs HNSW. Perfect for corpus sizes where loading the full HNSW index into RAM is infeasible (e.g., a 500M-document enterprise search).

Quantization — the second-biggest lever

Method	Compression	Recall hit	Notes
Float32 (raw)	1×	0%	Baseline.
Float16	2×	<0.5%	Cheap win, broadly supported (pgvector exposes it as halfvec since 0.7).
Int8 SQ (Scalar Quantization)	4×	1-2%	Native in Qdrant and Azure AI Search.
Int4 SQ	8×	2-5%	Newer; support varies by store and version.
PQ (Product Quantization)	16-32×	2-7%	Heavily lossy; usually combined with rerank-via-raw.
Binary (bit-packed)	32×	5-15%	Aggressive; only with reranking.

The classic high-recall pipeline: search PQ-compressed vectors for top-200 candidates, then rerank by recomputing exact float32 distance for those 200. Best of both worlds — small index, near-raw recall.
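The two-stage shape of that pipeline can be sketched as follows. To keep the example short it swaps in symmetric int8 scalar quantization for PQ (the candidate-then-rerank structure is the same), uses dot product as the score, and assumes all vector components lie within `[-max_abs, max_abs]`. Function names are illustrative.

```python
def quantize_int8(vec, max_abs):
    """Symmetric scalar quantization: floats in [-max_abs, max_abs] -> ints in [-127, 127]."""
    scale = max_abs / 127.0
    return [max(-127, min(127, round(x / scale))) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_then_rerank(query, raw_vectors, quantized, max_abs, candidates=200, k=10):
    q_query = quantize_int8(query, max_abs)
    # Stage 1: cheap approximate scores on the compressed vectors.
    coarse = sorted(range(len(quantized)),
                    key=lambda i: dot(q_query, quantized[i]),
                    reverse=True)[:candidates]
    # Stage 2: exact float scores, computed only for the shortlisted ids.
    return sorted(coarse, key=lambda i: dot(query, raw_vectors[i]), reverse=True)[:k]

raw = [[0.9, 0.1], [0.7, 0.7], [-0.5, 0.5], [0.1, 0.9]]
compressed = [quantize_int8(v, max_abs=1.0) for v in raw]
top = search_then_rerank([1.0, 0.0], raw, compressed, max_abs=1.0, candidates=3, k=2)
print(top)
```

In a real deployment the raw vectors used in stage 2 can live on disk, since only a few hundred are touched per query.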

Memory budget math

N vectors × D dimensions × bytes-per-element + graph overhead

Setup	Per-vector	1M vectors	100M vectors
float32, dim=1536	6.0 KB	6.0 GB	600 GB
float16, dim=1536	3.0 KB	3.0 GB	300 GB
int8 SQ, dim=1536	1.5 KB	1.5 GB	150 GB
PQ-32, dim=1536	~190 B	~190 MB	19 GB

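Add graph overhead for HNSW edges: ~30-50% is a common planning figure, but edges cost the same number of bytes whether the vectors are float32 or quantized, so their share grows as the vectors shrink. Multiply by 2-3 if you replicate.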
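The formula is simple enough to keep as a helper next to your capacity plan. A minimal sketch, assuming decimal gigabytes and treating graph overhead as a flat fraction of raw vector size (the function name and parameters are illustrative):

```python
def index_memory_gb(n_vectors, dim, bytes_per_element, graph_overhead=0.0, replicas=1):
    """N x D x bytes-per-element, plus a fractional graph overhead, times replicas."""
    raw_bytes = n_vectors * dim * bytes_per_element
    return raw_bytes * (1 + graph_overhead) * replicas / 1e9

# Rows of the table above for 1M vectors at dim=1536.
print(index_memory_gb(1_000_000, 1536, 4))        # float32
print(index_memory_gb(1_000_000, 1536, 2))        # float16
print(index_memory_gb(1_000_000, 1536, 1))        # int8 SQ
print(index_memory_gb(1_000_000, 1536, 4 / 32))   # 32x product quantization
# float32 with 40% graph overhead, replicated twice.
print(index_memory_gb(1_000_000, 1536, 4, graph_overhead=0.4, replicas=2))
```

The table rounds 6.144 GB to 6.0 GB; for SKU selection the rounding matters less than remembering the overhead and replication multipliers.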

Pgvector specifics

Ships as the vector extension: CREATE EXTENSION vector;
Two index types: ivfflat (legacy) and hnsw (since 0.5.0 — pick this).
HNSW DDL: CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
Query-time ef: SET hnsw.ef_search = 100; per session.
IVF sizing: lists ≈ rows / 1000 for up to 1M rows, sqrt(rows) above that.
⚠️ ivfflat requires training data — build the index after the table is populated, not before.

Qdrant specifics

HNSW per collection; m and ef_construct are set in the collection's HNSW config, hnsw_ef per search request.
Payload indexes are critical: filtering on un-indexed fields forces a full scan. Always create_payload_index for tenant_id, status, etc.
Built-in scalar / product / binary quantization — toggle via collection config.
Filterable HNSW (Qdrant's "filterable index") prunes during graph traversal — much faster than post-filter.

Azure AI Search specifics

vector field with algorithm: "hnsw" or "exhaustiveKnn".
Vector + BM25 hybrid via search + vectorQueries in one call; RRF fusion built in.
Quantization: scalar (int8) and binary supported as compression.
DiskANN is the underlying tech for "Disk ANN" mode at large scale.
Per-replica HNSW; scale-out by partition + replica.

Filter-then-search vs search-then-filter

Pre-filter (preferred):              Post-filter (fragile):
+--------------------+               +--------------------+
| filter:            |               | vector search:     |
|   tenant=42        |               |   top 200          |
|   active=true      |               +--------------------+
+--------------------+                          |
          |                                     v
          v                          +--------------------+
+--------------------+               | filter             |
| vector search      |               |   tenant=42        |
|   over subset      |               |   active=true      |
+--------------------+               +--------------------+
          |                                     |
          v                                     v
       Top-K                              Maybe < K results!

Always pre-filter. Post-filtering can return fewer than K results, or zero, when the filter is selective.
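The failure mode is easy to reproduce on toy data. The sketch below uses a one-dimensional "embedding" (a single number per document) purely to keep it short; the document layout, field names, and both function names are hypothetical.

```python
def post_filter(query, docs, predicate, k, fetch):
    """Vector search first (top `fetch`), then drop hits that fail the filter."""
    nearest = sorted(docs, key=lambda d: abs(d["x"] - query))[:fetch]
    return [d["id"] for d in nearest if predicate(d)][:k]

def pre_filter(query, docs, predicate, k):
    """Filter first, then search only the matching subset."""
    allowed = [d for d in docs if predicate(d)]
    return [d["id"] for d in sorted(allowed, key=lambda d: abs(d["x"] - query))[:k]]

# 1000 docs; tenant 42 owns only every 100th one (a 1% selective filter).
docs = [{"id": i, "x": float(i), "tenant": 42 if i % 100 == 0 else 7}
        for i in range(1000)]

def is_tenant_42(d):
    return d["tenant"] == 42

print(post_filter(450.0, docs, is_tenant_42, k=3, fetch=20))  # may come back short
print(pre_filter(450.0, docs, is_tenant_42, k=3))
```

Raising `fetch` only papers over the problem: the more selective the filter, the larger the over-fetch needed to get K survivors.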

How it works under the hood

HNSW build

for each new vector v:
    assign random level L (geometric distribution)
    enter at top layer, greedy-walk to nearest neighbour at each layer
    at each layer ≤ L: connect v to up to M nearest neighbours
    prune over-connected nodes to keep degree bounded

Build cost is O(N · log N · efConstruction). Memory is O(N · M) edges plus the vectors.

HNSW search

start at entry point in top layer
at each layer:
    greedy-walk: jump to neighbour closer to query
descend to layer 0, run beam search with width=ef
return top-K from the beam

Search cost is O(log N · ef · D) for D-dimensional vectors.
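The layer-0 beam search is the part that ef controls. Below is a minimal sketch of that single step, assuming the graph is already built and given as an adjacency dict; the upper-layer descent, graph construction, and neighbour pruning are omitted, and the names are illustrative.

```python
import heapq

def search_layer(query, entry, neighbours, dist, ef):
    """Beam search over one graph layer; returns up to ef (distance, node) pairs, nearest first."""
    d0 = dist(query, entry)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    best = [(-d0, entry)]        # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break                # closest candidate is worse than the worst kept result
        for nb in neighbours[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(query, nb)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # evict the current worst
    return sorted((-neg_d, n) for neg_d, n in best)

# Toy graph: ten nodes on a line, each linked to its immediate neighbours.
positions = {i: float(i) for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if j in positions] for i in positions}
result = search_layer(6.2, entry=0, neighbours=graph,
                      dist=lambda q, n: abs(q - positions[n]), ef=3)
print([node for _, node in result])
```

A larger ef keeps more candidates alive, so the walk is less likely to stop in a local minimum, at the cost of more distance computations per query.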

Code: correct vs wrong

✅ Correct: Pgvector HNSW with `ef_search` per query

CREATE INDEX ON docs
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Per query, raise ef for higher recall on a hard query.
-- SET LOCAL only takes effect inside a transaction block, so wrap the query.
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT id, 1 - (embedding <=> $1) AS sim
FROM docs
WHERE tenant_id = $2 AND active     -- pre-filter
ORDER BY embedding <=> $1
LIMIT 10;
COMMIT;

❌ Wrong: building HNSW on an empty table

CREATE TABLE docs (id bigserial, embedding vector(1536));
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);  -- index exists before any rows
INSERT INTO docs ...   -- every insert now pays the incremental graph-update cost

Build the index after bulk-loading, or accept slow inserts.

✅ Correct: Qdrant with payload index + scalar quantization

// Types come from Qdrant.Client and Qdrant.Client.Grpc; verify names against your client version.
await client.CreateCollectionAsync(
    collectionName: "docs",
    vectorsConfig: new VectorParams { Size = 1536, Distance = Distance.Cosine },
    hnswConfig: new HnswConfigDiff { M = 32, EfConstruct = 256 },
    quantizationConfig: new QuantizationConfig
    {
        Scalar = new ScalarQuantization
        {
            Type = QuantizationType.Int8,
            AlwaysRam = true     // keep quantized vectors in RAM
        }
    });

// Index the field we filter on, or filtered queries collapse to scan.
await client.CreatePayloadIndexAsync("docs", "tenant_id", PayloadSchemaType.Keyword);

❌ Wrong: filter without payload index

// "tenant_id" has no payload index; this forces a full collection scan.
// MatchKeyword comes from `using static Qdrant.Client.Grpc.Conditions;`.
await client.SearchAsync("docs", queryEmb, filter: MatchKeyword("tenant_id", "acme"), limit: 10);

✅ Correct: Microsoft.Extensions.VectorData with provider hints

public sealed class Doc
{
    [VectorStoreKey] public Guid Id { get; set; }
    [VectorStoreData(IsIndexed = true)] public string TenantId { get; set; } = "";
    [VectorStoreData] public string Text { get; set; } = "";
    [VectorStoreVector(Dimensions: 1536, DistanceFunction = DistanceFunction.CosineSimilarity,
                       IndexKind = IndexKind.Hnsw)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}

IsIndexed = true flags fields the provider should secondary-index for filtering — the provider-agnostic equivalent of "create payload index".

Design patterns for this topic

Pattern 1 — "Quantize + rerank exact"

Intent: small index, raw-quality top-K.
Tactics: PQ-compressed search for top-200, then re-score the 200 with full float32 distance, keep top-10.

Pattern 2 — "Hot tier / cold tier"

Intent: keep low-latency budget on the 5% of vectors users actually hit.
Tactics: HNSW in-memory for "recent / important" partition; DiskANN for the long tail.

Pattern 3 — "Pre-filter via partition key"

Intent: multi-tenant isolation without hurting recall.
Tactics: one collection (or shard) per tenant, or use built-in tenant-aware filtering on tenant_id with a payload index.

Pattern 4 — "Versioned indexes for embedding-model upgrade"

Intent: swap embedding models without downtime.
Tactics: build docs_v2 alongside docs_v1; backfill; flip read traffic; drop v1 after a soak period.

Pattern 5 — "Raise `ef` per query class"

Intent: spend latency only where it matters.
Tactics: ef=64 for autocomplete, ef=256 for "show me everything we know about X". Per-call, not globally.

Pros & cons / trade-offs

Aspect	HNSW	IVF	DiskANN
Memory	High	Medium	Low
Build time	Slow	Fast	Slow
Insert/update cost	Cheap	Re-cluster expensive	Expensive
Recall ceiling (tuned)	~0.99	~0.95	~0.97
Best for	<100M vectors, latency-critical	Frequent rebuilds OK	>100M vectors, memory-bound

When to use / when to avoid

Tune (don't just raise hardware) when:
- p99 search latency exceeds your budget by >2×: you are paying for a misconfigured index.
- Recall@10 against your eval set is <0.85: you are missing relevant docs the embedding model knows about.
- Memory cost is dominating the line item: quantize before scaling out.

Skip the tuning rabbit hole when:
- Corpus is < 100k vectors: brute force in process and move on.
- Your retrieval bottleneck is upstream (chunking, embedding model, query phrasing); see Embedding Model Evaluation.
- You have no offline eval: you cannot measure what tuning bought you.

Interview Q&A

1. M vs ef in HNSW — when do you tune which? M is a build-time decision: it sets graph density and therefore memory and the recall ceiling. You change it rarely, and changing it means rebuilding the index. ef (or efSearch) is the runtime knob: it sets how many candidates are considered per query. You tune it per workload — ef=64 for cheap exploratory queries, ef=256 when you need >0.95 recall on the hard ones. Start with M=16-32 and sweep ef against your eval set first; only revisit M if ef cannot reach the recall you need.

2. When is IVF preferable to HNSW? Three cases. (a) Memory is binding and you cannot quantize for some reason: IVF's flat storage is leaner than HNSW's edge graph. (b) You re-cluster the corpus often (frequent bulk reload): IVF rebuilds faster than HNSW. (c) Legacy Pgvector pre-0.5: ivfflat was the only choice. In 2026 on modern stores, HNSW with int8 quantization beats IVF on almost every axis.

3. What is DiskANN and when do you reach for it? DiskANN is Microsoft Research's SSD-resident graph index, used by Azure AI Search and Cosmos DB at large scale. It keeps the graph on disk with a small in-memory cache, trading ~2× query latency for ~10× memory savings vs HNSW. Reach for it above ~100M vectors when loading the full HNSW into RAM is impractical or expensive.

4. PQ vs SQ vs binary quantization — what is the trade-off? Scalar quantization (int8) is the safe default — 4× smaller, <2% recall hit, supported everywhere. Product quantization splits vectors into sub-spaces and codebooks them — 16-32× smaller, 2-7% recall hit, usually combined with a re-rank-by-exact-distance step. Binary quantization (1 bit per dim) is 32× smaller and only works combined with reranking. Pick SQ as default; PQ when memory is binding; binary only for huge corpora with rerank.

5. Why does my filtered query become 100× slower? You filtered on a field with no secondary / payload index. The store cannot intersect the filter with the graph traversal, so it falls back to either post-filtering (scan top-N then drop most) or a full scan. Always create a payload/secondary index on the columns you filter on (tenant_id, status, created_at). In Pgvector this is a separate B-tree; in Qdrant it is create_payload_index; in Azure AI Search it is filterable: true on the field.

6. How do you size memory for a 10M-vector / 1536-dim corpus? Raw float32: 10M × 1536 × 4 bytes ≈ 60 GB. HNSW edges with M=32: ~5-10 GB more. Replicate ×2: 130-140 GB total. With int8 SQ the vectors shrink 4× to ~15 GB but the edges do not, so plan for ~20-25 GB per replica, or 40-50 GB replicated; that fits on a single mid-size box. With PQ-32 the vectors drop to ~2 GB (plus the same graph overhead), small enough for a laptop, but you need re-rank for quality. Always run the math before picking the SKU.

7. When do you reindex vs upsert? Upsert (incremental) when content changes but the embedding model and schema are stable — Pgvector, Qdrant, Azure AI Search all handle in-place updates fine. Reindex (rebuild from scratch) when you change the embedding model, the dimension, the distance function, or the index parameters that need a rebuild (M, efConstruction). The right pattern is versioned indexes — docs_v2 alongside docs_v1, backfill, swap reads atomically, retire v1 after a soak.

8. Pgvector lists for ivfflat: how do you size it? Heuristic: lists ≈ rows / 1000 for tables up to 1M rows, sqrt(rows) above that, with ivfflat.probes ≈ sqrt(lists) as a query-time starting point. But this is the legacy path. In Pgvector 0.5.0+ you should use HNSW unless memory really is binding. The HNSW knobs (m, ef_construction, hnsw.ef_search) are simpler to reason about and the recall ceiling is higher.

9. p99 over 100 ms on vector search: what would you blame first? In order: (1) ef/efSearch too high: drop it and re-measure recall, you may be over-paying. (2) No payload/secondary index on filter fields: most "slow vector search" is actually slow filtering. (3) Cold-cache misses: HNSW likes hot RAM; warm it on startup. (4) Embedding the user query at request time without caching identical queries. (5) Replicating across regions while the index lives in the wrong one. Vector ANN itself is rarely the actual bottleneck below 10M vectors.

10. How do you migrate from one embedding model to another without downtime? Versioned indexes. Stand up docs_v2 next to docs_v1, backfill v2 in batches (cost: re-embedding the corpus — non-trivial above 100M docs). Run shadow reads against both, compare recall on your eval set. When v2 wins, flip writes to v2 only, keep dual reads for a soak window, then retire v1. Plan for the embedding cost up front — re-embedding 100M docs at $0.0001 / 1k tokens with ~500 tokens per doc is ~$5k.

11. Filterable HNSW vs post-filter — does it matter? A lot, especially for selective filters. Naive post-filter takes top-N from the graph then drops most of them — if your filter selects 1% of docs and you take top-100, you get ~1 result. Filterable HNSW (Qdrant supports it; Azure AI Search has vectorFilterMode: preFilter) prunes during traversal so the top-K after filtering is actually K. Always check what your store does by default and configure pre-filter explicitly when available.

12. Why might the same query return different top docs across two shards? HNSW is approximate. Different shards have different graph structures, so the K nearest in their shard may not be the global K nearest. The fix is to over-fetch per shard and merge: ask each shard for top K * over_fetch_factor, merge, take global top-K. Most production stores do this for you with a tunable oversample parameter.
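The over-fetch-and-merge step from the last answer can be sketched like this. It assumes each shard is a callable returning `(distance, doc_id)` pairs sorted by distance; the shard functions and the `over_fetch` default are hypothetical, and production stores typically expose this as a built-in oversampling setting instead.

```python
import heapq

def merge_shard_results(shard_hits, k):
    """shard_hits: one [(distance, doc_id), ...] list per shard; returns the global top-k."""
    return heapq.nsmallest(k, (hit for hits in shard_hits for hit in hits))

def federated_search(shards, query, k, over_fetch=2):
    # Ask every shard for more than k so that an approximate miss on one
    # shard is less likely to push a true neighbour out of the merged list.
    per_shard = [shard(query, k * over_fetch) for shard in shards]
    return merge_shard_results(per_shard, k)

# Two fake shards with canned results (hypothetical data).
def shard_a(query, n):
    return [(0.10, "a1"), (0.40, "a2"), (0.50, "a3")][:n]

def shard_b(query, n):
    return [(0.20, "b1"), (0.30, "b2"), (0.90, "b3")][:n]

merged = federated_search([shard_a, shard_b], query=None, k=2)
print(merged)
```

Merging by distance only works if every shard uses the same distance function and the same embedding model version.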

Gotchas / common mistakes

⚠️ Building HNSW before bulk-loading. The graph is empty; every insert pays full graph cost. Bulk-load first, index second.
⚠️ Filtering on un-indexed fields. Vector search collapses to a full scan; latency 100×.
⚠️ Tuning ef globally. Different query classes need different ef; set per-call.
⚠️ Quantizing without measuring recall on your eval set. Default int8 is usually safe; PQ-32 may quietly drop recall 5-10% on your domain.
⚠️ Forgetting graph overhead in memory math. Add ~30-50% on top of raw vector size for HNSW edges.
⚠️ Mismatched distance functions across components. Embedding model trained for cosine, but you indexed with L2 → silently degraded recall.
⚠️ No version segment in your index name. When the embedding model changes, you cannot run two indexes side-by-side, so you cannot migrate without downtime.
⚠️ Over-replicating an in-memory index. 3 replicas × 60 GB index = 180 GB committed. Use DiskANN or quantize.
⚠️ Not warming the index after deploy. First queries on a cold container hit disk; p99 spikes for ~5 minutes.
⚠️ Trusting blog-post benchmarks. Always measure on your data, your filters, your hardware.
