Store: Azure AI Search

Key Points

Azure AI Search (formerly Cognitive Search) — managed search engine on Azure.
Hybrid search: vector + keyword (BM25) + filtering in a single query; a strong default for retrieval quality.
Semantic ranker: Microsoft's reranker baked in.
Tight integration with Microsoft.Extensions.VectorData.
Cost: tiered (Free, Basic, Standard, Storage Optimized). Billed per search unit (replicas × partitions), with storage limits set by the tier.

Setup

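using Azure.Identity;
using Azure.Search.Documents.Indexes;
using Microsoft.SemanticKernel.Connectors.AzureAISearch;

// Authenticate with Entra ID (managed identity in Azure, developer credentials locally).
var searchClient = new SearchIndexClient(
    new Uri("https://my-search.search.windows.net"),
    new DefaultAzureCredential());

var collection = new AzureAISearchCollection<string, MyDocument>(searchClient, "my-index");

// Creates the index from the record definition if it does not exist yet.
await collection.EnsureCollectionExistsAsync();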

Record

public class MyDocument
{
    [VectorStoreKey]
    public string Id { get; set; } = "";

    [VectorStoreData(IsIndexed = true, IsFullTextIndexed = true)]
    public string Title { get; set; } = "";

    [VectorStoreData(IsFullTextIndexed = true)]
    public string Content { get; set; } = "";

    [VectorStoreData(IsIndexed = true)]
    public string Category { get; set; } = "";

    [VectorStoreVector(Dimensions: 1536, DistanceFunction = DistanceFunction.CosineSimilarity)]
    public ReadOnlyMemory<float>? Embedding { get; set; }
}

IsFullTextIndexed = true enables BM25 keyword search alongside vector.

Hybrid search

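// HybridSearchAsync takes the query vector, a keyword list, the result count, then options.
var hybridResults = collection.HybridSearchAsync(
    queryEmbedding,
    userQuery.Split(' ', StringSplitOptions.RemoveEmptyEntries),
    10,
    new HybridSearchOptions<MyDocument>
    {
        Filter = d => d.Category == "Engineering",
        // Title and Content are both full-text indexed, so name the keyword target explicitly.
        AdditionalProperty = d => d.Content
    });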

await foreach (var r in hybridResults)
    Console.WriteLine($"{r.Record.Title}: {r.Score}");

Vector and BM25 results are fused with reciprocal rank fusion (RRF). A strong general-purpose retrieval default.
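
The fusion step can be illustrated in isolation. The following is a minimal sketch of reciprocal rank fusion, assuming 1-based ranks and the constant k = 60; the `Fuse` helper and the sample document ids are hypothetical. The service performs this step internally, so this is for intuition only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reciprocal rank fusion: score(d) = sum over result lists of 1 / (k + rank of d in that list).
// Ranks are 1-based; a document missing from a list contributes nothing for that list.
static Dictionary<string, double> Fuse(IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in rankedLists)
    {
        for (int i = 0; i < list.Count; i++)
        {
            scores.TryGetValue(list[i], out var current); // current is 0 when the id is new
            scores[list[i]] = current + 1.0 / (k + i + 1);
        }
    }
    return scores;
}

// Hypothetical ranked ids from the keyword arm and the vector arm.
var keywordHits = new[] { "doc-a", "doc-b", "doc-c" };
var vectorHits = new[] { "doc-a", "doc-c", "doc-d" };

var fused = Fuse(new IReadOnlyList<string>[] { keywordHits, vectorHits });

foreach (var pair in fused.OrderByDescending(p => p.Value))
    Console.WriteLine($"{pair.Key}: {pair.Value:F4}");
```

A document ranked first by both arms scores 2/61, roughly 0.033, which is why fused scores look small even for excellent matches.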

Semantic ranker

Azure AI Search's built-in semantic ranker rescores the top results with a language-understanding model adapted from Bing:

new SearchOptions
{
    QueryType = SearchQueryType.Semantic,
    SemanticSearch = new() { SemanticConfigurationName = "default" }
};

Improves top-K quality further. It is a premium, usage-billed feature on top of the service tier, not a tier of its own.

Faceted filtering

new SearchOptions
{
    Facets = { "Category", "Author" }
};

Returns counts per facet — populate filter UI.

Index design

Searchable: full-text search.
Filterable: equality / range filters.
Sortable: ordering ($orderby).
Facetable: aggregations.

[VectorStoreData(IsIndexed = true)]   // filterable
[VectorStoreData(IsFullTextIndexed = true)]   // searchable

Tiers

Tier	Notes
Free	Demos
Basic	Small prod
S1, S2, S3	Standard production
Storage Optimized (L1, L2)	Large datasets

Vector compression

Azure AI Search supports scalar and binary quantization, plus dimension truncation for Matryoshka (MRL) embeddings. All of them reduce storage and memory at some cost in recall.
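
To make the storage-versus-quality trade concrete, here is a rough sketch of scalar quantization: each float32 component is mapped linearly onto one signed byte and then restored. The `Quantize` and `Dequantize` helpers and the fixed [-1, 1] range are assumptions for illustration; the service's own quantizer is internal and may differ.

```csharp
using System;
using System.Linq;

// Map each float component linearly onto the 256 levels of a signed byte (4 bytes -> 1 byte).
static sbyte[] Quantize(float[] v, float min, float max)
{
    float scale = (max - min) / 255f;
    return v.Select(x => (sbyte)(Math.Round((Math.Clamp(x, min, max) - min) / scale) - 128)).ToArray();
}

// Approximate inverse of Quantize; the rounding error it cannot undo is what costs recall.
static float[] Dequantize(sbyte[] q, float min, float max)
{
    float scale = (max - min) / 255f;
    return q.Select(b => (b + 128) * scale + min).ToArray();
}

var original = new float[] { -1f, -0.25f, 0f, 0.5f, 1f };
var packed = Quantize(original, -1f, 1f);
var restored = Dequantize(packed, -1f, 1f);

// Worst-case per-component error is about half a quantization step.
float maxError = original.Zip(restored, (a, b) => Math.Abs(a - b)).Max();
Console.WriteLine($"bytes: {original.Length * sizeof(float)} -> {packed.Length}, max error: {maxError}");
```

The per-component error stays below half a quantization step (about 0.004 for this range), which is why rescoring against the original vectors recovers most of the lost recall.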

Indexer (built-in ingestion)

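Azure AI Search has built-in indexers for Blob Storage, Azure SQL, Cosmos DB, and other sources, with incremental change detection. Indexers are data-plane objects, created through the REST API or SDK rather than as ARM/Bicep resources:

var indexerClient = new SearchIndexerClient(uri, new DefaultAzureCredential());
await indexerClient.CreateOrUpdateIndexerAsync(new SearchIndexer("blob-indexer", "blob-datasource", "my-index"));   // hypothetical indexer and data source names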

For most apps: do ingestion in your code (more control). Use indexer for simple cases.

Auth

API key (admin/query).
Managed identity (preferred).

new SearchIndexClient(uri, new DefaultAzureCredential())

Pricing

Search units (compute).
Storage.
Documents (limit).

For 1M chunks: Standard S1 is typical (~$250/mo per search unit; check current regional pricing).
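
A quick back-of-envelope check on the 1M-chunk figure, as a sketch: it counts only the raw vector data for 1536-dimension embeddings and ignores the HNSW graph, text fields, and other index overhead. The `VectorGiB` helper is hypothetical.

```csharp
using System;

// Raw size of the vector data alone: count * dimensions * bytes per component.
static double VectorGiB(long vectors, int dimensions, int bytesPerComponent) =>
    vectors * (double)dimensions * bytesPerComponent / (1024.0 * 1024 * 1024);

double float32 = VectorGiB(1_000_000, 1536, 4); // full precision
double int8 = VectorGiB(1_000_000, 1536, 1);    // scalar-quantized

Console.WriteLine($"float32: {float32:F2} GiB, int8: {int8:F2} GiB");
```

That works out to roughly 5.7 GiB at full precision and 1.4 GiB after scalar quantization; compare the result against the vector quota of the tier you are considering.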

Query path (hybrid + semantic ranker)

 user query (string)
        │
        ├──► IEmbeddingGenerator ──► float[1536]  ┐
        │                                         │
        └──► tokenizer / analyzer (BM25)          │
                       │                          │
                       ▼                          ▼
              keyword query (top 50)    vector query (k=50, HNSW)
                       │                         │
                       └───────► RRF fuse ◄──────┘
                                  │  (k=60 constant)
                                  ▼
                        Fused L1 top-50 candidates
                                  │
                                  ▼
                Semantic ranker (L2, Bing model)
                - rerankerScore 0.00 - 4.00
                - captions + answers + highlights
                                  │
                                  ▼
                     top-K (typ. 3-5) ──► LLM prompt

IsFullTextIndexed = true populates the BM25 side; [VectorStoreVector] feeds the HNSW side. Filters (IsIndexed = true) are applied pre- or post-vector depending on vectorFilterMode.

Index field anatomy

+--------------------------------------------------------------+
| Index: docs-v3                                               |
+----------------+--------+--------+--------+--------+---------+
| Field          | search | filter | facet  | sort   | vector  |
+----------------+--------+--------+--------+--------+---------+
| Id   (key)     |   -    |   Y    |   -    |   -    |    -    |
| Title          |   Y    |   Y    |   -    |   Y    |    -    |
| Content        |   Y    |   -    |   -    |   -    |    -    |
| Category       |   -    |   Y    |   Y    |   -    |    -    |
| TenantId       |   -    |   Y    |   -    |   -    |    -    |
| UpdatedAt      |   -    |   Y    |   -    |   Y    |    -    |
| Embedding      |   -    |   -    |   -    |   -    |  1536f  |
+----------------+--------+--------+--------+--------+---------+

Each enabled attribute (Y) costs storage and write amplification, so index only what you will actually query.

Pros & cons

Aspect	Verdict
Hybrid + semantic ranker quality	Strongest managed option on Azure; Microsoft's published benchmarks favor it over vector-only
Operational maturity	GA, SLA, Entra ID, Private Link, diagnostics
Indexer + integrated vectorization	Pull-mode ingestion from Blob/Cosmos/SQL out of the box
Cost floor	Free tier is not suitable for production; Basic or higher is required even at small scale
Vendor lock-in	Azure-only; no portable export of HNSW graph
Schema rigidity	Index changes often require rebuild; plan a versioned alias

When to use / when to avoid

Use when:
- You are already on Azure and need Entra ID / Private Link / RBAC.
- You need hybrid (BM25 + vector) with a managed reranker out of the box.
- Multi-tenant SaaS with index-per-tenant or partition-key filtering.
- Compliance demands an audited PaaS rather than self-hosted Qdrant/Weaviate.

Avoid when:
- Tiny corpora (< 50k chunks): pgvector or in-process FAISS is cheaper.
- You need exotic vector ops (sparse-only, multi-vector ColBERT) not yet GA.
- You want a single OLTP+vector store: Cosmos DB or Postgres is simpler.
- Hard cost ceiling under ~$80/mo: Basic tier is the realistic floor.

Interview Q&A

1. When is hybrid search strictly better than vector-only? When queries contain exact tokens the embedding model under-weights — IDs, SKUs, error codes, statute references, code symbols. BM25 nails those; the vector arm catches paraphrase. Microsoft's own benchmarks show hybrid + semantic ranker beating pure vector on most enterprise corpora. The cost is one extra parallel query and an RRF merge — effectively free.

2. Semantic ranker vs Cohere Rerank — which would you pick on Azure? Default to the built-in semantic ranker: it ships with the service, is billed per query unit, runs in-region (data residency, no egress), and is trained on the Bing corpus with broad multilingual coverage. Reach for Cohere rerank-multilingual-v3.0 only when you have measured a quality gap on your eval set, need cross-lingual reranking the semantic ranker handles poorly, or want to stay portable across vector stores.

3. Indexer (pull) vs custom push ingestion — when do you choose which? Use indexers when the source is Blob/Cosmos/SQL/OneLake, change-tracking fits the built-in watermark, and you can express transformations as skillsets. Use push ingestion (UpsertAsync from your own pipeline) when you need streaming, exactly-once semantics, custom chunking, dead-letter queues, or feature-flagged rollouts. Most production RAG I have shipped ends up push-based because chunking strategy is the moat.

4. How does scalar vs binary quantization change the trade-off? Scalar (int8) gives ~4x reduction with negligible recall loss — turn it on by default. Binary (1 bit) gives up to ~28x but only behaves well on dimensions ≥ 1024 and embeddings centered at zero (OpenAI text-embedding-3, Cohere). Both want oversampling + rescore against the original full-precision vector to recover recall — keep defaultOversampling ≥ 4 unless you measured otherwise.

5. Multi-tenant: index-per-tenant, field filter, or service-per-tenant? Three valid patterns. Index-per-tenant gives the strongest isolation and per-tenant schema evolution but caps at the service's index limit. A shared index with TenantId as a filterable field scales to millions of tenants and is cheapest, but you must enforce the filter in every query (security trimming) and tune for hot tenants. Service-per-tenant is for regulated workloads needing physical isolation. Default to shared-index + filter; promote noisy neighbours to their own index.

6. What is your reindex / freshness strategy when the embedding model changes? Treat the embedding model version as part of the index identity. Build docs-v3-2026-04 alongside the live docs-v2-2026-01, dual-write during backfill, swap an alias (or the index name in app configuration) at the end, then drop the old index. Never reuse a field name with mixed embedding versions, because vector spaces are not comparable across models. For trickle freshness, push updates keyed on a version/etag field; if you ingest through an indexer instead, rely on its high-water-mark change detection.

7. How do you secure an Azure AI Search index in 2026 best-practice? Disable API keys, enable RBAC-only (Search Service Contributor, Search Index Data Contributor, Search Index Data Reader), connect via DefaultAzureCredential with a user-assigned managed identity, put the service behind Private Link, and use document-level access control via a filterable acl field populated by your auth tier. Audit through Diagnostic Settings to Log Analytics.

8. RRF score is 0.03 — is the result bad? No. RRF scores are bounded by 1/(k+rank) summed across query arms, so they look tiny next to BM25 (unbounded) or HNSW cosine (0.33-1.0). Compare RRF scores only against other RRF scores from the same query. To present to users or to threshold, use @search.rerankerScore (0.00-4.00) from the semantic ranker — that range is calibrated.

9. When does the semantic ranker hurt rather than help? On highly structured, code-like, or numeric content the L2 model hasn't seen — log lines, financial tables, rare-language docs. It also caps inputs at ~8,960 tokens, so very long fields get truncated. And it only reranks the top 50 from L1, so if your L1 retrieval misses the gold doc, L2 can't recover it — invest in recall@50 first.

10. Vector compression: when is it a footgun? Binary quantization on small dimensions (< 1024) or non-zero-centered embeddings can drop recall by 10-20%. Scalar is safer. The non-obvious one: rescoreStorageMethod = discardOriginals saves ~50% storage but permanently removes the full-precision vectors, so rescoring against originals cannot be turned back on without a full reindex. For binary, discardOriginals is usually fine (rescoring falls back to a dot product over the binary vectors); for scalar, keep originals.

11. How would you design an A/B test between two retrieval pipelines? Build a labeled eval set first (queries + ideal doc IDs from your domain experts). Run both pipelines headless, compute recall@10, MRR, nDCG, and ranker-agreement. Then ship behind a flag, log the served pipeline per query, and watch downstream signals — answer faithfulness, click-through, follow-up rate. Never trust offline metrics alone; never trust online metrics on < ~500 queries.
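
The offline half of that A/B comparison can be sketched with two of the metrics named above. This is a minimal illustration, assuming a labeled set of relevant document ids per query; the `RecallAtK` and `ReciprocalRank` helpers and the sample ids are hypothetical, and MRR is the mean of `ReciprocalRank` over all eval queries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Fraction of the relevant ids that appear in the first k results.
static double RecallAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k) =>
    relevant.Count == 0 ? 0.0 : ranked.Take(k).Count(id => relevant.Contains(id)) / (double)relevant.Count;

// 1 / rank of the first relevant result, or 0 when none was retrieved.
static double ReciprocalRank(IReadOnlyList<string> ranked, ISet<string> relevant)
{
    for (int i = 0; i < ranked.Count; i++)
        if (relevant.Contains(ranked[i])) return 1.0 / (i + 1);
    return 0.0;
}

// One labeled query and the ranked output of two hypothetical pipelines.
var relevant = new HashSet<string> { "doc-2", "doc-7" };
var pipelineA = new[] { "doc-9", "doc-2", "doc-4", "doc-7" };
var pipelineB = new[] { "doc-2", "doc-5", "doc-6", "doc-8" };

Console.WriteLine($"A: recall@3={RecallAtK(pipelineA, relevant, 3)}, RR={ReciprocalRank(pipelineA, relevant)}");
Console.WriteLine($"B: recall@3={RecallAtK(pipelineB, relevant, 3)}, RR={ReciprocalRank(pipelineB, relevant)}");
```

Here both pipelines find one of the two relevant documents in the top 3, but B places it first, so the metrics disagree about which is better. That is the reason to track more than one.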

Gotchas / common mistakes

Forgetting IsFullTextIndexed = true on the field you want BM25 on — hybrid silently degrades to vector-only on that arm.
Mutating an index schema in place. Most field-attribute changes require a rebuild. Always work behind an alias and a versioned index name.
Putting tenant filtering only in the client. A bug in one query path leaks data across tenants. Enforce at the index layer with a security-trim filter and a unit test.
Using API keys in production. Switch the service to RBAC-only and authenticate with DefaultAzureCredential; rotate any leaked admin key immediately.
Ignoring how maxTextRecallSize and top interact with the semantic ranker. L2 needs ~50 candidates from L1 to do its job; setting top: 5 with no k override starves it.
Mixing embedding model versions in the same vector field. The space is incoherent and recall collapses.
Treating RRF scores as probabilities. They are not calibrated and not comparable across queries. Threshold on rerankerScore instead.
Skipping eval. "Hybrid + semantic" is a strong default, not a guarantee — run Ragas / your own labeled set on every pipeline change.
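
The tenant-filter gotcha above is easier to enforce when every query goes through one helper. A minimal sketch, assuming a filterable TenantId field and OData filter strings; `TenantClause` and `SecureFilter` are hypothetical helpers, and the caller-supplied filter is assumed to come from trusted code, not raw user input.

```csharp
using System;

// Build the mandatory tenant clause. Single quotes are doubled,
// which is how OData string literals escape them.
static string TenantClause(string tenantId)
{
    string escaped = tenantId.Replace("'", "''");
    return $"TenantId eq '{escaped}'";
}

// Always AND the tenant clause with whatever filter the caller supplied.
static string SecureFilter(string tenantId, string callerFilter = "")
{
    return string.IsNullOrWhiteSpace(callerFilter)
        ? TenantClause(tenantId)
        : $"({TenantClause(tenantId)}) and ({callerFilter})";
}

Console.WriteLine(SecureFilter("t-1"));
Console.WriteLine(SecureFilter("o'brien", "Category eq 'Engineering'"));
```

The resulting string would be assigned to the Filter property of SearchOptions. A unit test that asserts the tenant clause is present in every generated filter covers the security-trim requirement.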

Senior considerations

Default to hybrid + semantic ranker for quality.
Index only the fields you will filter, sort, or facet on; every extra attribute adds write cost.
Replicas for read scale; partitions for storage.
Keep embeddings in sync with the index: re-index whenever the embedding model changes.
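
Replicas and partitions multiply into billable search units, which a few lines of arithmetic make concrete. This is a sketch only: the per-unit price is an assumed placeholder based on the rough S1 figure in the Pricing section, and `SearchUnits` is a hypothetical helper.

```csharp
using System;

// Billing is per search unit, and search units = replicas * partitions.
static int SearchUnits(int replicas, int partitions) => replicas * partitions;

// Assumed price per search unit per month; substitute the current figure for your tier and region.
const decimal assumedPricePerUnit = 250m;

int units = SearchUnits(replicas: 3, partitions: 2);
decimal monthly = units * assumedPricePerUnit;

Console.WriteLine($"{units} search units, about {monthly} USD/month");
```

Adding a replica for read scale or a partition for storage therefore raises the bill multiplicatively, not additively.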

Comparison

Store	Pros	Cons
Azure AI Search	Hybrid + semantic ranker; Azure-native	Azure-only; cost
Cosmos DB	Same DB as app	Less hybrid maturity
Qdrant	Open source; fast	Self-hosted or a separate managed service to operate and secure
pgvector	Postgres-native	No built-in hybrid fusion or reranker; filtered ANN needs tuning

For Azure shops + enterprise RAG: Azure AI Search is the default.