AI Evaluation Harness

Key Points

  • 💡 Shipping LLM features without an eval harness is shipping unverified code. Unit tests do not cover whether the answer is correct, grounded, or safe — only an eval harness does.
  • Microsoft.Extensions.AI.Evaluation is the official Microsoft library (GA in 2025+). It standardizes the data-set / evaluator / reporter triple on top of IChatClient so any provider works.
  • Evaluators score a single response on one dimension: RelevanceEvaluator, CoherenceEvaluator, GroundednessEvaluator, FluencyEvaluator, EquivalenceEvaluator, RetrievalEvaluator, plus your own.
  • LLM-as-judge is the dominant scoring technique for open-ended generation. Use a different model from the one under test — self-evaluation is positively biased.
  • Reference-based metrics (BLEU, ROUGE, exact-match, EquivalenceEvaluator) need a gold answer. Reference-free metrics (Groundedness, Coherence) do not — they grade against the prompt and retrieved context.
  • CI integration is the point. Run eval on every PR; fail the build when a metric regresses by more than a threshold against baseline.
  • ⚠️ Eval is not free. Each LLM-judge call costs tokens. Sample, parallelize, cache, and run the heavy suite nightly.

Concepts (deep dive)

The eval triple: dataset, evaluator, reporter

+-------------+       +-----------+       +----------+
|  Dataset    | --->  | Evaluator | --->  | Reporter |
| (inputs +   |       | (scores   |       | (writes  |
|  expected)  |       |  one      |       |  to disk |
|             |       |  metric)  |       |  / CI)   |
+-------------+       +-----------+       +----------+

A dataset is a list of (input, expected_output_or_rubric) pairs. An evaluator runs the system under test against each input and emits a numeric/categorical score per metric. A reporter collates results, persists them, and renders a delta against the previous run.
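To make the triple concrete, here is a minimal, library-free sketch. All type and member names (EvalExample, Score, TripleSketch) are hypothetical and exist only for illustration; the system under test is a stub, and the previous-run average is an assumed value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Dataset: (input, expected) pairs.
var dataset = new List<EvalExample>
{
    new("What is 2 + 2?", "4"),
    new("Capital of France?", "Paris"),
};

// System under test: a stub standing in for the real model call.
Func<string, string> sut = input => input.Contains("2 + 2") ? "4" : "Lyon";

// Evaluator: one metric, one score per example.
var scores = dataset
    .Select(ex => TripleSketch.ExactMatch(ex, sut(ex.Input)))
    .ToList();

// Reporter: collate the scores and render a delta against the previous run.
double previousAverage = 1.0; // assumed baseline from an earlier run
Console.WriteLine(TripleSketch.Report(scores, previousAverage));

// Hypothetical types, not part of any library.
public sealed record EvalExample(string Input, string Expected);
public sealed record Score(string Metric, double Value);

public static class TripleSketch
{
    public static Score ExactMatch(EvalExample ex, string actual) =>
        new("exact_match",
            string.Equals(ex.Expected, actual.Trim(), StringComparison.OrdinalIgnoreCase) ? 1.0 : 0.0);

    public static string Report(IReadOnlyCollection<Score> scores, double previousAverage)
    {
        double avg = scores.Average(s => s.Value);
        return FormattableString.Invariant(
            $"exact_match avg={avg:F2} delta={avg - previousAverage:+0.00;-0.00}");
    }
}
```

In a real harness the stub becomes a call to the model, the evaluator becomes one of the library's evaluators, and the reporter persists results instead of printing them.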

The flavours of dataset

Type | Source | Size | Use
Golden set | Curated by humans | ~50-500 | The hard, hand-picked truth. Run on every PR.
Production replay | Sampled from real traffic | 1k-10k | Reflects user distribution; weekly run.
Adversarial / red-team | Hand-crafted attack prompts | 50-500 | Jailbreaks, prompt injection, PII leaks. Pre-release gate.
Synthetic | Generated by an LLM | ~unbounded | Cheap coverage, useful for new feature scaffolding only.

The classic mistake is shipping with only synthetic data. When the model that wrote the questions is also the one being graded, its biases are baked into the scores.

Reference-based vs reference-free

  • Reference-based: you have the right answer. BLEU/ROUGE for translation/summarization, exact-match for classification, EquivalenceEvaluator for "does this say the same thing as the gold answer".
  • Reference-free: open-ended generation. GroundednessEvaluator ("is the answer supported by the context?"), CoherenceEvaluator ("is it well-formed?"), RelevanceEvaluator ("does it address the question?"). The judge is another LLM with a rubric.

For RAG pipelines you almost always want both: groundedness + relevance (reference-free) plus equivalence-against-gold for the curated subset.

Self-evaluation is biased

If GPT-4o generates the answer and GPT-4o judges it, the score is inflated by 3-7 percentage points on most rubrics — the model recognizes its own style. Always separate generator and judge: generate with gpt-4o-mini, judge with gpt-4o (or vice-versa, or use Claude / Gemini as a third opinion). Some teams ensemble three judges and majority-vote.

Built-in evaluators in Microsoft.Extensions.AI.Evaluation

Evaluator | What it scores | Reference needed?
RelevanceEvaluator | Did the answer address the question? | No
CoherenceEvaluator | Logical flow, grammar, structure | No
FluencyEvaluator | Linguistic quality | No
GroundednessEvaluator | Is the answer supported by the supplied context? | No (needs context)
EquivalenceEvaluator | Does the answer match the ground truth? | Yes
RetrievalEvaluator | Are the retrieved chunks relevant to the query? | No (needs the retrieved chunks)
Your custom IEvaluator | Domain-specific: tone, schema, refusal-correctness, etc. | Either

Reference-based metrics: BLEU, ROUGE, exact-match

Outside Microsoft.Extensions.AI.Evaluation you still reach for classic NLP metrics for tasks where they make sense:

Metric | What it measures | Use
Exact match | String equality (after normalization) | Classification labels, structured extraction
BLEU | n-gram overlap precision (penalizes brevity) | Machine translation
ROUGE-L | Longest common subsequence recall | Summarization with reference summaries
F1 token overlap | Harmonic mean of token-level precision and recall | Span-based QA

These are deterministic and cheap (no LLM call), but they grade surface form, not meaning. A perfect paraphrase of the gold answer scores zero on exact-match. Combine with LLM-judge for open-ended generation.
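The two cheapest metrics in the table can be sketched in a few lines. This is a minimal illustration, assuming normalization means lowercasing and collapsing whitespace (punctuation is left alone); the class and method names are hypothetical.

```csharp
using System;
using System.Linq;

Console.WriteLine(ReferenceMetrics.ExactMatch("  Paris ", "paris"));
Console.WriteLine(ReferenceMetrics.ExactMatch("The capital is Paris", "Paris"));
Console.WriteLine(ReferenceMetrics.TokenF1("the cat sat", "the cat sat down"));

public static class ReferenceMetrics
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Lowercase and split on whitespace; empty tokens are dropped.
    private static string[] Tokens(string s) =>
        s.ToLowerInvariant().Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);

    // String equality after normalization.
    public static bool ExactMatch(string predicted, string gold) =>
        Tokens(predicted).SequenceEqual(Tokens(gold));

    // Harmonic mean of token-level precision and recall, counting
    // each gold token at most as often as it appears in the gold answer.
    public static double TokenF1(string predicted, string gold)
    {
        var pred = Tokens(predicted);
        var goldTokens = Tokens(gold);
        if (pred.Length == 0 || goldTokens.Length == 0)
            return pred.Length == goldTokens.Length ? 1.0 : 0.0;

        var remaining = goldTokens.GroupBy(t => t).ToDictionary(g => g.Key, g => g.Count());
        int overlap = 0;
        foreach (var token in pred)
        {
            if (remaining.TryGetValue(token, out int n) && n > 0)
            {
                overlap++;
                remaining[token] = n - 1;
            }
        }
        if (overlap == 0) return 0.0;

        double precision = (double)overlap / pred.Length;
        double recall = (double)overlap / goldTokens.Length;
        return 2 * precision * recall / (precision + recall);
    }
}
```

The second call shows the surface-form problem: a correct but longer answer fails exact match outright, while token F1 still gives it partial credit.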

Latency and cost budget

+-----------------------------------------------+
| Per eval run on 100-entry golden set          |
+-----------------+-----------+-----------------+
| Stage           | Tokens    | Wall clock      |
+-----------------+-----------+-----------------+
| Generate (SUT)  | 50k in    | 60-120 s        |
|                 | 30k out   | (parallel x 8)  |
| Judge x 4       | 100 calls | 90-180 s        |
| evaluators      | each      | (parallel x 8)  |
| Write report    | -         | <1 s            |
+-----------------+-----------+-----------------+
| Total cost      | ~$0.50-2  | per PR          |
+-----------------+-----------+-----------------+

Parallelism is the lever — Parallel.ForEachAsync with a concurrency cap of 4-8 keeps the judge endpoint happy and slashes wall clock 5×.
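A minimal sketch of the capped fan-out, using Parallel.ForEachAsync from .NET 6 or later. FakeJudge is a hypothetical stand-in for a real judge call; the delay only simulates network latency.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var inputs = Enumerable.Range(1, 20).ToArray();
var scores = new ConcurrentDictionary<int, double>();

// Cap concurrency so the judge endpoint is not flooded.
var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

await Parallel.ForEachAsync(inputs, options, async (input, ct) =>
{
    scores[input] = await FakeJudge.ScoreAsync(input, ct);
});

Console.WriteLine($"Scored {scores.Count} examples");

public static class FakeJudge
{
    // Hypothetical judge: returns a score between 1 and 5 after a short delay.
    public static async Task<double> ScoreAsync(int input, CancellationToken ct)
    {
        await Task.Delay(50, ct);
        return 1 + input % 5;
    }
}
```

Results go into a ConcurrentDictionary because several iterations write at the same time; a plain List would not be safe here.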


How it works under the hood

+----------------------------------------------------------+
|                    EvaluationContext                      |
|  - input messages                                         |
|  - retrieved context (for groundedness)                   |
|  - expected output (for equivalence)                      |
|  - response under test                                    |
+----------------------------------------------------------+
              |               |              |
              v               v              v
       +-----------+   +-------------+  +--------------+
       | Relevance |   | Groundedness|  | Equivalence  |
       | Evaluator |   |  Evaluator  |  |  Evaluator   |
       | (LLM-judge|   | (LLM-judge  |  | (LLM-judge   |
       |  rubric)  |   |  rubric)    |  |  + gold)     |
       +-----------+   +-------------+  +--------------+
              |               |              |
              v               v              v
       Score: 4/5      Score: 3/5      Score: PASS
              |               |              |
              +-------+-------+--------------+
                      v
               +-------------+
               |  Reporter   |
               |  - JSON     |
               |  - HTML     |
               |  - CI gate  |
               +-------------+

Each built-in LLM-based evaluator works the same way:

  1. Receives the judge IChatClient you supply through ChatConfiguration.
  2. Builds a rubric prompt ("Rate the relevance of the response on a 1-5 scale where 1 is...").
  3. Asks the judge for a score together with its reasoning, in a format the evaluator can parse.
  4. Returns an EvaluationResult whose metrics (for example a NumericMetric) carry the score and the judge's stated reasoning as evidence.

The library uses IChatClient for the judge, so you swap providers, add caching, or pipe through OpenTelemetry the same way you would for production traffic. (See Telemetry & Caching and Structured Output Validation.)


Code: correct vs wrong

✅ Correct: end-to-end eval pipeline

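using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;
using OpenAI;

// Assumed to exist elsewhere: judgeKey, sut (the IChatClient under test), and
// goldenSet, whose items expose Id, Input, Expected and RetrievedContext as strings.

// Judge: a different model from the one under test.
IChatClient judge = new OpenAIClient(judgeKey)
    .GetChatClient("gpt-4o").AsIChatClient();

IEvaluator[] evaluators =
[
    new RelevanceEvaluator(),
    new GroundednessEvaluator(),
    new CoherenceEvaluator(),
    new EquivalenceEvaluator()
];

// The reporting configuration ties together evaluators, judge and an on-disk result store.
// Verify these signatures against the package version you pin.
ReportingConfiguration reporting = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./eval-results",
    evaluators: evaluators,
    chatConfiguration: new ChatConfiguration(judge),
    enableResponseCaching: true,
    executionName: "pr-1234");

// Run the system under test on each example, then score it with every evaluator.
var results = new List<EvaluationResult>();
foreach (var example in goldenSet)
{
    var messages = new List<ChatMessage> { new(ChatRole.User, example.Input) };
    ChatResponse response = await sut.GetResponseAsync(messages);

    // Disposing the scenario run at the end of the iteration persists its result.
    await using ScenarioRun run = await reporting.CreateScenarioRunAsync(example.Id);

    EvaluationResult result = await run.EvaluateAsync(
        messages,
        response,
        additionalContext:
        [
            new GroundednessEvaluatorContext(example.RetrievedContext), // for GroundednessEvaluator
            new EquivalenceEvaluatorContext(example.Expected)           // for EquivalenceEvaluator
        ]);
    results.Add(result);
}

// The stored results under ./eval-results feed the report that CI consumes.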

❌ Wrong: judge = generator

// The model under test is also the judge. Scores will be optimistic.
IChatClient generator = openAi.GetChatClient("gpt-4o").AsIChatClient();
IChatClient judge     = openAi.GetChatClient("gpt-4o").AsIChatClient();

Always pin the judge to a different model family or version, and document the pairing.

❌ Wrong: no baseline comparison

// We computed scores. We didn't compare them to last week. CI is decorative.
foreach (var r in results) Console.WriteLine($"{r.Name}: {r.Score}");

✅ Correct: regression gate

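// Assumption: `store` is your own persistence wrapper, and each record exposes
// Name (metric name) and Score (numeric value). Neither is a library type.
var current  = await store.LoadAsync("pr-1234");
var baseline = await store.LoadAsync("main-latest");
const double MaxDrop = 0.15;   // points on a 1-5 scale, about 3% of full scale
var failed = false;

foreach (var metric in current.GroupBy(r => r.Name))
{
    var baseScores = baseline.Where(r => r.Name == metric.Key).ToList();
    if (baseScores.Count == 0) continue;   // new metric: nothing to compare against

    var avgNow  = metric.Average(m => m.Score);
    var avgBase = baseScores.Average(m => m.Score);
    if (avgNow < avgBase - MaxDrop)
    {
        Console.Error.WriteLine($"{metric.Key}: {avgNow:F2} vs baseline {avgBase:F2}");
        failed = true;
    }
}

if (failed) Environment.Exit(1);   // fail the PR after listing every regressed metric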

Design patterns for this topic

Pattern 1 — "PR gate (cheap, deterministic, fast)"

  • Intent: every PR runs golden set (~100 inputs) × 3 evaluators in <2 minutes.
  • Tactics: small dataset, cached judge calls keyed on (input_hash, response_hash, judge_model), hard score thresholds.
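The cache key from the tactics above can be sketched as follows. JudgeCache is a hypothetical helper, and SHA-256 is an assumed choice of hash; any stable hash would do.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

string key = JudgeCache.Key(
    "What is our refund policy?",
    "Refunds are accepted within 30 days.",
    "gpt-4o-2024-11-20");
Console.WriteLine(key);

public static class JudgeCache
{
    // Hex-encoded SHA-256 of the UTF-8 bytes.
    private static string Hash(string s) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(s)));

    // Same input, response and judge model always produce the same key,
    // so a re-run can reuse the stored verdict instead of calling the judge.
    public static string Key(string input, string response, string judgeModel) =>
        $"{Hash(input)}:{Hash(response)}:{judgeModel}";
}
```

Including the judge model in the key matters: switching the judge must invalidate old verdicts, otherwise scores from two different judges get mixed in one report.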

Pattern 2 — "Nightly heavy suite"

  • Intent: broad coverage and trend tracking.
  • Tactics: 10k production replay + adversarial set, all evaluators, results piped to Power BI / Datadog. Alerts on week-over-week regression.

Pattern 3 — "Pre-release red team"

  • Intent: catch jailbreaks and prompt-injection before launch.
  • Tactics: curated adversarial set + custom RefusalEvaluator that scores "did the model correctly refuse?".
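Before paying for an LLM judge, a refusal check can start as a cheap pattern match. The sketch below is only a first-pass heuristic under stated assumptions: RefusalCheck and its phrase list are hypothetical, English-only, and will misfire on sentences such as "I can't wait to help".

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(RefusalCheck.IsCorrect(shouldRefuse: true, "I can't help with that request."));
Console.WriteLine(RefusalCheck.IsCorrect(shouldRefuse: true, "Sure, here is the admin password."));

public static class RefusalCheck
{
    // A refusal opener followed, somewhere later, by a verb describing the declined action.
    private static readonly Regex RefusalPattern = new(
        @"\b(?:I can'?t|I cannot|I won'?t|I am unable to|I'm unable to)\b.*\b(?:help|assist|comply|provide|share)\b",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static bool LooksLikeRefusal(string response) =>
        RefusalPattern.IsMatch(response);

    // Correct means: refused when it should have, answered when it should have.
    public static bool IsCorrect(bool shouldRefuse, string response) =>
        LooksLikeRefusal(response) == shouldRefuse;
}
```

Each adversarial example then needs only a shouldRefuse label, and the same check also catches over-refusal on benign prompts.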

Pattern 4 — "Custom domain evaluator"

  • Intent: generic evaluators do not cover business correctness.
  • Tactics: implement IEvaluator with a domain rubric. Example: a mortgage app's LegalDisclaimerPresentEvaluator.

Pattern 5 — "Judge ensemble + majority vote"

  • Intent: kill single-judge bias.
  • Tactics: run three judges (GPT-4o, Claude Sonnet, Gemini Pro), take the median per metric. ~3× cost, materially less variance.
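Combining the judges is the easy part. A minimal sketch of the median step, with hypothetical judge names and scores:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One score per judge for the same response and metric (illustrative values).
var judgeScores = new Dictionary<string, double>
{
    ["judge-a"] = 4,
    ["judge-b"] = 2,
    ["judge-c"] = 5,
};

Console.WriteLine(Ensemble.Median(judgeScores.Values));

public static class Ensemble
{
    // Median: the middle value, or the mean of the two middle values for an even count.
    public static double Median(IEnumerable<double> scores)
    {
        var sorted = scores.OrderBy(s => s).ToArray();
        if (sorted.Length == 0)
            throw new ArgumentException("At least one score is required.", nameof(scores));

        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

With three judges the median discards the single outlier (the 2 above) rather than letting it drag the mean down.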

Pattern 6 — "Eval-as-feature-flag gate"

  • Intent: ship a new prompt/model behind a flag, eval continuously, auto-rollback on regression.
  • Tactics: see Feature Flags. Flag toggles the new path for 5% of traffic. Eval metrics on production samples flow into a dashboard. If groundedness drops >0.2, the flag flips off automatically.

Pros & cons / trade-offs

Aspect | Pros | Cons
Built-in evaluators | Free, zero custom prompt work | Generic; may not match your domain rubric
LLM-as-judge | Scales to any rubric; reference-free | Costs tokens; non-deterministic without temp=0
BLEU / ROUGE | Cheap, deterministic | Surface-form only; misses paraphrase quality
Custom evaluator | Domain-precise | You own the prompt and the failure modes
Judge ensemble | Reduces bias | 3× cost, more orchestration
CI gate | Catches regressions in PR | Slow PRs if dataset too big; flaky if temp>0

When to use / when to avoid

Use when:

  • You have any LLM feature in production or moving toward it.
  • You are about to switch models, prompts, or retrieval, and you need a baseline to compare against.
  • You ship to regulated users (medical, legal, financial) and need an audit trail of quality scores.

Avoid when:

  • You have no users yet and no golden set; build the dataset first, the harness second.
  • The feature is purely deterministic (function-call routing with strict JSON schema and unit tests already cover it).
  • Cost is the binding constraint and the LLM-judge bill exceeds the production inference bill; sample harder or move to cheaper judges.


Interview Q&A

1. Why have an eval harness when you have unit tests? Unit tests verify code paths; an eval harness verifies answers. With LLMs the failure mode is "the answer is plausible but wrong" — unreachable by assertion-style tests. The harness gives you a numeric quality signal that moves with prompt changes, model swaps, retrieval tweaks, and temperature. Without it you are flying blind every time you bump a model version.

2. Why is self-evaluation a problem and how do you fix it? Models score their own outputs higher than independent judges do, typically by 3-7 percentage points on most rubrics. The fix is to enforce a different model family or version for the judge, and document the pairing. For high-stakes scoring, ensemble three judges from different vendors and take the median; bias drops materially and variance flattens.

3. EquivalenceEvaluator vs GroundednessEvaluator — when each? EquivalenceEvaluator needs a gold answer and asks "does the response say the same thing?". Use it on curated golden sets where you control the truth. GroundednessEvaluator needs the retrieved context and asks "is the response supported by the context?". Use it for RAG, where you do not have one canonical answer but you do have the source docs. Most RAG suites run both: equivalence for the curated 100, groundedness for everything.

4. How do you keep eval cost bounded when you replay 10k production queries? Three levers. First, sample: stratify by feature/tenant and grab 500-2000 representative queries instead of all 10k. Second, cache: key judge calls on (input_hash, response_hash, judge_model_version) so re-runs after a non-prompt change are free. Third, tier the suite: cheap evaluators (exact-match, regex refusal) on the full set, expensive LLM-judge on a 10% sample.
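The sampling lever can be sketched as a stratified draw. ReplayQuery and Sampling are hypothetical names, the 90/10 feature split is made up for the example, and the shuffle (sorting by random keys) is a simple approximation rather than a rigorous sampler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Fake replay log: 900 "search" queries and 100 "billing" queries.
var queries = Enumerable.Range(0, 1000)
    .Select(i => new ReplayQuery(i, i % 10 == 0 ? "billing" : "search"))
    .ToList();

var sample = Sampling.Stratified(queries, q => q.Feature, perStratum: 50, seed: 42);

foreach (var group in sample.GroupBy(q => q.Feature))
    Console.WriteLine($"{group.Key}: {group.Count()}");

public sealed record ReplayQuery(int Id, string Feature);

public static class Sampling
{
    // Take up to perStratum random items from each stratum.
    // A fixed seed keeps the sample stable between runs.
    public static List<T> Stratified<T>(
        IEnumerable<T> items, Func<T, string> stratum, int perStratum, int seed)
    {
        var rng = new Random(seed);
        return items
            .GroupBy(stratum)
            .OrderBy(g => g.Key, StringComparer.Ordinal)
            .SelectMany(g => g.OrderBy(_ => rng.Next()).Take(perStratum))
            .ToList();
    }
}
```

A uniform 10% sample of this log would contain roughly ten billing queries; stratifying guarantees the rare feature is represented as well as the common one.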

5. The CI gate keeps flapping — what knobs do you turn? Flapping = noise, not signal. Lower temperature on the judge (temperature=0, seed=42 if available). Average over multiple runs of the same example. Widen the regression threshold to 2× the historical standard deviation of the metric. Pin the judge to an exact model snapshot (gpt-4o-2024-11-20, not gpt-4o) so silent vendor upgrades do not move scores under you.
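A noise-aware threshold can be derived from past baseline runs. The sketch below takes the historical spread to be the sample standard deviation, an assumption chosen because it is in the same units as the score; Gate and the sample numbers are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Average score of the same metric over five earlier runs on main (illustrative).
double[] baselineRuns = { 4.10, 4.20, 4.00, 4.15, 4.05 };
double current = 3.70;

double threshold = Gate.NoiseThreshold(baselineRuns);
Console.WriteLine($"threshold={threshold:F3} regressed={Gate.IsRegression(current, baselineRuns, threshold)}");

public static class Gate
{
    // Sample standard deviation (n - 1 in the denominator).
    public static double StdDev(IReadOnlyCollection<double> xs)
    {
        if (xs.Count < 2)
            throw new ArgumentException("Need at least two runs.", nameof(xs));
        double mean = xs.Average();
        double sumSquares = xs.Sum(x => (x - mean) * (x - mean));
        return Math.Sqrt(sumSquares / (xs.Count - 1));
    }

    // Allowed drop: a multiple of the run-to-run noise.
    public static double NoiseThreshold(IReadOnlyCollection<double> xs, double multiplier = 2.0) =>
        multiplier * StdDev(xs);

    // A regression is a drop below the baseline mean by more than the threshold.
    public static bool IsRegression(double current, IReadOnlyCollection<double> xs, double threshold) =>
        current < xs.Average() - threshold;
}
```

A drop smaller than the threshold is indistinguishable from judge noise, so the gate lets it through instead of failing the build.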

6. How do you build a golden set from scratch? Mine real or hypothetical user queries (Slack threads, support tickets, design-doc examples). Have two domain SMEs answer each independently. Reconcile disagreements — that disagreement is the rubric calibration. 50-100 examples is enough to start; grow to 300-500 over a quarter. Tag each example with a feature/topic so you can slice scores later.

7. RAGAS vs Microsoft.Extensions.AI.Evaluation vs Promptfoo vs OpenAI Evals — when each? RAGAS is Python-first, RAG-specialized (faithfulness, answer_relevance, context_precision). Promptfoo is a CLI/YAML harness with great UX for prompt iteration, multi-provider. OpenAI Evals is OpenAI's eval framework, model-agnostic but written for their workflows. Microsoft.Extensions.AI.Evaluation is the .NET-native option built on IChatClient, integrates with your DI/OTel/cache pipeline, and is the right pick if you ship .NET. Use whichever, but use one.

8. What is a custom evaluator and when do you write one? You write one when no built-in metric captures the thing you actually care about. You implement IEvaluator: expose the metric names through EvaluationMetricNames and implement EvaluateAsync, which receives the conversation messages, the model response, an optional ChatConfiguration, and optional additional EvaluationContext objects, and returns an EvaluationResult. Inside you craft a rubric prompt, ask the judge for structured JSON output, and parse it. Examples shipped to production: LegalDisclaimerPresentEvaluator, JsonSchemaConformanceEvaluator, LanguageMatchesUserLocaleEvaluator, RefusalCorrectnessEvaluator.

9. Why structured output for the judge? Free-form judge output is unparseable at scale. Force JSON with ChatResponseFormat.ForJsonSchema(...) so you reliably extract { score: 4, reasoning: "...", evidence_spans: [...] }. The reasoning field is gold for debugging — it tells you why the judge scored low.
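On the consuming side, the JSON verdict can be parsed and validated with System.Text.Json. This is a sketch assuming the three-field shape quoted above and a 1-5 rubric; JudgeVerdict and JudgeOutput are hypothetical names.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

string judgeJson =
    @"{ ""score"": 4, ""reasoning"": ""Addresses the question but omits the fee."", ""evidence_spans"": [""30 days""] }";

JudgeVerdict verdict = JudgeOutput.Parse(judgeJson);
Console.WriteLine($"{verdict.Score}: {verdict.Reasoning}");

public sealed record JudgeVerdict(
    [property: JsonPropertyName("score")] int Score,
    [property: JsonPropertyName("reasoning")] string Reasoning,
    [property: JsonPropertyName("evidence_spans")] string[] EvidenceSpans);

public static class JudgeOutput
{
    // Deserialize, then reject verdicts that fall outside the rubric.
    public static JudgeVerdict Parse(string json)
    {
        var verdict = JsonSerializer.Deserialize<JudgeVerdict>(json)
            ?? throw new JsonException("Judge returned JSON null.");

        if (verdict.Score is < 1 or > 5)
            throw new JsonException($"Score {verdict.Score} is outside the 1-5 rubric.");

        return verdict;
    }
}
```

Validating the range at parse time turns a misbehaving judge into a visible error instead of a silently skewed average.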

10. How do you correlate offline eval scores with production user satisfaction? Instrument both. Log eval scores per release. Log user thumbs-up/down or implicit signals (regenerate-rate, time-to-accept) per session. Plot the two over time. If they diverge — eval up, user satisfaction flat — your eval set has drifted from the real distribution. Refresh the dataset from production replay.

11. How do you eval a multi-turn conversation, not just a single response? EvaluateAsync accepts the full message history, so you can score the final assistant turn in the context of everything that came before. For multi-turn quality (e.g., "did the assistant correctly remember the user's earlier constraint?") you write a custom ConversationCoherenceEvaluator that grades the whole transcript on a rubric like "consistency", "resolution", "task completion". Single-turn evaluators on the last turn miss this entirely.

12. The judge gives high scores but production users complain. What's wrong? Three usual suspects. (a) Eval set drift: your golden set looks nothing like real queries; build production replay. (b) Judge style bias: your judge prefers verbose, hedged answers; your users want crisp ones. Tune the rubric and add a BrevityEvaluator. (c) You measure the wrong dimension: high relevance but low groundedness means hallucinations users notice but the relevance score doesn't catch. Always run the full evaluator suite, not just one.


Gotchas / common mistakes

  • ⚠️ Judge = generator. Scores look great in dev, regress hard in production. Always cross-model.
  • ⚠️ No fixed seed / temperature on the judge. Re-runs disagree by 0.3-0.5 points; you cannot tell signal from noise.
  • ⚠️ Eval set built from synthetic queries only. The model that wrote them is also being graded. Confirmation bias bakes in.
  • ⚠️ No baseline persistence. You compute scores per PR but never compare across PRs — the gate is decorative.
  • ⚠️ Logging the judge's full transcript without redaction. Customer prompts contain PII. Same rules as production logging — EnableSensitiveData=false by default.
  • ⚠️ Eval cost > production cost. You ran the heavy suite per commit instead of nightly. Tier the suite.
  • ⚠️ Pinning to a moving model alias (gpt-4o). Vendor silently upgrades it; your scores move; you blame your prompt change. Pin to dated snapshots.
  • ⚠️ One evaluator only. Relevance is high but groundedness is low — the answer is on-topic and hallucinated. Always run multiple metrics.
  • ⚠️ Stripping the judge's reasoning before saving. That free-text reasoning is the single most useful debugging artifact when a metric regresses. Persist it.
  • ⚠️ Running eval only in CI, never in production. Sampled production-traffic eval catches the drift CI cannot — real queries don't look like your golden set after a year.

Further reading

API note: Microsoft.Extensions.AI.Evaluation and Microsoft.Extensions.AI.Evaluation.Quality are the package IDs. The judge is any IChatClient, so the harness gets caching, OTel and structured output for free. (Verify package versions on Microsoft Learn before pinning.)


Cross-references