Benchmarks & Evals

Key Points

Public benchmarks measure narrow capabilities, not performance on your domain.
Common benchmarks: MMLU (general), HumanEval (code), GPQA (PhD science), MATH, SWE-bench (coding agents), AIME (math).
Vendor-published leaderboards flatter the vendor: hand-picked configs, benchmark contamination, cherry-picked results.
Build your own eval set for production. Domain examples + ground-truth answers + automated grader.
Tools: Ragas (RAG), LangSmith, Promptfoo, Microsoft PromptFlow, manual review.

Public benchmarks (what they actually measure)

Benchmark	Tests
MMLU	Multi-domain knowledge (57 subjects). Now saturated.
MMLU-Pro	Harder MMLU; less saturated.
GPQA	PhD-level science. Strong differentiator.
HumanEval	Code completion (small Python tasks). Saturated.
MBPP	Code (small Python).
SWE-bench	Real GitHub issues; coding agents. Hard.
MATH	Competition math.
AIME	American Invitational Mathematics Examination. Competition math.
HellaSwag	Commonsense sentence completion.
ARC-AGI	Abstract reasoning on novel grid puzzles. Hard.
BIG-bench Hard	23 challenging tasks selected from BIG-bench.
MT-Bench	Open-ended multi-turn dialogue. LLM-as-judge.

Caveats

Contamination: training data may have included benchmark answers.
Prompt sensitivity: same model, different prompt → different scores.
Cherry-picked configs: vendors run with optimal prompts.
Saturation: benchmarks lose discriminating power as models improve.

Leaderboards

LMSYS Chatbot Arena (now LMArena): human pairwise voting, aggregated into a ranking. Widely regarded as the most credible aggregate.
Open LLM Leaderboard (Hugging Face): for open-weight models.
MTEB: embeddings.
HumanEval+, BigCodeBench: code.

Treat them as a starting point, not gospel.

Build your own eval

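using System;
using System.Collections.Generic;
using System.Diagnostics;

// `chat` is assumed to be an already-configured chat client exposing GetResponseAsync.
var cases = new[]
{
    new EvalCase("Refund policy", "What's the refund policy?", "30 days", 2000),
    new EvalCase("Multi-language", "¿Cómo está?", "Bien", 2000),
    /* 50-200 cases */
};

var results = new List<(string Name, bool Pass, long LatencyMs)>();
foreach (var c in cases)
{
    var sw = Stopwatch.StartNew();
    var resp = await chat.GetResponseAsync(c.Input);
    sw.Stop(); // time only the model call, not the grading
    // Case-insensitive substring match, plus a latency budget.
    var pass = resp.Text!.Contains(c.ExpectedSubstring, StringComparison.OrdinalIgnoreCase)
               && sw.ElapsedMilliseconds < c.MaxLatencyMs;
    results.Add((c.Name, pass, sw.ElapsedMilliseconds)); // record for reporting
}

// In a top-level program, type declarations must follow the statements.
public record EvalCase(string Name, string Input, string ExpectedSubstring, double MaxLatencyMs);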

Run nightly; track regressions.

RAG-specific evals

Faithfulness

Is every claim in the answer supported by the retrieved context it cites?

Answer: "The refund period is 30 days [Doc-1]."
Doc-1: "Customers may return items within 30 days..."
→ Faithful.

Answer: "We offer extended warranty [Doc-1]."
Doc-1: (about returns)
→ Hallucinated. Failure.
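
A cheap structural check can run before any semantic grading: confirm that every citation marker in the answer points at a document that was actually retrieved. The sketch below assumes the `[Doc-N]` marker style from the examples above; `CitationCheck` and `UnknownCitations` are hypothetical names. It would not catch the second example (a real document cited for an unsupported claim), which needs a semantic grader such as an LLM judge.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CitationCheck
{
    // Matches citation markers such as "[Doc-1]" (assumed marker format).
    private static readonly Regex Marker = new Regex(@"\[(Doc-\d+)\]");

    // Returns cited IDs that are not among the retrieved document IDs.
    public static IReadOnlyList<string> UnknownCitations(string answer, IEnumerable<string> retrievedIds)
    {
        var known = new HashSet<string>(retrievedIds, StringComparer.OrdinalIgnoreCase);
        return Marker.Matches(answer)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .Where(id => !known.Contains(id))
            .Distinct()
            .ToList();
    }
}
```

An empty result means only that the citations are well-formed, not that the answer is faithful.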

Relevance

Are the retrieved documents actually relevant to the query?

Precision@K

Of the top-K retrieved documents, what fraction is relevant?

Recall@K

Of all relevant documents in the corpus, what fraction appears in the top-K?
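
Both retrieval metrics reduce to counting hits in the top-K list. A minimal sketch, assuming documents are identified by string IDs and that the relevant set for each query has been labeled by hand; `RetrievalMetrics` is a hypothetical helper, and precision here divides by K (a common convention) even when fewer than K documents were returned.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RetrievalMetrics
{
    // Number of distinct relevant IDs among the first k retrieved IDs.
    private static int Hits(IReadOnlyList<string> retrieved, ISet<string> relevant, int k)
    {
        if (k <= 0) throw new ArgumentOutOfRangeException(nameof(k));
        return retrieved.Take(k).Distinct().Count(id => relevant.Contains(id));
    }

    // Precision@K: hits divided by K.
    public static double PrecisionAtK(IReadOnlyList<string> retrieved, ISet<string> relevant, int k)
        => (double)Hits(retrieved, relevant, k) / k;

    // Recall@K: hits divided by the total number of relevant documents.
    // Returns 0 when no document is labeled relevant (the metric is undefined there).
    public static double RecallAtK(IReadOnlyList<string> retrieved, ISet<string> relevant, int k)
        => relevant.Count == 0 ? 0.0 : (double)Hits(retrieved, relevant, k) / relevant.Count;
}
```

For example, with retrieved IDs `d1..d5`, relevant set `{d1, d3, d9}` and K = 5, two of the five retrieved are relevant (precision 2/5) and two of the three relevant were found (recall 2/3).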

LLM-as-judge

Use a separate (often more capable) LLM to grade.

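// Judge with a different (ideally more capable) model than the one under test.
var judge = _gpt4o;
// $$""" raw string: {{x}} interpolates, single braces stay literal.
var prompt = $$"""
Question: {{q}}
Answer: {{a}}
Reference: {{expected}}
Rate the answer 1-5 for accuracy and explain.
""";
var grade = await judge.GetResponseAsync(prompt);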

Caveats: judges tend to favor longer, more verbose answers; a model judging its own output is biased toward it.
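
A free-text "rate 1-5 and explain" reply is hard to aggregate, so it helps to make the score machine-readable. The sketch below assumes the judge prompt is extended with an instruction to end the reply with a line of the form `SCORE: <1-5>`; that instruction and the `JudgeParsing` helper are assumptions, not part of the prompt shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JudgeParsing
{
    // A line consisting only of "SCORE: n" with n in 1..5 (assumed reply format).
    private static readonly Regex ScoreLine =
        new Regex(@"^\s*SCORE:\s*([1-5])\s*$", RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Returns the score from the last SCORE line, or null if the reply has none.
    public static int? TryParseScore(string judgeReply)
    {
        if (string.IsNullOrEmpty(judgeReply)) return null;
        var matches = ScoreLine.Matches(judgeReply);
        if (matches.Count == 0) return null;
        return int.Parse(matches[matches.Count - 1].Groups[1].Value);
    }
}
```

Treating a null result as a grading failure, and counting those separately from low scores, keeps a badly formatted judge reply from silently passing or failing a case.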

Eval frameworks

Tool	Notes
Ragas	RAG-specific; faithfulness, answer relevancy
PromptFlow (Microsoft)	Azure-integrated; GUI
LangSmith (LangChain)	Hosted observability + eval
Promptfoo	OSS; CLI; CI-friendly
DeepEval	OSS; pytest-style

CI integration

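- name: Run evals
  run: dotnet test --filter Category=Eval
- name: Compare to baseline
  # Evaluator must exit non-zero when the score falls below the threshold.
  run: dotnet run --project Evaluator -- --baseline=main --threshold=0.95
- name: Report regression
  if: failure()
  run: echo "::error::Eval results regressed against the main baseline"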

Block merges that regress quality.
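
The comparison step can be quite small. The sketch below is one possible reading of `--threshold=0.95`: the current pass rate must be at least 95% of the baseline pass rate. That interpretation, the `BaselineGate` class, and the case-name-to-pass dictionaries are all assumptions; the real Evaluator project is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BaselineGate
{
    // Fraction of cases that passed; 0 for an empty run.
    public static double PassRate(IReadOnlyDictionary<string, bool> results)
        => results.Count == 0 ? 0.0 : (double)results.Values.Count(p => p) / results.Count;

    // Cases that passed on the baseline but fail in the current run.
    public static IReadOnlyList<string> NewlyFailing(
        IReadOnlyDictionary<string, bool> baseline,
        IReadOnlyDictionary<string, bool> current)
        => baseline
            .Where(kv => kv.Value && current.TryGetValue(kv.Key, out var now) && !now)
            .Select(kv => kv.Key)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();

    // Process exit code: 0 = acceptable, 1 = regression (fails the CI step).
    public static int ExitCode(
        IReadOnlyDictionary<string, bool> baseline,
        IReadOnlyDictionary<string, bool> current,
        double threshold)
    {
        if (threshold <= 0 || threshold > 1) throw new ArgumentOutOfRangeException(nameof(threshold));
        double floor = PassRate(baseline) * threshold;
        return PassRate(current) < floor ? 1 : 0;
    }
}
```

Printing the `NewlyFailing` list in the CI log makes a blocked merge actionable, since the aggregate pass rate alone does not say which cases broke.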

A/B testing in production

Roll out new model to N% of traffic; track key metrics; promote if better.

// Feature flag picks model
var chat = featureFlag.IsEnabled("gpt5-rollout") ? _gpt5 : _gpt4o;

Track: NPS, complaint rate, escalation rate.
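
If the feature-flag service does not handle percentage rollouts itself, a stable hash of the user ID can assign each user to a bucket so the same user always sees the same model. A minimal sketch; `Rollout`, `Bucket`, and `InRollout` are hypothetical names, and SHA-256 is used only because `string.GetHashCode` is not stable across processes in modern .NET.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Rollout
{
    // Maps (experiment, user) to a stable bucket in [0, 100).
    // Including the experiment name keeps buckets independent across experiments.
    public static int Bucket(string experiment, string userId)
    {
        using var sha = SHA256.Create();
        byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(experiment + ":" + userId));
        // Read the first four bytes as a big-endian unsigned integer.
        uint value = (uint)hash[0] << 24 | (uint)hash[1] << 16 | (uint)hash[2] << 8 | hash[3];
        return (int)(value % 100);
    }

    // True when the user falls inside the first `percent` buckets.
    public static bool InRollout(string experiment, string userId, int percent)
        => Bucket(experiment, userId) < percent;
}
```

A call such as `Rollout.InRollout("gpt5-rollout", userId, 10)` would then stand in for the flag check, and raising the percentage only adds users to the new model without reshuffling those already on it.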

Senior considerations

Don't trust benchmarks: build your own test set early.
Eval quality matters more than benchmark scores: bad evals → wrong model picks.
Human review of edge cases: automated graders miss subtle failures.
Track over time: log per-eval-case scores; spot regressions across model versions.

Anti-patterns

❌ "Vendor X said they're #1 on MMLU" — pick model based on YOUR data.
❌ Eval suite of 5 cases — not statistically meaningful.
❌ LLM-as-judge with same model — biased.
❌ Manual eval only — doesn't scale.
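
To see why a 5-case suite says little, it helps to put a confidence interval around the pass rate. The sketch below uses the Wilson score interval at 95% confidence (z = 1.96); the choice of interval and the `PassRateInterval` helper are assumptions made for illustration.

```csharp
using System;

public static class PassRateInterval
{
    // Wilson score interval (95%) for a pass rate of passed/total.
    public static (double Low, double High) Wilson95(int passed, int total)
    {
        if (total <= 0) throw new ArgumentOutOfRangeException(nameof(total));
        if (passed < 0 || passed > total) throw new ArgumentOutOfRangeException(nameof(passed));

        const double z = 1.96;          // normal quantile for 95% confidence
        double n = total;
        double p = passed / n;          // observed pass rate
        double z2 = z * z;
        double denom = 1 + z2 / n;
        double center = (p + z2 / (2 * n)) / denom;
        double half = z * Math.Sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom;
        return (Math.Max(0, center - half), Math.Min(1, center + half));
    }
}
```

With 4 of 5 cases passing, the interval spans roughly 0.38 to 0.96, so an 80% score is compatible with almost any underlying quality. With 160 of 200 passing it narrows to roughly 0.74 to 0.85, which is tight enough to compare two models.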