Skip to content

Benchmarks & Evals

Key Points

  • Public benchmarks measure narrow capabilities; not your domain.
  • Common benchmarks: MMLU (general), HumanEval (code), GPQA (PhD science), MATH, SWE-bench (coding agents), AIME (math).
  • Vendor leaderboards lie a bit — picked configs, contamination, cherry-picked.
  • Build your own eval set for production. Domain examples + ground-truth answers + automated grader.
  • Tools: Ragas (RAG), LangSmith, Promptfoo, Microsoft PromptFlow, manual review.

Public benchmarks (what they actually measure)

Benchmark Tests
MMLU Multi-domain knowledge (57 subjects). Now saturated.
MMLU-Pro Harder MMLU; less saturated.
GPQA PhD-level science. Strong differentiator.
HumanEval Code completion (small Python tasks). Saturated.
MBPP Code (small Python).
SWE-bench Real GitHub issues; coding agents. Hard.
MATH Competition math.
AIME American Invitational Mathematics Exam.
HellaSwag Common sense.
ARC-AGI Abstract reasoning. AGI-ish hard.
BIG-bench Hard 23 hard tasks.
MT-Bench Open-ended multi-turn dialogue. LLM-as-judge.

Caveats

  • Contamination: training data may have included benchmark answers.
  • Prompt sensitivity: same model, different prompt → different scores.
  • Cherry-picked configs: vendors run with optimal prompts.
  • Saturation: benchmarks lose discriminating power as models improve.

Vendor leaderboards

  • LMSYS Chatbot Arena: human pairwise voting. Most credible aggregate.
  • OpenLLM Leaderboard (HuggingFace): for open weights.
  • MTEB: embeddings.
  • HumanEval+, BigCodeBench: code.

Treat as starting point, not gospel.

Build your own eval

public record EvalCase(string Name, string Input, string ExpectedSubstring, double MaxLatencyMs);

var cases = new[]
{
    new EvalCase("Refund policy", "What's the refund policy?", "30 days", 2000),
    new EvalCase("Multi-language", "Cómo está?", "Bien", 2000),
    /* 50-200 cases */
};

foreach (var c in cases)
{
    var sw = Stopwatch.StartNew();
    var resp = await chat.GetResponseAsync(c.Input);
    var pass = resp.Text!.Contains(c.ExpectedSubstring) && sw.ElapsedMilliseconds < c.MaxLatencyMs;
    /* record */
}

Run nightly; track regressions.

RAG-specific evals

Faithfulness

Does the answer cite real sources from the retrieved context?

Answer: "The refund period is 30 days [Doc-1]."
Doc-1: "Customers may return items within 30 days..."
→ Faithful.

Answer: "We offer extended warranty [Doc-1]."
Doc-1: (about returns)
→ Hallucinated. Failure.

Relevance

Is the retrieval actually relevant to the query?

Precision@K

Of top-K retrieved, how many are relevant?

Recall@K

Of all relevant docs in corpus, how many in top-K?

LLM-as-judge

Use a separate (often more capable) LLM to grade.

var judge = _gpt4o;
var prompt = $$"""
Question: {{q}}
Answer: {{a}}
Reference: {{expected}}
Rate the answer 1-5 for accuracy and explain.
""";
var grade = await judge.GetResponseAsync(prompt);

Caveats: bias toward longer / verbose answers; same-model judging itself is biased.

Eval frameworks

Tool Notes
Ragas RAG-specific; faithfulness, answer relevancy
PromptFlow (Microsoft) Azure-integrated; GUI
LangSmith (LangChain) Hosted observability + eval
Promptfoo OSS; CLI; CI-friendly
DeepEval OSS; pytest-style

CI integration

- name: Run evals
  run: dotnet test --filter Category=Eval
- name: Compare to baseline
  run: dotnet run --project Evaluator -- --baseline=main --threshold=0.95
- name: Fail if regression
  if: failure()

Block merges that regress quality.

A/B testing in production

Roll out new model to N% of traffic; track key metrics; promote if better.

// Feature flag picks model
var chat = featureFlag.IsEnabled("gpt5-rollout") ? _gpt5 : _gpt4o;

Track: NPS, complaint rate, escalation rate.

Senior considerations

  • Don't trust benchmarks: build your own test set early.
  • Eval quality matters more than benchmark scores: bad evals → wrong model picks.
  • Human review of edge cases: automated graders miss subtle failures.
  • Track over time: log per-eval-case scores; spot regressions across model versions.

Anti-patterns

  • ❌ "Vendor X said they're #1 on MMLU" — pick model based on YOUR data.
  • ❌ Eval suite of 5 cases — not statistically meaningful.
  • LLM-as-judge with same model — biased.
  • ❌ Manual eval only — doesn't scale.

Cross-references