Benchmarks & Evals
Key Points
- Public benchmarks measure narrow, general-purpose capabilities, not performance on your domain.
- Common benchmarks: MMLU (general), HumanEval (code), GPQA (PhD science), MATH, SWE-bench (coding agents), AIME (math).
- Vendor-reported leaderboard scores skew optimistic: hand-picked configurations, training-data contamination, and cherry-picked results.
- Build your own eval set for production: domain examples, ground-truth answers, and an automated grader.
- Tools: Ragas (RAG), LangSmith, Promptfoo, Microsoft PromptFlow, manual review.
Public benchmarks (what they actually measure)
| Benchmark | Tests |
|---|---|
| MMLU | Multi-domain knowledge (57 subjects). Now saturated. |
| MMLU-Pro | Harder MMLU; less saturated. |
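| GPQA | Graduate-level science questions (biology, physics, chemistry) written to resist lookup. Strong differentiator. |
| HumanEval | Code completion (164 small Python tasks). Saturated. |
| MBPP | Entry-level Python programming problems. |
| SWE-bench | Real GitHub issues; coding agents. Hard. |
| MATH | Competition math. |
| AIME | American Invitational Mathematics Examination problems (integer answers). |
| HellaSwag | Commonsense sentence completion. |
| ARC-AGI | Abstract reasoning on novel grid puzzles. Hard for current models. |
| BIG-bench Hard | 23 hard tasks drawn from BIG-bench. |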
| MT-Bench | Open-ended multi-turn dialogue. LLM-as-judge. |
Caveats
- Contamination: training data may have included benchmark answers.
- Prompt sensitivity: same model, different prompt → different scores.
- Cherry-picked configs: vendors run with optimal prompts.
- Saturation: benchmarks lose discriminating power as models improve.
Public leaderboards
- LMSYS Chatbot Arena (now LMArena): crowd-sourced human pairwise voting. The most credible aggregate signal.
- Open LLM Leaderboard (Hugging Face): open-weight models.
- MTEB: embedding models.
- HumanEval+, BigCodeBench: code.
Treat these as a starting point, not gospel.
Build your own eval
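using System;
using System.Diagnostics;

// `chat` is assumed to be a chat client configured elsewhere.
var cases = new[]
{
    new EvalCase("Refund policy", "What's the refund policy?", "30 days", 2000),
    new EvalCase("Multi-language", "¿Cómo está?", "Bien", 2000),
    /* 50-200 cases */
};

foreach (var c in cases)
{
    var sw = Stopwatch.StartNew();
    var resp = await chat.GetResponseAsync(c.Input);
    sw.Stop(); // stop timing before grading

    // Pass only if the expected text appears and the call met its latency budget.
    var text = resp.Text ?? string.Empty;
    var pass = text.Contains(c.ExpectedSubstring, StringComparison.OrdinalIgnoreCase)
        && sw.ElapsedMilliseconds < c.MaxLatencyMs;
    /* record */
}

// In a single-file program, type declarations must come after the top-level statements.
public record EvalCase(string Name, string Input, string ExpectedSubstring, double MaxLatencyMs);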
Run nightly; track regressions.
RAG-specific evals
Faithfulness
Is every claim in the answer supported by the retrieved context it cites?
Answer: "The refund period is 30 days [Doc-1]."
Doc-1: "Customers may return items within 30 days..."
→ Faithful.
Answer: "We offer extended warranty [Doc-1]."
Doc-1: (about returns)
→ Hallucinated. Failure.
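A cheap first pass, before any judge model is involved, is to confirm that every citation marker in an answer points at a document that was actually retrieved. The sketch below assumes citations use the `[Doc-N]` format from the examples above; `FindUnknownCitations` is a hypothetical helper name. It only catches citations to documents that were never retrieved. The second example above cites a real document for an unsupported claim, so it would pass this check and still needs a judge or human review.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Returns citation ids that appear in the answer but were not in the retrieved set.
// Assumption: citations look like [Doc-1], as in the examples above.
static List<string> FindUnknownCitations(string answer, IEnumerable<string> retrievedIds)
{
    var retrieved = new HashSet<string>(retrievedIds, StringComparer.OrdinalIgnoreCase);
    return Regex.Matches(answer, @"\[(Doc-\d+)\]")
        .Cast<Match>()
        .Select(m => m.Groups[1].Value)
        .Where(id => !retrieved.Contains(id))
        .Distinct()
        .ToList();
}

var unknown = FindUnknownCitations(
    "The refund period is 30 days [Doc-1]. Warranty is extended [Doc-7].",
    new[] { "Doc-1", "Doc-2" });

Console.WriteLine(string.Join(", ", unknown)); // prints "Doc-7"
```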
Relevance
Are the retrieved documents actually relevant to the query?
Precision@K
Of the top-K retrieved documents, what fraction are relevant?
Recall@K
Of all relevant documents in the corpus, what fraction appear in the top-K?
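Both metrics reduce to counting hits in the top-K: precision divides by K, recall divides by the number of relevant documents. A minimal sketch, assuming each query has a ranked list of retrieved ids (no duplicates) and a hand-labeled set of relevant ids. The function names and sample ids are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Precision@K: hits in the top K divided by K.
static double PrecisionAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k)
{
    if (k <= 0) return 0.0;
    int hits = ranked.Take(k).Count(id => relevant.Contains(id));
    return (double)hits / k;
}

// Recall@K: hits in the top K divided by the total number of relevant documents.
static double RecallAtK(IReadOnlyList<string> ranked, ISet<string> relevant, int k)
{
    if (relevant.Count == 0) return 0.0;
    int hits = ranked.Take(k).Count(id => relevant.Contains(id));
    return (double)hits / relevant.Count;
}

var results = new List<string> { "d3", "d7", "d1", "d9", "d4" }; // retriever output, best first
var gold = new HashSet<string> { "d1", "d3", "d5" };             // labeled relevant docs

Console.WriteLine(PrecisionAtK(results, gold, 3)); // 2 of the top 3 are relevant
Console.WriteLine(RecallAtK(results, gold, 3));    // 2 of the 3 relevant docs were found
```

Averaging these per-query values over the whole eval set gives one number per retriever configuration to compare.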
LLM-as-judge
Use a separate (often more capable) LLM to grade.
// `_gpt4o` is the judge model's chat client; `q`, `a`, and `expected` come from the eval case.
var judge = _gpt4o;
var prompt = $$"""
    Question: {{q}}
    Answer: {{a}}
    Reference: {{expected}}
    Rate the answer 1-5 for accuracy and explain.
    """;
var grade = await judge.GetResponseAsync(prompt);
Caveats: judges tend to favor longer, more verbose answers, and a model grading its own output is biased toward it.
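Free-text grades are hard to aggregate, so one option is to ask the judge to end its reply with a line such as `Score: 4` and parse that. This is a sketch under that assumption: the `Score:` output format and the `TryParseScore` helper are not part of the prompt above and would need to be added to it.

```csharp
using System;
using System.Text.RegularExpressions;

// Extracts a 1-5 score from judge output containing a line like "Score: 4".
// Returns null when no valid score is found, so the case can be flagged for human review.
static int? TryParseScore(string judgeOutput)
{
    var match = Regex.Match(judgeOutput, @"Score:\s*([1-5])\b", RegexOptions.IgnoreCase);
    return match.Success ? int.Parse(match.Groups[1].Value) : (int?)null;
}

int? parsed = TryParseScore("The answer matches the reference.\nScore: 4");
int? missing = TryParseScore("I cannot grade this.");

Console.WriteLine(parsed);          // prints 4
Console.WriteLine(missing is null); // prints True
```

Treating an unparseable reply as "needs review" rather than as a failing score keeps judge formatting errors from being counted as model regressions.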
Eval frameworks
| Tool | Notes |
|---|---|
| Ragas | RAG-specific; faithfulness, answer relevancy |
| PromptFlow (Microsoft) | Azure-integrated; GUI |
| LangSmith (LangChain) | Hosted observability + eval |
| Promptfoo | OSS; CLI; CI-friendly |
| DeepEval | OSS; pytest-style |
CI integration
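- name: Run evals
  run: dotnet test --filter Category=Eval
- name: Compare to baseline
  # Assumes the Evaluator project exits non-zero when scores fall below the threshold.
  run: dotnet run --project Evaluator -- --baseline=main --threshold=0.95
- name: Report regression
  if: failure()
  run: echo "::error::Eval scores regressed against the baseline"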
Block merges that regress quality.
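The comparison step needs a rule for what counts as a regression. The sketch below shows one possible reading of `--threshold=0.95`: the candidate's pass rate must be at least 95% of the baseline's. That interpretation, and the `PassesGate` name, are assumptions; the notes above do not define how the Evaluator project works.

```csharp
using System;
using System.Linq;

// True when the candidate's pass rate is at least `threshold` times the baseline's.
static bool PassesGate(bool[] baseline, bool[] candidate, double threshold)
{
    if (baseline.Length == 0 || candidate.Length == 0)
        throw new ArgumentException("Both runs need at least one eval case.");

    double baselineRate = baseline.Count(p => p) / (double)baseline.Length;
    double candidateRate = candidate.Count(p => p) / (double)candidate.Length;
    return candidateRate >= threshold * baselineRate;
}

var baselineRun = new[] { true, true, true, true, false };   // 80% pass
var candidateRun = new[] { true, true, true, false, false }; // 60% pass

// 0.60 is below 0.95 * 0.80, so the gate fails.
// A real tool would return this value as the process exit code so the CI step fails.
int exitCode = PassesGate(baselineRun, candidateRun, 0.95) ? 0 : 1;
Console.WriteLine(exitCode); // prints 1
```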
A/B testing in production
Roll out the new model to N% of traffic, track key metrics, and promote it only if they improve.
Track: NPS, complaint rate, escalation rate.
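For the metrics to be comparable, each user should see the same model on every request. One common way to get that is to hash a stable identifier into a bucket from 0 to 99 and compare it with the rollout percentage. The sketch assumes a string user id is available as the routing key; `UseNewModel` is a hypothetical helper.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

// Deterministically assigns a user to the new model for `rolloutPercent`% of users.
// SHA-256 is used because string.GetHashCode() is randomized per process in .NET.
static bool UseNewModel(string userId, int rolloutPercent)
{
    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(userId));
    uint bucket = BinaryPrimitives.ReadUInt32LittleEndian(hash) % 100; // stable bucket in [0, 100)
    return bucket < rolloutPercent;
}

int onNew = 0;
for (int i = 0; i < 10_000; i++)
{
    if (UseNewModel($"user-{i}", 10)) onNew++;
}

Console.WriteLine($"{onNew} of 10000 users routed to the new model"); // roughly 10%
```

Raising `rolloutPercent` only adds users to the new-model group; nobody already on the new model is moved back, which keeps per-user metrics consistent during a ramp-up.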
Senior considerations
- Don't trust benchmarks: build your own test set early.
- Eval quality matters more than benchmark scores: bad evals → wrong model picks.
- Human review of edge cases: automated graders miss subtle failures.
- Track over time: log per-eval-case scores; spot regressions across model versions.
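Logging per-case results makes it possible to see which cases regressed, not just that the aggregate moved. A minimal sketch, assuming each run is stored as a map from case name to pass/fail. The storage shape and the `FindRegressions` name are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Names of cases that passed in the previous run but fail in the current one.
static List<string> FindRegressions(
    IReadOnlyDictionary<string, bool> previous,
    IReadOnlyDictionary<string, bool> current)
{
    return previous
        .Where(kv => kv.Value && current.TryGetValue(kv.Key, out var nowPasses) && !nowPasses)
        .Select(kv => kv.Key)
        .OrderBy(name => name, StringComparer.Ordinal)
        .ToList();
}

var lastRun = new Dictionary<string, bool> { ["Refund policy"] = true, ["Multi-language"] = true };
var thisRun = new Dictionary<string, bool> { ["Refund policy"] = true, ["Multi-language"] = false };

var regressions = FindRegressions(lastRun, thisRun);
Console.WriteLine(string.Join(", ", regressions)); // prints "Multi-language"
```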
Anti-patterns
- ❌ "Vendor X said they're #1 on MMLU" — pick model based on YOUR data.
- ❌ Eval suite of 5 cases — not statistically meaningful.
- ❌ LLM-as-judge with same model — biased.
- ❌ Manual eval only — doesn't scale.
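On the five-case anti-pattern: a confidence interval makes the problem concrete. The sketch below computes a 95% Wilson score interval for an observed pass rate. The choice of the Wilson interval and the sample numbers are illustrative, not something the notes prescribe.

```csharp
using System;

// 95% Wilson score interval for a pass rate of `passed` out of `total` cases.
static (double Low, double High) WilsonInterval(int passed, int total, double z = 1.96)
{
    if (total <= 0) throw new ArgumentOutOfRangeException(nameof(total));

    double p = (double)passed / total;
    double denom = 1 + z * z / total;
    double center = (p + z * z / (2.0 * total)) / denom;
    double half = z * Math.Sqrt(p * (1 - p) / total + z * z / (4.0 * total * total)) / denom;
    return (center - half, center + half);
}

var small = WilsonInterval(4, 5);     // 80% pass rate on 5 cases
var large = WilsonInterval(160, 200); // 80% pass rate on 200 cases

Console.WriteLine($"n=5:   {small.Low:F2} to {small.High:F2}");
Console.WriteLine($"n=200: {large.Low:F2} to {large.High:F2}");
```

With the same 80% observed pass rate, the interval runs from roughly 0.38 to 0.96 for 5 cases and from roughly 0.74 to 0.85 for 200, so a five-case suite cannot distinguish a mediocre model from an excellent one.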