Model Selection Decision Matrix
Key Points
Pick a model by workload axis: latency, cost, context, function calling, structured output, vision, code, reasoning. The top contenders rotate as models evolve, so re-evaluate quarterly.
Decision matrix
By criteria
| Criterion | Top picks (2026) |
|---|---|
| Lowest cost / best value | gpt-4o-mini, Claude Haiku, Gemini 2.5 Flash, Llama 3.x via Groq |
| Highest quality | GPT-5, Claude 4 Opus, Gemini 2.5 Pro, o3 |
| Code generation | Claude 4 Sonnet, GPT-5, o3, DeepSeek Coder |
| Long context | Claude 4 (1M), Gemini 2.5 Pro (2M) |
| Math / logic | o3, DeepSeek R1, Claude 4 Opus extended thinking |
| Function calling | OpenAI (strict mode mature), Claude (mature), Gemini (improving) |
| Structured output | OpenAI (JSON Schema strict), Claude (tool use → JSON) |
| Multimodal text+image | GPT-4o, Claude 4, Gemini 2.5 |
| Multimodal text+audio | GPT-4o, Gemini 2.5 |
| Multimodal text+video | Gemini 2.5 |
| Real-time / low latency | gpt-4o-mini, Groq-hosted Llama, Claude Haiku |
| Reasoning (CoT) | o3, DeepSeek R1, Claude 4 with extended thinking |
| Embedding quality | text-embedding-3-large, voyage-3 |
| Embedding cost | text-embedding-3-small |
| Code embedding | voyage-code-3 |
| Reranker | Cohere Rerank, Jina |
| Vision OCR | Claude 4, Gemini 2.5 |
| Image gen | GPT-Image (DALL-E 3 successor), FLUX (open) |
| TTS / voice | OpenAI TTS, ElevenLabs |
| Speech-to-text | Whisper, Azure Speech |
| On-device / edge | Phi-3/4, Llama 3 small via ONNX/Ollama |
| Privacy / sovereignty | Self-hosted Llama, Phi via ONNX |
| Specialty: legal docs | voyage-law-2 (embeddings) + Claude 4 |
| Specialty: medical | Frontier with domain RAG (skip Med-PaLM unless specific) |
By cost tier
Prices are approximate, in USD per million tokens ($/M).
Free / cheap:
- gpt-4o-mini ($0.15/M)
- Claude Haiku ($0.25/M)
- Gemini Flash ($0.30/M)
- Llama 3 70B via Groq ($0.20/M)
Mid-cost:
- Claude 4 Sonnet ($3/M)
- GPT-4o ($2.50/M)
- Gemini 2.5 Pro ($3.50/M)
Frontier:
- GPT-5 ($15-30/M)
- Claude 4 Opus ($15/M)
- o3 ($20+/M)
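The tier prices translate into a monthly budget with simple arithmetic: requests × tokens per request ÷ 1,000,000 × rate. The sketch below is a rough illustration only. The model keys are hypothetical labels, and it assumes a single flat rate per model, whereas real pricing usually differs for input and output tokens.

```csharp
using System;
using System.Collections.Generic;

public static class CostEstimator
{
    // Illustrative flat rates in USD per million tokens, copied from the cost tiers.
    // Assumption: one rate per model; the input/output price split is ignored.
    public static readonly IReadOnlyDictionary<string, decimal> UsdPerMillionTokens =
        new Dictionary<string, decimal>
        {
            ["gpt-4o-mini"] = 0.15m,
            ["claude-4-sonnet"] = 3.00m,
            ["claude-4-opus"] = 15.00m,
        };

    // Monthly cost = requests * tokens per request / 1,000,000 * rate.
    public static decimal MonthlyCostUsd(string model, long requestsPerMonth, int tokensPerRequest)
    {
        decimal totalTokens = (decimal)requestsPerMonth * tokensPerRequest;
        return totalTokens / 1_000_000m * UsdPerMillionTokens[model];
    }
}
```

Under these assumptions, `CostEstimator.MonthlyCostUsd("gpt-4o-mini", 1_000_000, 2_000)` comes to $300 per month, while the same traffic on `"claude-4-opus"` comes to $30,000, a 100× gap. That gap is the case for routing simple requests to the cheap tier.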
By context window
~8-32K: most older models, Phi
~128K: GPT-4o, GPT-4-turbo
~200K: Claude 4 default
~1M: Claude 4 extended, Gemini 2.5 Pro
~2M: Gemini 2.5 Pro extended
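To pick a window tier for a given document, estimate its token count and add headroom for the response. The sketch below is a minimal illustration. The 4-characters-per-token ratio is a crude assumption for English text (use the provider's tokenizer for real counts), and the class and method names are hypothetical.

```csharp
#nullable enable
using System;

public static class ContextFit
{
    // Window tiers from the list above, in tokens, smallest first.
    private static readonly (string Tier, int Tokens)[] Tiers =
    {
        ("~32K", 32_000),
        ("~128K", 128_000),
        ("~200K", 200_000),
        ("~1M", 1_000_000),
        ("~2M", 2_000_000),
    };

    // Crude estimate (assumption): about 4 characters per token, rounded up.
    public static int EstimateTokens(int characterCount) => (characterCount + 3) / 4;

    // Smallest tier that fits the prompt plus reserved output tokens, or null if none fits.
    public static string? SmallestTier(int promptTokens, int reservedOutputTokens)
    {
        long needed = (long)promptTokens + reservedOutputTokens;
        foreach (var (tier, tokens) in Tiers)
        {
            if (needed <= tokens) return tier;
        }
        return null;
    }
}
```

For example, a 600,000-character document is estimated at 150,000 tokens. With 4,000 tokens reserved for the answer, it no longer fits the ~128K tier and lands in ~200K.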
By latency at p99
Fastest: Groq Llama (~500 tok/s), Gemini Flash, gpt-4o-mini
Mid: GPT-4o, Claude Sonnet
Slow: Claude Opus, GPT-5, reasoning models (o3 etc.)
Reasoning models emit thinking tokens before the answer, so their effective latency (time until the first useful token) is high.
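The effect of thinking tokens can be approximated with a simple model: time to first token plus generated tokens divided by throughput. This is a back-of-the-envelope sketch. Apart from the ~500 tok/s figure quoted above, every number in the usage note is a hypothetical input, not a measurement.

```csharp
using System;

public static class LatencyModel
{
    // Seconds until the first *answer* token appears: the model must first
    // stream all of its thinking tokens.
    public static double SecondsToFirstAnswerToken(
        double timeToFirstTokenSeconds, int thinkingTokens, double tokensPerSecond)
    {
        if (tokensPerSecond <= 0) throw new ArgumentOutOfRangeException(nameof(tokensPerSecond));
        return timeToFirstTokenSeconds + thinkingTokens / tokensPerSecond;
    }

    // Seconds until the full response (thinking + answer) has been generated.
    public static double TotalSeconds(
        double timeToFirstTokenSeconds, int thinkingTokens, int answerTokens, double tokensPerSecond)
    {
        if (tokensPerSecond <= 0) throw new ArgumentOutOfRangeException(nameof(tokensPerSecond));
        return timeToFirstTokenSeconds + (thinkingTokens + answerTokens) / tokensPerSecond;
    }
}
```

With an assumed 0.2 s time to first token, a 300-token answer at 500 tok/s finishes in about 0.8 s. A reasoning model assumed to run at 50 tok/s with 2,000 thinking tokens and a 1 s time to first token shows its first answer token after 41 s and finishes after 47 s.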
Decision tree
Need offline / privacy?
├→ YES: Phi via ONNX or Llama via Ollama
└→ NO:
    Need video processing?
    ├→ YES: Gemini 2.5 Pro
    └→ NO:
        Long doc (>200K)?
        ├→ YES: Claude 4 (1M) or Gemini 2.5 Pro (2M)
        └→ NO:
            Hard math/logic?
            ├→ YES: o3 / DeepSeek R1
            └→ NO:
                Cost-sensitive volume?
                ├→ YES: gpt-4o-mini / Claude Haiku / Gemini Flash
                └→ NO:
                    Code-heavy?
                    ├→ YES: Claude 4 Sonnet
                    └→ NO: GPT-4o (default)
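The tree above is a chain of early returns, which makes it easy to encode and unit-test. The sketch below mirrors the branches in order. The `Workload` record and its property names are hypothetical, and the returned strings are the recommendations from the tree, not API model identifiers.

```csharp
using System;

// Hypothetical description of a workload; one flag per question in the tree.
public sealed record Workload(
    bool NeedsOfflineOrPrivacy,
    bool NeedsVideo,
    int PromptTokens,
    bool HardMathOrLogic,
    bool CostSensitiveVolume,
    bool CodeHeavy);

public static class ModelDecisionTree
{
    // Questions are checked in the same order as the tree; the first match wins.
    public static string Recommend(Workload w)
    {
        if (w.NeedsOfflineOrPrivacy) return "Phi via ONNX or Llama via Ollama";
        if (w.NeedsVideo) return "Gemini 2.5 Pro";
        if (w.PromptTokens > 200_000) return "Claude 4 (1M) or Gemini 2.5 Pro (2M)";
        if (w.HardMathOrLogic) return "o3 / DeepSeek R1";
        if (w.CostSensitiveVolume) return "gpt-4o-mini / Claude Haiku / Gemini Flash";
        if (w.CodeHeavy) return "Claude 4 Sonnet";
        return "GPT-4o (default)";
    }
}
```

Because order matters, a workload that is both code-heavy and cost-sensitive resolves to the cheap tier. If that is not the intended priority for an application, reorder the checks rather than adding special cases.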
Multi-model routing
Production apps often route per request:
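// Assumes IChatClient from Microsoft.Extensions.AI; the client fields and
// predicate helpers are defined elsewhere in the class.
public IChatClient Pick(string query)
{
    if (RequiresReasoning(query)) return _o3;        // hard math / logic
    if (IsLongDoc(query))         return _claude;    // large context window
    if (IsSimpleQuery(query))     return _gpt4oMini; // cheap and fast
    return _gpt4o;                                   // general-purpose default
}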
Routing adds complexity in exchange for a better cost/quality balance.
Benchmarks vs reality
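MMLU and HumanEval are public benchmarks, so scores can be inflated by training-data contamination and say little about your domain, where quality can differ wildly. Always evaluate on your own data.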
Tools: Ragas (RAG eval), LLM-as-judge, expert review.
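Evaluating on your own data can start very small: a list of prompt/expected pairs and a metric. The sketch below uses exact match, the simplest possible metric, and represents each candidate model as a plain `Func<string, string>` so no vendor SDK is assumed. The names are hypothetical, and real evaluations usually need a softer metric such as LLM-as-judge or expert review.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DomainEval
{
    // Fraction of cases where the model's answer equals the expected answer
    // after trimming, ignoring case.
    public static double Accuracy(
        Func<string, string> model,
        IReadOnlyList<(string Prompt, string Expected)> cases)
    {
        if (cases.Count == 0) throw new ArgumentException("No eval cases.", nameof(cases));

        int correct = cases.Count(c => string.Equals(
            model(c.Prompt).Trim(), c.Expected.Trim(), StringComparison.OrdinalIgnoreCase));

        return (double)correct / cases.Count;
    }
}
```

Running the same case list against each candidate, with every model wrapped in a delegate, gives comparable numbers per model, which is what a move to a cheaper or stronger model should be based on.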
Senior considerations
- Default to one model; move a route to a different model only after measurement justifies it.
- Multi-vendor for resilience: don't depend on a single provider.
- Track usage by route — find expensive cohorts.
- Re-evaluate quarterly — model quality + pricing change fast.
- Test edge cases — frontier models still fail on weird inputs.
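Tracking usage by route, as suggested above, amounts to logging one record per call and grouping by route. The sketch below is a minimal in-memory illustration. The `UsageEvent` shape is hypothetical, and it assumes a flat per-million-token rate is recorded with each call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical log record: one per model call.
public sealed record UsageEvent(string Route, string Model, long Tokens, decimal UsdPerMillionTokens);

public static class UsageReport
{
    // Total spend per route, most expensive first.
    public static IReadOnlyList<(string Route, decimal CostUsd)> CostByRoute(IEnumerable<UsageEvent> events) =>
        events
            .GroupBy(e => e.Route)
            .Select(g => (
                Route: g.Key,
                CostUsd: g.Sum(e => e.Tokens / 1_000_000m * e.UsdPerMillionTokens)))
            .OrderByDescending(x => x.CostUsd)
            .ToList();
}
```

The first entries of the result are the expensive cohorts, and therefore the routes where testing a cheaper model is most likely to pay off.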
Anti-patterns
- ❌ "GPT-5 for everything" — overkill cost.
- ❌ "Cheapest for everything" — quality issues.
- ❌ "Reasoning model for chitchat" — wasteful.
- ❌ "No measurement, just feels" — confirmation bias.