Model Selection Decision Matrix

Key Points

Pick a model by workload axis: latency, cost, context, function calling, structured output, vision, code, reasoning. The top contenders rotate as models evolve, so re-evaluate quarterly.

Decision matrix

By criteria

Criterion	Top picks (2026)
Lowest cost / best value	gpt-4o-mini, Claude Haiku, Gemini 2.5 Flash, Llama 3.x via Groq
Highest quality	GPT-5, Claude 4 Opus, Gemini 2.5 Pro, o3
Code generation	Claude 4 Sonnet, GPT-5, o3, DeepSeek Coder
Long context	Claude 4 (1M), Gemini 2.5 Pro (2M)
Math / logic	o3, DeepSeek R1, Claude 4 Opus extended thinking
Function calling	OpenAI (strict mode mature), Claude (mature), Gemini (improving)
Structured output	OpenAI (JSON Schema strict), Claude (tool use → JSON)
Multimodal text+image	GPT-4o, Claude 4, Gemini 2.5
Multimodal text+audio	GPT-4o, Gemini 2.5
Multimodal text+video	Gemini 2.5
Real-time / low latency	gpt-4o-mini, Groq-hosted Llama, Claude Haiku
Reasoning (CoT)	o3, DeepSeek R1, Claude 4 with extended thinking
Embedding quality	text-embedding-3-large, voyage-3
Embedding cost	text-embedding-3-small
Code embedding	voyage-code-3
Reranker	Cohere Rerank, Jina
Vision OCR	Claude 4, Gemini 2.5
Image gen	gpt-image-1 (DALL-E 3 successor), FLUX (open)
TTS / voice	OpenAI TTS, ElevenLabs
Speech-to-text	Whisper, Azure Speech
On-device / edge	Phi-3/4, Llama 3 small via ONNX/Ollama
Privacy / sovereignty	Self-hosted Llama, Phi via ONNX
Specialty: legal docs	voyage-law-2 (embeddings) + Claude 4
Specialty: medical	Frontier model with domain RAG (skip Med-PaLM unless you have a specific need for it)

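By cost tier

Prices are approximate USD per 1M input tokens; output tokens are billed at a higher rate. Confirm against current provider pricing.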

Free / cheap:  gpt-4o-mini ($0.15/M)
                Claude Haiku ($0.25/M)
                Gemini Flash ($0.30/M)
                Llama 3 70B via Groq ($0.20/M)

Mid-cost:       Claude 4 Sonnet ($3/M)
                GPT-4o ($2.50/M)
                Gemini 2.5 Pro ($3.50/M)

Frontier:       GPT-5 ($15-30/M)
                Claude 4 Opus ($15/M)
                o3 ($20+/M)
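
The per-million prices above turn into a monthly budget with simple arithmetic. The following is a minimal sketch, assuming the listed figures are input-token prices; `CostEstimator`, its parameters, and the output price used in the example are hypothetical and not part of any provider's API.

```csharp
using System;

// Hypothetical helper: estimates monthly spend from per-million-token prices.
public static class CostEstimator
{
    // Prices are USD per 1M tokens; token counts are per-request averages.
    public static decimal MonthlyCostUsd(
        long requestsPerMonth,
        int inputTokensPerRequest,
        int outputTokensPerRequest,
        decimal inputPricePerMillion,
        decimal outputPricePerMillion)
    {
        decimal inputCost =
            requestsPerMonth * (decimal)inputTokensPerRequest / 1_000_000m * inputPricePerMillion;
        decimal outputCost =
            requestsPerMonth * (decimal)outputTokensPerRequest / 1_000_000m * outputPricePerMillion;
        return inputCost + outputCost;
    }
}
```

For example, 1M requests per month at 1,000 input and 200 output tokens each, priced at $0.15/M input and an assumed $0.60/M output, gives `CostEstimator.MonthlyCostUsd(1_000_000, 1000, 200, 0.15m, 0.60m)`, which is $150 of input plus $120 of output, or $270 in total. Running the same workload through a frontier-tier price shows quickly whether the quality gain is worth the difference.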

By context window

~8-32K:  most older models, Phi
~128K:   GPT-4o, GPT-4-turbo
~200K:   Claude 4 default
~1M:     Claude 4 extended, Gemini 2.5 Pro
~2M:     Gemini 2.5 Pro extended
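
To map a document onto one of the tiers above, a rough token estimate is usually enough. This sketch assumes about 4 characters per token for English text, which is only a heuristic; use the provider's tokenizer when precision matters. `ContextTier` and its method names are hypothetical.

```csharp
using System;

public static class ContextTier
{
    // Rough heuristic (assumption): about 4 characters per token for English text.
    public static int EstimateTokens(string text) => (text.Length + 3) / 4;

    // Smallest tier from the list above that fits the input plus reserved output tokens.
    public static string SmallestFit(int inputTokens, int reservedOutputTokens = 4_000)
    {
        long needed = (long)inputTokens + reservedOutputTokens;
        if (needed <= 32_000) return "~8-32K";
        if (needed <= 128_000) return "~128K";
        if (needed <= 200_000) return "~200K";
        if (needed <= 1_000_000) return "~1M";
        if (needed <= 2_000_000) return "~2M";
        throw new ArgumentOutOfRangeException(
            nameof(inputTokens), "Exceeds every listed window; chunk or retrieve instead.");
    }
}
```

Reserving output tokens matters because the window covers the prompt and the response together; the 4,000-token default here is an arbitrary illustrative value.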

By latency (relative ranking; measure p99 on your own traffic)

Fastest:  Groq-hosted Llama (~500 tok/s output throughput), Gemini Flash, gpt-4o-mini
Mid:      GPT-4o, Claude Sonnet
Slow:     Claude Opus, GPT-5, reasoning models (o3 etc.)

Reasoning models emit thinking tokens before the answer, so their effective latency is much higher than raw throughput suggests.
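
The effect of thinking tokens on wait time can be estimated with a simple model: time to first token plus all generated tokens divided by throughput. This is a back-of-the-envelope sketch; `LatencyEstimator` is a hypothetical name and the numbers in the example are made up for illustration, not measurements.

```csharp
using System;

public static class LatencyEstimator
{
    // Seconds until the full answer is available:
    // time to first token + (thinking tokens + answer tokens) / throughput.
    public static double SecondsToFullAnswer(
        double timeToFirstTokenSeconds,
        int thinkingTokens,
        int answerTokens,
        double tokensPerSecond)
    {
        if (tokensPerSecond <= 0)
            throw new ArgumentOutOfRangeException(nameof(tokensPerSecond));
        return timeToFirstTokenSeconds + (thinkingTokens + answerTokens) / tokensPerSecond;
    }
}
```

With an illustrative 0.5 s time to first token and 100 tok/s, a 300-token answer takes 3.5 s without thinking tokens and 23.5 s with 2,000 thinking tokens in front of it.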

Decision tree

Need offline / privacy?
  └→ YES: Phi via ONNX or Llama via Ollama
  └→ NO:
     Need video processing?
       └→ YES: Gemini 2.5 Pro
       └→ NO:
          Long doc (>200K)?
            └→ YES: Claude 4 (1M) or Gemini 2.5 Pro (2M)
            └→ NO:
               Hard math/logic?
                 └→ YES: o3 / DeepSeek R1
                 └→ NO:
                    Cost-sensitive volume?
                      └→ YES: gpt-4o-mini / Claude Haiku / Gemini Flash
                      └→ NO:
                         Code-heavy?
                           └→ YES: Claude 4 Sonnet
                           └→ NO: GPT-4o (default)
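
The tree above can be encoded as an ordered chain of checks where the first match wins. A minimal sketch: the `Workload` record and its field names are hypothetical, and the returned strings are the labels from the tree, not API model identifiers.

```csharp
using System;

// Hypothetical description of a workload; field names are illustrative.
public sealed record Workload(
    bool NeedsOfflineOrPrivacy,
    bool NeedsVideo,
    int MaxInputTokens,
    bool HardMathOrLogic,
    bool CostSensitiveVolume,
    bool CodeHeavy);

public static class ModelPicker
{
    // Mirrors the tree top to bottom; the first matching branch wins.
    public static string Pick(Workload w)
    {
        if (w.NeedsOfflineOrPrivacy) return "Phi via ONNX or Llama via Ollama";
        if (w.NeedsVideo) return "Gemini 2.5 Pro";
        if (w.MaxInputTokens > 200_000) return "Claude 4 (1M) or Gemini 2.5 Pro (2M)";
        if (w.HardMathOrLogic) return "o3 / DeepSeek R1";
        if (w.CostSensitiveVolume) return "gpt-4o-mini / Claude Haiku / Gemini Flash";
        if (w.CodeHeavy) return "Claude 4 Sonnet";
        return "GPT-4o (default)";
    }
}
```

Keeping the order explicit makes the priorities reviewable: privacy constraints override everything else, and cost sensitivity is checked before code specialization, exactly as in the tree.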

Multi-model routing

Production apps often route per request:

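// Route each request to a suitable client; checks run in priority order.
// The clients (_o3, _claude, ...) are IChatClient fields on the enclosing class;
// the predicates are app-specific heuristics you supply.
public IChatClient Pick(string query)
{
    if (RequiresReasoning(query)) return _o3;        // hard math / logic
    if (IsLongDoc(query))         return _claude;    // long-context input
    if (IsSimpleQuery(query))     return _gpt4oMini; // cheap, high-volume traffic
    return _gpt4o;                                   // general-purpose default
}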

Routing adds complexity in exchange for a better cost/quality balance.

Benchmarks vs reality

Benchmarks such as MMLU and HumanEval are public, so models may have seen them during training, and scores on them can differ widely from quality on your domain. Always evaluate on your own data.

Tools: Ragas (RAG eval), LLM-as-judge, expert review.
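
Evaluating on your own data can start very small. The sketch below scores a model on domain cases with a plain substring check, which is the simplest possible grader and a stand-in for Ragas, LLM-as-judge, or expert review. `EvalCase`, `DomainEval`, and the `model` delegate are hypothetical; the delegate represents whatever client call you actually make.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One domain test case: a prompt and text the answer must contain.
public sealed record EvalCase(string Prompt, string ExpectedSubstring);

public static class DomainEval
{
    // Returns the fraction of cases whose answer contains the expected substring.
    // `model` is a stand-in for a real client call (prompt in, answer text out).
    public static double PassRate(Func<string, string> model, IReadOnlyList<EvalCase> cases)
    {
        if (cases.Count == 0)
            throw new ArgumentException("No eval cases.", nameof(cases));

        int passed = cases.Count(c =>
            model(c.Prompt).Contains(c.ExpectedSubstring, StringComparison.OrdinalIgnoreCase));
        return (double)passed / cases.Count;
    }
}
```

Running the same case list against each candidate model gives a comparable pass rate per model, which is a firmer basis for switching than public benchmark scores.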

Senior considerations

Default to one model; move a route to a stronger or additional model only after measurement shows the need.
Multi-vendor for resilience: don't depend on a single provider.
Track usage by route — find expensive cohorts.
Re-evaluate quarterly — model quality + pricing change fast.
Test edge cases — frontier models still fail on weird inputs.
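
Tracking usage by route, as recommended above, needs little more than a counter keyed by route label. A minimal in-memory sketch: `RouteUsageTracker` is a hypothetical name, the cost per request is assumed to be computed elsewhere, and the class is not thread-safe as written.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class RouteUsageTracker
{
    private readonly Dictionary<string, (long Requests, decimal CostUsd)> _byRoute = new();

    // Record one completed request against its route label.
    public void Record(string route, decimal costUsd)
    {
        _byRoute.TryGetValue(route, out var current); // default (0, 0) when the route is new
        _byRoute[route] = (current.Requests + 1, current.CostUsd + costUsd);
    }

    // Routes ordered by total spend, most expensive first.
    public IReadOnlyList<(string Route, long Requests, decimal CostUsd)> MostExpensive() =>
        _byRoute
            .OrderByDescending(kv => kv.Value.CostUsd)
            .Select(kv => (Route: kv.Key, Requests: kv.Value.Requests, CostUsd: kv.Value.CostUsd))
            .ToList();
}
```

In production this would more likely be a metric with a route tag in your existing telemetry system; the point of the sketch is only that every request carries a route label and a cost.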

Anti-patterns

❌ "GPT-5 for everything" — overkill cost.
❌ "Cheapest for everything" — quality issues.
❌ "Reasoning model for chitchat" — wasteful.
❌ "No measurement, just feels" — confirmation bias.