Overview: The 2026 Model Market
Key Points
- Frontier closed-source: OpenAI (GPT-5 / o-series), Anthropic (Claude 4 family), Google (Gemini 2.5).
- Strong open-weights: Meta Llama 3/4, Mistral, DeepSeek (R1 reasoning), Qwen (Alibaba).
- Small + edge: Microsoft Phi-3/Phi-4, Llama small, Gemma.
- Specialty: embeddings (text-embedding-3, voyage-3, Cohere), rerankers, vision, audio.
- Multi-vendor strategy is the senior default: pin to a provider only after measurement; abstract via IChatClient.
Frontier model families (2026)
OpenAI
- GPT-5 family: flagship reasoning + multimodal.
- GPT-4o / 4o-mini: cost-balanced; 128K context.
- o-series (o1, o3): reasoning models — explicit chain-of-thought; better math/code/logic.
- Embedding: text-embedding-3-small / -large.
- Whisper: audio transcription.
Anthropic
- Claude 4 Opus / Sonnet / Haiku: three tiers trading capability against cost.
- 1M context in select models.
- Computer use (Claude operates a browser/computer).
- Long-context strengths — extended reasoning over big docs.
Google
- Gemini 2.5 Pro / Flash: multimodal native (text, image, video, audio).
- 2M context in Pro tier.
- Vertex AI for managed access; AI Studio for direct API.
Open weights
- Meta Llama 3.x / 4: state-of-the-art open weights; self-hosted or via Together/Groq/Replicate.
- Mistral: French; competitive small models (Mistral Large; Mixtral MoE).
- DeepSeek R1: reasoning-focused; open weights.
- Qwen (Alibaba): strong Chinese + multilingual; competitive English.
Microsoft Phi (small)
- Phi-3 / Phi-4: SLMs (Small Language Models). 3-14B params.
- Edge / on-device via ONNX Runtime GenAI.
- "Surprisingly good" for size.
What "best" means depends on workload
| Workload | Top contenders |
|---|---|
| General Q&A | GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash |
| Coding | Claude 4 Sonnet, GPT-5, o-series, DeepSeek-Coder |
| Math / logic | o-series, DeepSeek R1, Claude 4 Opus |
| Long context | Claude 4 (1M), Gemini 2.5 Pro (2M) |
| Cost-sensitive | gpt-4o-mini, Claude Haiku, Gemini Flash, Llama 3.x |
| On-device / privacy | Phi-3/4, Llama small via Ollama |
| Multimodal | GPT-4o (text+image+audio), Gemini 2.5 (everything), Claude 4 (vision) |
See Model Selection Decision Matrix.
Pricing (rough, 2026)
GPT-5: $15-30 / M input (most expensive)
GPT-4o: $2.50 / M input
GPT-4o-mini: $0.15 / M input (very cheap)
Claude 4 Opus: $15 / M input
Claude 4 Sonnet: $3 / M input
Claude 4 Haiku: $0.25 / M input
Gemini 2.5 Pro: $3.50 / M input
Gemini 2.5 Flash: $0.30 / M input
Llama 3 (Together): $0.20-1 / M input
Self-hosted Llama: fixed GPU $$/hr
Output tokens cost roughly 3-5x input tokens. Prompt caching (Anthropic, OpenAI) gives a large discount on repeated prefixes. See Pricing & Cost Engineering.
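A minimal sketch of per-request cost arithmetic, assuming the rough input prices listed above. The model keys are illustrative labels, not official API identifiers, and `OUTPUT_MULTIPLIER = 4.0` is an assumed midpoint of the 3-5x rule of thumb. Check current provider price sheets before relying on any figure.

```python
# Rough per-request cost estimate. Prices are the approximate figures from the
# list above; the output multiplier is an assumption, not a published rate.

INPUT_PRICE_PER_M = {  # USD per million input tokens (illustrative keys)
    "gpt-4o": 2.50,
    "gpt-4o-mini": 0.15,
    "claude-4-sonnet": 3.00,
    "claude-4-haiku": 0.25,
    "gemini-2.5-flash": 0.30,
}
OUTPUT_MULTIPLIER = 4.0  # assumed midpoint of the "3-5x input" range


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    input_price = INPUT_PRICE_PER_M[model]
    output_price = input_price * OUTPUT_MULTIPLIER
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


if __name__ == "__main__":
    for name in INPUT_PRICE_PER_M:
        cost = estimate_cost(name, input_tokens=2_000, output_tokens=500)
        print(f"{name}: ${cost:.6f} per request")
```

Multiplying the per-request figure by expected daily volume gives a first-order budget; prompt caching and batch discounts are not modeled here.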
Capability tiers (informal)
Tier S (frontier flagships): GPT-5, Claude 4 Opus, Gemini 2.5 Pro, o3
Tier A (production workhorses): GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash, Llama 3.3
Tier B (cost-sensitive smart): GPT-4o-mini, Claude Haiku, Gemini 2.5 Flash, Llama 3.x small
Tier C (small/edge): Phi-4, Llama 3 8B, Mistral 7B
Reasoning models
OpenAI o-series, DeepSeek R1, Claude with extended thinking. Trained for explicit chain-of-thought. Slower per token; better on hard reasoning. Different cost profile (more output tokens for thinking).
When to reach for reasoning: math, complex coding, multi-step planning. Don't use for simple Q&A — wasted cost.
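The routing rule above can be sketched as a small lookup. The task labels and the default model names are hypothetical placeholders. A real router would classify requests rather than trust a caller-supplied label.

```python
# Route only hard-reasoning work to a reasoning model; everything else goes
# to a cheap general model. Task labels and model names are assumptions.

REASONING_TASKS = {"math", "complex_coding", "multi_step_planning"}


def pick_model(
    task_type: str,
    reasoning_model: str = "o3",
    default_model: str = "gpt-4o-mini",
) -> str:
    """Return the model name to use for a given task label."""
    if task_type in REASONING_TASKS:
        return reasoning_model
    return default_model


if __name__ == "__main__":
    for task in ("math", "simple_qa", "multi_step_planning"):
        print(task, "->", pick_model(task))
```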
Multimodal
| Model | Inputs |
|---|---|
| GPT-4o | text, image, audio |
| Claude 4 | text, image, "computer use" |
| Gemini 2.5 | text, image, video, audio |
| Llama 3 Vision | text, image |
For vision-heavy: GPT-4o or Gemini 2.5. For document images: Claude 4 (strong OCR-like ability).
Self-hosted vs API
| Path | When |
|---|---|
| API (managed) | Default. Wins on upfront cost, ops burden, model quality, and scaling |
| Self-hosted Llama / Mistral | High volume + cost-sensitive; data sovereignty |
| Edge (Ollama, Phi via ONNX) | Privacy; offline |
For most teams: API. Self-host only with measured cost win + ops capacity.
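One way to check for a "measured cost win" is a break-even estimate: the monthly token volume at which a fixed GPU bill equals the API bill. This is a sketch under simplifying assumptions: the GPU hourly rate is hypothetical, only a single blended per-token API price is used, and throughput limits and ops staffing are ignored.

```python
# Break-even between a fixed-cost self-hosted GPU and per-token API pricing.
# All rates passed in are assumptions supplied by the caller.

HOURS_PER_MONTH = 730  # average hours in a calendar month


def self_host_monthly_cost(gpu_hourly_usd: float, gpu_count: int = 1) -> float:
    """Fixed monthly cost of keeping the GPUs running around the clock."""
    return gpu_hourly_usd * gpu_count * HOURS_PER_MONTH


def break_even_tokens_per_month(
    gpu_hourly_usd: float,
    api_price_per_m_usd: float,
    gpu_count: int = 1,
) -> float:
    """Monthly token volume at which API spend equals the self-hosted bill."""
    monthly = self_host_monthly_cost(gpu_hourly_usd, gpu_count)
    return monthly / api_price_per_m_usd * 1_000_000


if __name__ == "__main__":
    tokens = break_even_tokens_per_month(gpu_hourly_usd=2.0, api_price_per_m_usd=1.0)
    print(f"Break-even: {tokens:,.0f} tokens per month")
```

Below the break-even volume the API is cheaper; above it, self-hosting only wins if the hardware can actually serve that volume and the team can run it.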
Geographic / compliance
- Azure OpenAI: in your Azure region, data residency.
- AWS Bedrock: hosts Anthropic, Meta, Mistral, Amazon Nova.
- GCP Vertex AI: Gemini, Anthropic, others.
- Oracle: hosts some third-party models.
- Sovereignty regions: gov, EU, China.
Model cycle
Models age fast. Today's flagship → tomorrow's cheap commodity. Build vendor-portable code (IChatClient) so you can switch.
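A minimal sketch of what vendor-portable code looks like. IChatClient is a .NET abstraction; the Python `ChatClient` protocol, `ChatMessage`, and `EchoClient` below are hypothetical analogues written for illustration and do not mirror its actual API.

```python
# Application code depends on a small interface, not on a vendor SDK.
# All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ChatMessage:
    role: str      # "system", "user", or "assistant"
    content: str


class ChatClient(Protocol):
    """The only surface application code is allowed to touch."""

    def complete(self, messages: list[ChatMessage]) -> str: ...


class EchoClient:
    """Stand-in provider; a real adapter would wrap a vendor SDK call here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, messages: list[ChatMessage]) -> str:
        return f"[{self.name}] {messages[-1].content}"


def summarize(client: ChatClient, text: str) -> str:
    """Business logic written once, independent of the provider behind it."""
    messages = [
        ChatMessage("system", "Summarize the user's text in one sentence."),
        ChatMessage("user", text),
    ]
    return client.complete(messages)


if __name__ == "__main__":
    # Switching vendors means swapping the adapter, not rewriting summarize().
    print(summarize(EchoClient("vendor-a"), "hello"))
    print(summarize(EchoClient("vendor-b"), "hello"))
```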
Senior strategy
- Default to a frontier provider (Azure OpenAI typically).
- Prototype with cheap tier (gpt-4o-mini).
- Measure quality / latency / cost.
- Promote to expensive only where measurably better.
- Multi-provider fallback for resilience.
- Keep abstraction layer (IChatClient).
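The multi-provider fallback item can be sketched as an ordered chain that tries each provider until one succeeds. The provider callables here are stand-ins, and catching every `Exception` is a simplification; production code would more likely retry only on timeouts, rate limits, and server errors.

```python
# Try providers in priority order and return the first successful answer.
# Provider functions below are fake stand-ins, not real SDK calls.
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Return (provider_name, response) from the first provider that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # simplification: narrow this in real code
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def flaky(prompt: str) -> str:
    """Simulates a primary provider that is currently down."""
    raise TimeoutError("primary timed out")


def healthy(prompt: str) -> str:
    """Simulates a secondary provider that responds normally."""
    return f"ok: {prompt}"


if __name__ == "__main__":
    chain = [("primary", flaky), ("secondary", healthy)]
    print(complete_with_fallback(chain, "hi"))
```

Recording which provider answered (the first tuple element) keeps quality and cost measurements attributable when traffic shifts to the fallback.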