Overview: The 2026 Model Market

Key Points

Frontier closed-source: OpenAI (GPT-5 / o-series), Anthropic (Claude 4 family), Google (Gemini 2.5).
Strong open-weights: Meta Llama 3/4, Mistral, DeepSeek (R1 reasoning), Qwen (Alibaba).
Small + edge: Microsoft Phi-3/Phi-4, Llama small, Gemma.
Specialty: embeddings (text-embedding-3, voyage-3, Cohere), rerankers, vision, audio.
Multi-vendor strategy is the senior default — pin to provider only after measurement; abstract via IChatClient.

Frontier model families (2026)

OpenAI

GPT-5 family: flagship reasoning + multimodal.
GPT-4o / 4o-mini: cost-balanced; 128K context.
o-series (o1, o3): reasoning models — explicit chain-of-thought; better math/code/logic.
Embedding: text-embedding-3-small / -large.
Whisper: audio transcription.

Anthropic

Claude 4 Opus / Sonnet / Haiku: three tiers trading capability against cost.
1M context in select models.
Computer use (Claude operates a browser/computer).
Long-context strengths — extended reasoning over big docs.

Google

Gemini 2.5 Pro / Flash: multimodal native (text, image, video, audio).
2M context in Pro tier.
Vertex AI for managed access; AI Studio for direct API.

Open weights

Meta Llama 3.x / 4: state-of-the-art open weights; self-hosted or via Together/Groq/Replicate.
Mistral: French; competitive small models (Mistral Large; Mixtral MoE).
DeepSeek R1: reasoning-focused; open weights.
Qwen (Alibaba): strong Chinese + multilingual; competitive English.

Microsoft Phi (small)

Phi-3 / Phi-4: SLMs (Small Language Models). 3-14B params.
Edge / on-device via ONNX Runtime GenAI.
"Surprisingly good" for size.

What "best" means depends on workload

Workload	Top contenders
General Q&A	GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash
Coding	Claude 4 Sonnet, GPT-5, o-series, DeepSeek-Coder
Math / logic	o-series, DeepSeek R1, Claude 4 Opus
Long context	Claude 4 (1M), Gemini 2.5 Pro (2M)
Cost-sensitive	gpt-4o-mini, Claude Haiku, Gemini Flash, Llama 3.x
On-device / privacy	Phi-3/4, Llama small via Ollama
Multimodal	GPT-4o (text+image+audio), Gemini 2.5 (everything), Claude 4 (vision)

See Model Selection Decision Matrix.

Pricing (rough, 2026)

GPT-5:                $15-30 / M input  (most expensive)
GPT-4o:               $2.50 / M input
GPT-4o-mini:          $0.15 / M input  (very cheap)
Claude 4 Opus:        $15 / M input
Claude 4 Sonnet:      $3 / M input
Claude 4 Haiku:       $0.25 / M input
Gemini 2.5 Pro:       $3.50 / M input
Gemini 2.5 Flash:     $0.30 / M input
Llama 3 (Together):   $0.20-1 / M input
Self-hosted Llama:    fixed GPU $$/hr

Output tokens cost roughly 3-5x input tokens. Prompt caching (Anthropic, OpenAI) gives a large discount on repeated prefixes. See Pricing & Cost Engineering.
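To make the pricing list concrete, here is a minimal sketch of a per-request cost estimate. The dictionary keys are hypothetical labels, the prices are the rough figures listed above (not quoted vendor rates), and the 4x output multiplier is an assumption taken from the middle of the 3-5x rule of thumb. Prompt caching discounts are not modeled.

```python
# Rough input prices in USD per million tokens, copied from the list above.
# These are ballpark figures from this document, not quoted vendor rates.
INPUT_PRICE_PER_M = {
    "gpt-4o": 2.50,
    "gpt-4o-mini": 0.15,
    "claude-4-sonnet": 3.00,
    "claude-4-haiku": 0.25,
    "gemini-2.5-flash": 0.30,
}

# Assumption: output costs 4x input, the midpoint of the 3-5x rule of thumb.
OUTPUT_MULTIPLIER = 4.0


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  output_multiplier: float = OUTPUT_MULTIPLIER) -> float:
    """Return the estimated USD cost of a single request."""
    input_price = INPUT_PRICE_PER_M[model]
    output_price = input_price * output_multiplier
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1M requests/month, 2,000 input + 500 output tokens each.
    for model in INPUT_PRICE_PER_M:
        monthly = estimate_cost(model, 2_000, 500) * 1_000_000
        print(f"{model:18s} ${monthly:,.0f} / month")
```

Running the same workload through each entry shows how quickly the gap between tiers compounds at volume, which is the reason to measure before promoting to an expensive model.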

Capability tiers (informal)

Tier S (frontier flagships):   GPT-5, Claude 4 Opus, Gemini 2.5 Pro, o3
Tier A (production workhorses): GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash, Llama 3.3
Tier B (cost-sensitive smart): GPT-4o-mini, Claude Haiku, Gemini 2.5 Flash, Llama 3.x small
Tier C (small/edge):           Phi-4, Llama 3 8B, Mistral 7B

Reasoning models

OpenAI o-series, DeepSeek R1, Claude with extended thinking. Trained for explicit chain-of-thought. Slower per token; better on hard reasoning. Different cost profile (more output tokens for thinking).

When to reach for reasoning: math, complex coding, multi-step planning. Don't use for simple Q&A — wasted cost.
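The rule of thumb above can be sketched as a small router. This is an illustration only: the `Task` categories mirror the task kinds named in the text, and the model names `"reasoning-model"` and `"general-model"` are hypothetical placeholders rather than real model identifiers.

```python
from enum import Enum


class Task(Enum):
    SIMPLE_QA = "simple_qa"
    MATH = "math"
    COMPLEX_CODING = "complex_coding"
    PLANNING = "planning"


# Task kinds the text says are worth the extra thinking tokens.
REASONING_TASKS = {Task.MATH, Task.COMPLEX_CODING, Task.PLANNING}


def pick_model(task: Task,
               reasoning_model: str = "reasoning-model",
               default_model: str = "general-model") -> str:
    """Send hard reasoning work to the reasoning model, everything else to the default."""
    return reasoning_model if task in REASONING_TASKS else default_model
```

In practice the task label would come from a classifier or from the calling feature; the point is that the expensive path is opt-in, not the default.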

Multimodal

Model	Inputs
GPT-4o	text, image, audio
Claude 4	text, image, "computer use"
Gemini 2.5	text, image, video, audio
Llama 3 Vision	text, image

For vision-heavy: GPT-4o or Gemini 2.5. For document images: Claude 4 (strong OCR-like ability).

Self-hosted vs API

Path	When
API (managed)	Default. Lowest upfront cost and ops burden; best model quality; scaling handled by the provider
Self-hosted Llama / Mistral	High volume + cost-sensitive; data sovereignty
Edge (Ollama, Phi via ONNX)	Privacy; offline

For most teams: API. Self-host only with measured cost win + ops capacity.
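A "measured cost win" can be framed as a break-even volume. The sketch below assumes a flat blended API price per million tokens and a fixed GPU hourly rate, and it deliberately ignores engineering time, GPU throughput limits, and idle capacity, all of which push the real break-even higher. All numbers in the usage example are hypothetical.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month


def api_monthly_cost(tokens_per_month: float, price_per_m_tokens: float) -> float:
    """Pay-per-token cost: scales linearly with volume."""
    return tokens_per_month / 1_000_000 * price_per_m_tokens


def self_hosted_monthly_cost(gpu_hourly_rate: float, gpu_count: int,
                             hours: float = HOURS_PER_MONTH) -> float:
    """Fixed cost: GPUs are billed whether or not they serve traffic."""
    return gpu_hourly_rate * gpu_count * hours


def break_even_tokens_per_month(price_per_m_tokens: float, gpu_hourly_rate: float,
                                gpu_count: int, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly token volume at which the API bill equals the GPU bill."""
    fixed = self_hosted_monthly_cost(gpu_hourly_rate, gpu_count, hours)
    return fixed / price_per_m_tokens * 1_000_000


if __name__ == "__main__":
    # Hypothetical: $0.50 per M tokens via API vs. one GPU at $2.00/hr.
    tokens = break_even_tokens_per_month(0.50, 2.00, 1)
    print(f"Break-even at about {tokens / 1e9:.2f}B tokens per month")
```

Below the break-even volume the API is cheaper even before ops cost is counted; above it, self-hosting is worth evaluating only if the hardware can actually sustain that throughput.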

Geographic / compliance

Azure OpenAI: runs in your chosen Azure region, which helps with data residency.
AWS Bedrock: hosts Anthropic, Meta, Mistral, Amazon Nova.
GCP Vertex AI: Gemini, Anthropic, others.
Oracle Cloud (OCI): hosts some third-party models.
Sovereignty regions: gov, EU, China.

Model cycle

Models age fast. Today's flagship → tomorrow's cheap commodity. Build vendor-portable code (IChatClient) so you can switch.
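The idea behind a vendor-portable abstraction can be sketched in a few lines. This is a language-neutral illustration in Python, not the IChatClient API itself: the `ChatClient` protocol, its `complete` method, and the `EchoClient` stand-in are all hypothetical names chosen for this example.

```python
from typing import Protocol


class ChatClient(Protocol):
    """Hypothetical minimal interface; plays the role an abstraction like IChatClient would."""

    def complete(self, prompt: str) -> str: ...


class EchoClient:
    """Stand-in provider adapter. A real adapter would call a vendor SDK here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def summarize(client: ChatClient, text: str) -> str:
    # Application code depends only on the interface, never on a vendor SDK.
    return client.complete(f"Summarize: {text}")


if __name__ == "__main__":
    # Switching providers is a one-line change at the composition root.
    print(summarize(EchoClient("provider-a"), "quarterly report"))
    print(summarize(EchoClient("provider-b"), "quarterly report"))
```

Because `summarize` never imports a vendor package, replacing an aging model means writing one new adapter rather than touching every call site.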

Senior strategy

Default to a frontier provider (Azure OpenAI typically).
Prototype with cheap tier (gpt-4o-mini).
Measure quality / latency / cost.
Promote to expensive only where measurably better.
Multi-provider fallback for resilience.
Keep abstraction layer (IChatClient).
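Two of the steps above, fallback and measured promotion, lend themselves to a short sketch. Everything here is an assumption made for illustration: providers are modeled as plain functions from prompt to completion, scores are whatever quality metric your evaluation produces on a 0-1 scale, and the 0.05 minimum gain is an arbitrary threshold.

```python
from typing import Callable, Optional, Sequence

ChatFn = Callable[[str], str]  # hypothetical shape: prompt in, completion out


def with_fallback(providers: Sequence[ChatFn]) -> ChatFn:
    """Wrap providers so each is tried in order until one succeeds."""

    def call(prompt: str) -> str:
        last_error: Optional[Exception] = None
        for provider in providers:
            try:
                return provider(prompt)
            except Exception as exc:  # real code would catch narrower error types
                last_error = exc
        raise RuntimeError("all providers failed") from last_error

    return call


def should_promote(cheap_score: float, expensive_score: float,
                   min_gain: float = 0.05) -> bool:
    """Promote to the expensive tier only when the measured gain clears the threshold."""
    return expensive_score - cheap_score >= min_gain
```

A usage example: `with_fallback([primary, secondary])("hello")` returns the primary's answer unless it raises, in which case the secondary is tried. `should_promote` makes the "only where measurably better" rule explicit, so the decision is tied to an evaluation result rather than to intuition.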