Pricing & Cost Engineering
Key Points
- LLM pricing = input tokens × $/M + output tokens × $/M (output 3-5x input).
- Prompt caching (Anthropic, OpenAI) → ~90% discount on cached prefix; HUGE for chatbots / RAG with stable system prompt.
- Batch API (OpenAI, Anthropic): 50% discount for non-real-time.
- Levers: cheap default model, prompt cache, response cache, max_tokens cap, RAG (vs long context), embedding dim reduction.
- At scale: predictable cost requires routing, caching, monitoring, alerts.
Pricing landscape (mid-2026)
Per 1M tokens, INPUT/OUTPUT:
GPT-5: $15 / $60
GPT-4o: $2.50 / $10
GPT-4o-mini: $0.15 / $0.60
o3: $15 / $60+
o3-mini: $1.10 / $4.40
Claude 4 Opus: $15 / $75
Claude 4 Sonnet: $3 / $15
Claude 4 Haiku: $0.25 / $1.25
Gemini 2.5 Pro: $3.50 / $10.50
Gemini 2.5 Flash: $0.30 / $2.50
Llama 3.3 70B (Together): $0.20 / $0.60
Llama 3.3 70B (Groq): $0.59 / $0.79
DeepSeek R1 (DeepSeek): ~$0.55 / $2.19
text-embedding-3-small: $0.02
text-embedding-3-large: $0.13
voyage-3: $0.06
Cohere Rerank: $0.20-0.30 / 1K queries
Prices change. Always verify with vendor's pricing page.
Token math
"Hello, world!" ≈ 4 tokens
1 KB English text ≈ 250-300 tokens
1 page (500 words) ≈ 700 tokens
1M tokens ≈ ~750K English words ≈ ~1500 pages
Use vendor tokenizer libraries (tiktoken for OpenAI, etc.) to estimate accurately.
Prompt caching
Anthropic & OpenAI automatically cache prefixes that repeat across calls.
Anthropic:
// Mark stable content as cacheable
new MessageParameters
{
System = [new TextContent("Long system prompt...") { CacheControl = new() { Type = "ephemeral" } }],
/* ... */
};
90% discount on cache hit. 5-min TTL.
OpenAI: - Automatic for prompts >1024 tokens. - Same prefix → cached.
Optimization: put stable content (system prompt, RAG context) at the start. Append user query at the end.
Batch API
50% discount. For non-real-time workloads (eval, dataset generation, summarization runs).
Cost levers
1. Cheap default model
gpt-4o-mini handles 80% of requests; promote only when needed.
2. Response caching
For identical or semantically similar queries:
Microsoft.Extensions.AI has UseDistributedCache middleware. Hash prompt → cache response.
For semantic caching: hash embedding bucket → cached responses for similar.
3. RAG over long context
Don't dump entire KB into prompt. Retrieve top-K. Smaller context = cheaper + better.
4. max_tokens cap
Prevents runaway responses (especially with reasoning models).
5. Streaming + early termination
Stream output; user can stop if going wrong → save tokens.
6. Embedding optimization
- text-embedding-3-small over -large (saves 6x cost).
- Dimension reduction (1536 → 256): saves storage + search cost; minor quality loss.
- Cache embeddings for unchanged docs.
7. Quantize / cheaper rerankers
Use small open rerankers (BGE) over Cohere if quality OK.
8. Self-host break-even
Break-even typically >10M tokens/day on dense workloads. Below: API cheaper. Above: self-host worth considering.
Monitoring
OTel emits gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, model. Build dashboards.
// KQL on App Insights
customMetrics
| where name startswith "gen_ai"
| summarize total = sum(value) by model = tostring(customDimensions.model), tokens = tostring(customDimensions.token_type)
Cost alerts
- Daily budget per tenant.
- Alert at 80% of budget.
- Hard cap at 100% — return graceful error.
- Investigate spikes (abuse? bug? new feature?).
Examples
RAG chatbot, 1M queries/month
Avg query: 3K input + 500 output
With gpt-4o: 1M * (3K * 2.50/1M + 500 * 10/1M) = ~$12,500/mo
With gpt-4o-mini: 1M * (3K * 0.15/1M + 500 * 0.60/1M) = ~$750/mo
Savings: 94%
For most queries, mini is fine; use gpt-4o for hard ones.
Agent with tool use
Reasoning models multiply this 2-5x.
Embedding 1M docs
1M * 1K tokens * $0.02/M = $20 (text-embedding-3-small)
1M * 1K tokens * $0.13/M = $130 (text-embedding-3-large)
Senior considerations
- Daily/per-tenant budget: hard caps prevent runaway.
- Model routing as policy: documented; reviewed.
- Cost per-feature: tag spans with
feature_id. - Per-customer cost: for B2B, attribute and bill.
- Staging cost can dominate small teams — use cheaper models there.