Skip to content

Pricing & Cost Engineering

Key Points

  • LLM pricing = input tokens × $/M + output tokens × $/M (output 3-5x input).
  • Prompt caching (Anthropic, OpenAI) → ~90% discount on cached prefix; HUGE for chatbots / RAG with stable system prompt.
  • Batch API (OpenAI, Anthropic): 50% discount for non-real-time.
  • Levers: cheap default model, prompt cache, response cache, max_tokens cap, RAG (vs long context), embedding dim reduction.
  • At scale: predictable cost requires routing, caching, monitoring, alerts.

Pricing landscape (mid-2026)

Per 1M tokens, INPUT/OUTPUT:

GPT-5:                $15 / $60
GPT-4o:               $2.50 / $10
GPT-4o-mini:          $0.15 / $0.60
o3:                   $15 / $60+
o3-mini:              $1.10 / $4.40

Claude 4 Opus:        $15 / $75
Claude 4 Sonnet:      $3 / $15
Claude 4 Haiku:       $0.25 / $1.25

Gemini 2.5 Pro:       $3.50 / $10.50
Gemini 2.5 Flash:     $0.30 / $2.50

Llama 3.3 70B (Together): $0.20 / $0.60
Llama 3.3 70B (Groq):    $0.59 / $0.79
DeepSeek R1 (DeepSeek):   ~$0.55 / $2.19

text-embedding-3-small: $0.02
text-embedding-3-large: $0.13
voyage-3: $0.06
Cohere Rerank: $0.20-0.30 / 1K queries

Prices change. Always verify with vendor's pricing page.

Token math

"Hello, world!" ≈ 4 tokens
1 KB English text ≈ 250-300 tokens
1 page (500 words) ≈ 700 tokens
1M tokens ≈ ~750K English words ≈ ~1500 pages

Use vendor tokenizer libraries (tiktoken for OpenAI, etc.) to estimate accurately.

Prompt caching

Anthropic & OpenAI automatically cache prefixes that repeat across calls.

Anthropic:

// Mark stable content as cacheable
new MessageParameters
{
    System = [new TextContent("Long system prompt...") { CacheControl = new() { Type = "ephemeral" } }],
    /* ... */
};

90% discount on cache hit. 5-min TTL.

OpenAI: - Automatic for prompts >1024 tokens. - Same prefix → cached.

Optimization: put stable content (system prompt, RAG context) at the start. Append user query at the end.

Batch API

// Submit batch; results in 24h
var batch = await client.Batches.CreateBatchAsync(...);

50% discount. For non-real-time workloads (eval, dataset generation, summarization runs).

Cost levers

1. Cheap default model

public IChatClient Pick(Request r) =>
    r.RequiresFrontier ? _gpt5 : _gpt4oMini;

gpt-4o-mini handles 80% of requests; promote only when needed.

2. Response caching

For identical or semantically similar queries:

chat = chat.AsBuilder().UseDistributedCache(redisCache).Build();

Microsoft.Extensions.AI has UseDistributedCache middleware. Hash prompt → cache response.

For semantic caching: hash embedding bucket → cached responses for similar.

3. RAG over long context

Don't dump entire KB into prompt. Retrieve top-K. Smaller context = cheaper + better.

4. max_tokens cap

new ChatOptions { MaxOutputTokens = 500 }

Prevents runaway responses (especially with reasoning models).

5. Streaming + early termination

Stream output; user can stop if going wrong → save tokens.

6. Embedding optimization

  • text-embedding-3-small over -large (saves 6x cost).
  • Dimension reduction (1536 → 256): saves storage + search cost; minor quality loss.
  • Cache embeddings for unchanged docs.

7. Quantize / cheaper rerankers

Use small open rerankers (BGE) over Cohere if quality OK.

8. Self-host break-even

Per-token cost (API) × volume vs (GPU $/hr × hours)

Break-even typically >10M tokens/day on dense workloads. Below: API cheaper. Above: self-host worth considering.

Monitoring

chat = chat.AsBuilder().UseOpenTelemetry(o => o.EnableSensitiveData = false).Build();

OTel emits gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, model. Build dashboards.

// KQL on App Insights
customMetrics
| where name startswith "gen_ai"
| summarize total = sum(value) by model = tostring(customDimensions.model), tokens = tostring(customDimensions.token_type)

Cost alerts

  • Daily budget per tenant.
  • Alert at 80% of budget.
  • Hard cap at 100% — return graceful error.
  • Investigate spikes (abuse? bug? new feature?).

Examples

RAG chatbot, 1M queries/month

Avg query: 3K input + 500 output
With gpt-4o:    1M * (3K * 2.50/1M + 500 * 10/1M) = ~$12,500/mo
With gpt-4o-mini: 1M * (3K * 0.15/1M + 500 * 0.60/1M) = ~$750/mo
Savings: 94%

For most queries, mini is fine; use gpt-4o for hard ones.

Agent with tool use

3-5 tool roundtrips × 2K input × 200 output = ~12K input + 1K output per agent run

Reasoning models multiply this 2-5x.

Embedding 1M docs

1M * 1K tokens * $0.02/M = $20 (text-embedding-3-small)
1M * 1K tokens * $0.13/M = $130 (text-embedding-3-large)

Senior considerations

  • Daily/per-tenant budget: hard caps prevent runaway.
  • Model routing as policy: documented; reviewed.
  • Cost per-feature: tag spans with feature_id.
  • Per-customer cost: for B2B, attribute and bill.
  • Staging cost can dominate small teams — use cheaper models there.

Cross-references