Cost Tracking & Token Attribution
Key Points
- The senior's job: a per-feature, per-user, per-tenant cost dashboard for AI usage that you can show finance with no apologies.
- Unit economics is the question — "how much does feature X cost per call, per user, per month?" If you can't answer that, you can't price the feature, can't quota it, can't budget it.
- The raw signal: OpenTelemetry GenAI semantic conventions emit gen_ai.usage.input_tokens and gen_ai.usage.output_tokens per request. Multiply by per-model rates → $ per call.
- Attribution dimensions: model, feature/route/agent, user, tenant, environment. Tag every metric & log with these. Without them, the dashboard is a flat aggregate that tells you nothing.
- Patterns: (1) tag the metric stream in OTel + aggregate in App Insights/Datadog; (2) local cost ledger table (one row per call) for drill-in; (3) per-tenant soft-quota with alerts.
- Custom metric gen_ai.cost.usd: there's no standard $ metric yet, so define your own.
- Implement as Microsoft.Extensions.AI middleware in the IChatClient pipeline; it captures every call uniformly.
- Cost surprises: streaming bills output tokens even on disconnect; embeddings are cheap per token but high-volume; image gen is per-image flat (not per token).
- Forecasting: never linear-extrapolate — usage spikes. Use 95th percentile of recent week.
- Showback (visibility) vs chargeback (actual billing) — different conversations with finance.
- Anti-patterns: averaging cost across requests (P99 dominates revenue exposure), hiding spend in opaque cloud billing, not separating dev/staging/prod.
Concepts (deep dive)
The unit economics question
Your CFO walks up: "Our OpenAI bill jumped 40% this month. Why?" If your only answer is "AI got more popular" you've failed. The senior's answer is a dashboard with these slices:
This month: $48,200
  by feature:
    chat-support:    $22,400  (+12%)
    summarizer:      $14,800  (+95%)  ← the spike
    embeddings:      $ 6,100  (+4%)
    image-gen:       $ 4,900  (+2%)
  by tenant (top 5):
    tenant-acme:     $ 8,200  ← largest
    tenant-bigco:    $ 6,400
    ...
  by model:
    gpt-4o:          $32,100
    text-embedding:  $ 6,100
    ...
That requires every AI call to be tagged at emit time with feature, tenant, user, model, env. Tag-on-emit; aggregate later.
The OTel GenAI signal
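OpenTelemetry GenAI semantic conventions (still marked "Development" rather than stable at the time of writing, so names can change between versions) define:
gen_ai.usage.input_tokens (span attribute) ← prompt tokens
gen_ai.usage.output_tokens (span attribute) ← completion tokens
gen_ai.client.token.usage (histogram) ← token counts, split by gen_ai.token.type
gen_ai.client.operation.duration (histogram)
gen_ai.request.model
gen_ai.response.model
gen_ai.system ← "openai", "anthropic", ... (newer versions rename this to gen_ai.provider.name)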
Microsoft.Extensions.AI's UseOpenTelemetry() middleware emits these automatically. With EnableSensitiveData left at false (the default; .UseOpenTelemetry(configure: o => o.EnableSensitiveData = false) states it explicitly) you get the telemetry without prompt/response bodies, which is safe for prod.
From tokens to dollars
There's no OTel-standard cost.usd metric (token prices are vendor-specific and change). You define a custom metric:
private static readonly Meter Meter = new("MyApp.AI", "1.0");
private static readonly Counter<double> CostUsd = Meter.CreateCounter<double>(
    "gen_ai.cost.usd", unit: "USD", description: "AI usage cost in USD");

// inside middleware: rates are decimal, the counter takes double, so cast at the edge
decimal cost = (inputTokens * rates.InputPerToken) + (outputTokens * rates.OutputPerToken);
CostUsd.Add((double)cost,
    new("gen_ai.system", system),
    new("gen_ai.request.model", model),
    new("feature", feature),
    new("tenant", tenant),
    new("env", env));
A small lookup table holds rates per model:
public record ModelRates(decimal InputPerToken, decimal OutputPerToken);

public static class Rates
{
    private static readonly Dictionary<string, ModelRates> _table = new()
    {
        ["gpt-4o"] = new(0.0000025m, 0.000010m),
        ["gpt-4o-mini"] = new(0.000000150m, 0.000000600m),
        ["claude-sonnet-4"] = new(0.000003m, 0.000015m),
        ["text-embedding-3-small"] = new(0.00000002m, 0m),
    };

    public static ModelRates Of(string model) => _table.GetValueOrDefault(model)
        ?? new(0m, 0m); // unknown model → log + zero
}
Keep this table in config, not source, because rates change. Pull it from App Configuration on a refresh interval.
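A minimal sketch of a config-backed rate table, assuming rates arrive as a JSON document of per-million-token prices. The JSON shape, `ConfigRateProvider`, `Reload`, and `UnknownModels` are hypothetical names for illustration, not part of any library; whatever refreshes your configuration would call `Reload`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text.Json;

// Per-token rates in USD (same shape as the ModelRates record above).
public record ModelRates(decimal InputPerToken, decimal OutputPerToken);

// Hypothetical provider: holds the current table and swaps it atomically on reload.
public sealed class ConfigRateProvider
{
    private volatile IReadOnlyDictionary<string, ModelRates> _table =
        new Dictionary<string, ModelRates>();

    // Models that were requested but have no configured rate (feed this into an alert).
    public ConcurrentDictionary<string, byte> UnknownModels { get; } = new();

    // Assumed JSON shape:
    // { "gpt-4o": { "inputPerMillion": 2.50, "outputPerMillion": 10.00 } }
    public void Reload(string json)
    {
        var next = new Dictionary<string, ModelRates>(StringComparer.OrdinalIgnoreCase);
        using var doc = JsonDocument.Parse(json);
        foreach (var model in doc.RootElement.EnumerateObject())
        {
            decimal input = model.Value.GetProperty("inputPerMillion").GetDecimal();
            decimal output = model.Value.GetProperty("outputPerMillion").GetDecimal();
            next[model.Name] = new ModelRates(input / 1_000_000m, output / 1_000_000m);
        }
        _table = next; // single reference swap; readers never see a half-built table
    }

    public ModelRates Get(string model)
    {
        if (_table.TryGetValue(model, out var rates)) return rates;
        UnknownModels.TryAdd(model, 0); // remember it so it can be alerted on
        return new ModelRates(0m, 0m);  // zero rate: the call is counted but not priced
    }
}
```

Storing prices per million tokens keeps the config readable next to a vendor price sheet; the division to per-token rates happens once at load time.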
Pattern 1 — Tag the metric stream
Every OTel metric carries dimensions; aggregate in your APM:
chat = chat.AsBuilder()
    // constructor arguments match the CostTrackingClient skeleton later in this section
    .Use(inner => new CostTrackingClient(inner, ratesProvider, ledger, contextAccessor))
    .UseOpenTelemetry()
    .Build();
Then in App Insights / Datadog:
- Group by feature → top expensive features.
- Group by tenant → biggest spenders.
- Group by gen_ai.request.model → model mix.
Pattern 2 — Local cost ledger
Metrics aggregate. Sometimes you need to drill into a specific call: "tenant ACME got billed $0.42 for one summary, why?" That's a row, not a metric.
CREATE TABLE AiCostLedger (
    Id            UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    TimestampUtc  DATETIME2        NOT NULL,
    TenantId      NVARCHAR(64)     NOT NULL,
    UserId        NVARCHAR(64)     NULL,
    Feature       NVARCHAR(128)    NOT NULL,
    Model         NVARCHAR(64)     NOT NULL,
    InputTokens   INT              NOT NULL,
    OutputTokens  INT              NOT NULL,
    CostUsd       DECIMAL(18,6)    NOT NULL,
    DurationMs    INT              NOT NULL,
    RequestId     NVARCHAR(64)     NOT NULL,
    CacheHit      BIT              NOT NULL DEFAULT 0,
    INDEX IX_TenantTime (TenantId, TimestampUtc),
    INDEX IX_FeatureTime (Feature, TimestampUtc)
);
Insert one row per call. For high-volume systems write async via a buffered channel, or send to Application Insights custom events table and skip SQL entirely.
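The buffered-channel write can be sketched with `System.Threading.Channels`. This is a minimal version under assumptions: `LedgerRow`, `BufferedLedgerWriter`, and the `flush` delegate (standing in for a bulk INSERT) are hypothetical names, and retry or error handling for a failed flush is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical row shape; a real ledger row would carry the full column set.
public record LedgerRow(string TenantId, string Feature, decimal CostUsd);

public sealed class BufferedLedgerWriter : IAsyncDisposable
{
    private readonly Channel<LedgerRow> _channel;
    private readonly Func<IReadOnlyList<LedgerRow>, Task> _flush; // e.g. one bulk INSERT
    private readonly int _batchSize;
    private readonly Task _pump;

    public BufferedLedgerWriter(Func<IReadOnlyList<LedgerRow>, Task> flush,
                                int capacity = 10_000, int batchSize = 100)
    {
        _flush = flush;
        _batchSize = batchSize;
        // Bounded so a stalled database cannot grow memory without limit.
        _channel = Channel.CreateBounded<LedgerRow>(new BoundedChannelOptions(capacity)
        {
            SingleReader = true,
            FullMode = BoundedChannelFullMode.Wait
        });
        _pump = Task.Run(() => PumpAsync());
    }

    // Called on the request path: never blocks; false means the buffer is full.
    public bool TryEnqueue(LedgerRow row) => _channel.Writer.TryWrite(row);

    private async Task PumpAsync()
    {
        var batch = new List<LedgerRow>(_batchSize);
        while (await _channel.Reader.WaitToReadAsync())
        {
            while (batch.Count < _batchSize && _channel.Reader.TryRead(out var row))
                batch.Add(row);
            if (batch.Count > 0)
                await _flush(batch.ToArray());
            batch.Clear();
        }
    }

    // Stop accepting rows, then wait until everything buffered has been flushed.
    public async ValueTask DisposeAsync()
    {
        _channel.Writer.Complete();
        await _pump;
    }
}
```

When `TryEnqueue` returns false the caller has to decide between dropping the row (and counting the drop) or writing synchronously; for chargeback-grade ledgers dropping is usually not acceptable.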
Pattern 3 — Per-tenant soft-quota
Once the ledger has rows, you can enforce plan-aware quotas per tenant:
public async Task<bool> CheckQuotaAsync(string tenant, CancellationToken ct)
{
    var monthlySpend = await _ledger.SumAsync(t => t.CostUsd,
        where: t => t.TenantId == tenant && t.TimestampUtc >= MonthStart, ct);
    var planLimit = await _plans.GetMonthlyLimitAsync(tenant, ct);
    if (planLimit <= 0m) return false; // no plan or zero limit: block, and avoid dividing by zero

    var ratio = monthlySpend / planLimit;
    if (ratio > 1.0m) return false; // hard block at 100%
    if (ratio > 0.8m) _alerts.Warn(tenant, $"{ratio:P0} of monthly limit"); // soft warn at 80%
    return true;
}
Cache attribution
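Caching saves money, so measure that explicitly. When a UseDistributedCache() middleware returns a cached response, the cost middleware should record zero billed cost for the call (a ledger row with CacheHit = 1) and add the estimated cost to a separate gen_ai.cost_saved.usd counter.
The estimate is the cost of the call had it not hit the cache. Useful KPI: "caching saved $4,800 last month."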
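One way to keep billed and avoided spend apart is two counters fed from the same estimate. This is a sketch, not a fixed design: `CacheAwareCost` and its `Record` method are hypothetical, and the `cacheHit` flag is assumed to be supplied by the caller (how your cache middleware signals a hit is up to your pipeline).

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class CacheAwareCost
{
    private static readonly Meter Meter = new("MyApp.AI.Cost");
    private static readonly Counter<double> CostUsd =
        Meter.CreateCounter<double>("gen_ai.cost.usd", "USD");
    private static readonly Counter<double> CostSavedUsd =
        Meter.CreateCounter<double>("gen_ai.cost_saved.usd", "USD");

    // Returns (billed, saved) so the caller can also write them to the ledger.
    public static (decimal Billed, decimal Saved) Record(
        bool cacheHit, long inputTokens, long outputTokens,
        decimal inputPerToken, decimal outputPerToken,
        string feature, string tenant)
    {
        // What the call costs, or would have cost, at the configured rates.
        decimal estimate = inputTokens * inputPerToken + outputTokens * outputPerToken;
        var tags = new TagList { { "feature", feature }, { "tenant", tenant } };

        if (cacheHit)
        {
            CostSavedUsd.Add((double)estimate, tags); // avoided spend
            return (0m, estimate);
        }

        CostUsd.Add((double)estimate, tags);          // real spend
        return (estimate, 0m);
    }
}
```

Summing gen_ai.cost_saved.usd over a period gives the savings figure, and because the tags match the cost counter it can be sliced by the same feature and tenant dimensions.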
KQL queries (App Insights)
// Top 10 most expensive features last 24h
customMetrics
| where name == "gen_ai.cost.usd"
| where timestamp > ago(24h)
| extend feature = tostring(customDimensions.feature)
| summarize TotalUsd = sum(value) by feature
| top 10 by TotalUsd desc

// Tenant cost trend last 30d, daily
customMetrics
| where name == "gen_ai.cost.usd" and tostring(customDimensions.tenant) == "acme"
| where timestamp > ago(30d)
| summarize Usd = sum(value) by bin(timestamp, 1d)
| render timechart

// Cost distribution by model (find P99)
// Note: customMetrics rows are usually pre-aggregated per export interval, so these
// percentiles describe intervals, not single calls; use ledger rows for true per-call P99.
customMetrics
| where name == "gen_ai.cost.usd"
| where timestamp > ago(7d)
| extend model = tostring(customDimensions["gen_ai.request.model"])
| summarize p50 = percentile(value, 50), p95 = percentile(value, 95), p99 = percentile(value, 99) by model
Cost surprises a senior must know
- Streaming bills output tokens on disconnect. The model generates server-side regardless. Cancellation must abort the upstream HTTP call, not just iteration.
- Embeddings are cheap per token, huge in bulk. At $0.02/1M tokens, 100M tokens is only $2, but re-indexing a 100B-token corpus is $2,000. Track totals.
- Image generation is per-image flat (DALL-E 3: $0.04–$0.12). Different rate code path.
- Function calls cost both ways — model emits call (output), you append result, model continues (input). Multi-tool agents 2–5× single-call cost.
- Multimodal input has model-specific token math (image tile sizes; audio sampling).
- Reasoning models (o1, o3) bill invisible reasoning tokens as output. A 200-token answer may bill 4,000.
- Cached prompts bill at 10–50% of full rate for the cached portion. Track cache hit rate and discounted spend separately.
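Two of the surprises above, hidden reasoning tokens and discounted cached input, change the cost formula itself. A minimal sketch of that arithmetic follows; `TokenPricing.Cost` is a hypothetical helper, and how each provider reports cached and reasoning token counts is not covered here.

```csharp
using System;

public static class TokenPricing
{
    // cachedInputTokens: the part of inputTokens served from the provider's prompt cache.
    // cachedDiscount: fraction of the full input rate charged for that part (e.g. 0.5m).
    // reasoningTokens: hidden tokens billed at the output rate on top of the visible answer.
    public static decimal Cost(
        long inputTokens, long cachedInputTokens,
        long visibleOutputTokens, long reasoningTokens,
        decimal inputPerToken, decimal outputPerToken, decimal cachedDiscount)
    {
        if (cachedInputTokens < 0 || cachedInputTokens > inputTokens)
            throw new ArgumentOutOfRangeException(nameof(cachedInputTokens));

        long uncached = inputTokens - cachedInputTokens;
        decimal input = uncached * inputPerToken
                      + cachedInputTokens * inputPerToken * cachedDiscount;
        decimal output = (visibleOutputTokens + reasoningTokens) * outputPerToken;
        return input + output;
    }
}
```

With illustrative rates of $15/1M input and $60/1M output (assumed figures for the example), a call with 1,000 input tokens of which 600 are cached at 50%, a 200-token answer, and 3,800 reasoning tokens costs $0.0105 for input and $0.24 for output: the invisible reasoning tokens account for most of the bill.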
Forecasting
Linear extrapolation assumes steady usage, so it overshoots when usage later dips and undershoots when it later spikes. Better: 95th percentile of daily spend over the recent week × days remaining, plus an alert if the last 24h > 1.5× the trailing 7-day average. For seasonal apps add weekday/weekend factors.
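The projection and the spike alert can be sketched as pure functions over daily spend totals. The names below are hypothetical, the percentile uses the nearest-rank method, and adding month-to-date spend to get a month-end total is an assumption about how the projection is reported.

```csharp
using System;
using System.Linq;

public static class SpendForecast
{
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    public static decimal Percentile(decimal[] values, double p)
    {
        if (values.Length == 0) throw new ArgumentException("no samples", nameof(values));
        var sorted = values.OrderBy(v => v).ToArray();
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // Month-end projection: spend so far + P95 daily spend × days remaining.
    public static decimal ProjectMonthEnd(decimal monthToDate, decimal[] lastSevenDays, int daysRemaining)
        => monthToDate + Percentile(lastSevenDays, 95) * daysRemaining;

    // Spike alert: last 24h exceeds 1.5× the trailing 7-day average.
    public static bool IsSpike(decimal last24h, decimal[] lastSevenDays)
        => last24h > 1.5m * lastSevenDays.Average();
}
```

With only seven daily samples, nearest-rank P95 is simply the most expensive day of the week, which makes the projection deliberately conservative; feeding hourly buckets instead gives a smoother estimate.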
Showback vs chargeback
| Mode | Definition | When |
|---|---|---|
| Showback | Show tenants their usage; no money changes hands | Internal cost visibility, FinOps culture |
| Chargeback | Bill tenants for their usage | Multi-tenant SaaS with metered AI |
| Allocation | Internal: assign to cost centers / teams | Cross-team platforms |
Chargeback raises the bar on accuracy — disputes will happen. Lock the ledger rows; expose an audit endpoint per tenant.
How it works under the hood
Middleware pipeline
[Caller]
   │
   ▼
ChatClient.GetResponseAsync(messages, options, ct)
   │
   ▼
[UseFunctionInvocation]
   │
   ▼
[CostTrackingClient]     ← (a) capture context (tenant/feature/user)
   │                       (b) start stopwatch
   ▼
[UseDistributedCache]    ← if hit, return without invoking provider
   │
   ▼
[OpenTelemetry]          ← emit gen_ai.usage.* metrics
   │
   ▼
[OpenAI / Anthropic provider]
   │
   │ response (travels back up the pipeline)
   ▼
[CostTrackingClient]     ← (c) compute cost from usage
                           (d) emit gen_ai.cost.usd metric
                           (e) write ledger row
                           (f) check quota
CostTrackingClient skeleton
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.AI;

// IRateProvider, ICostLedger, IAiContextAccessor and CostLedgerRow are your own types.
public sealed class CostTrackingClient(
    IChatClient inner, IRateProvider rates,
    ICostLedger ledger, IAiContextAccessor ctx) : DelegatingChatClient(inner)
{
    private static readonly Meter _meter = new("MyApp.AI.Cost");
    private static readonly Counter<double> _costUsd =
        _meter.CreateCounter<double>("gen_ai.cost.usd", "USD");

    // Non-streaming path only; GetStreamingResponseAsync needs its own override.
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages, ChatOptions? options = null, CancellationToken ct = default)
    {
        var sw = Stopwatch.StartNew();
        var resp = await base.GetResponseAsync(messages, options, ct);
        sw.Stop();

        if (resp.Usage is not { } u) return resp; // provider reported no usage

        // Prefer the model the provider actually used; fall back to the requested one.
        var model = resp.ModelId ?? options?.ModelId ?? "unknown";
        var rate = rates.Get(model);

        // Token counts are nullable; treat a missing count as zero instead of throwing.
        long inputTokens = u.InputTokenCount ?? 0;
        long outputTokens = u.OutputTokenCount ?? 0;
        decimal cost = inputTokens * rate.InputPerToken + outputTokens * rate.OutputPerToken;

        var tags = new TagList
        {
            { "gen_ai.request.model", model },
            { "feature", ctx.Feature }, { "tenant", ctx.TenantId }, { "env", ctx.Environment }
        };
        _costUsd.Add((double)cost, tags);

        await ledger.WriteAsync(new CostLedgerRow(
            Guid.NewGuid(), DateTime.UtcNow, ctx.TenantId, ctx.UserId, ctx.Feature, model,
            inputTokens, outputTokens, cost,
            (int)sw.ElapsedMilliseconds, resp.ResponseId ?? "", CacheHit: false), ct);

        return resp;
    }
}
IAiContextAccessor is your own — pull tenant/feature/user from HttpContext, agent name, or activity baggage. Without context, attribution is impossible.
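A minimal sketch of one way to carry that context, using `AsyncLocal<T>` so the values follow the async call chain into the middleware. `AiCallContext`, `AiContext`, `Push`, and the "unattributed" fallback are all hypothetical names and choices; the source does not prescribe how the accessor is implemented.

```csharp
using System;
using System.Threading;

public sealed record AiCallContext(string TenantId, string? UserId, string Feature, string Environment);

public static class AiContext
{
    private static readonly AsyncLocal<AiCallContext?> _current = new();

    // Fallback keeps unattributed spend visible as its own slice instead of dropping it.
    private static readonly AiCallContext Unattributed =
        new("unattributed", null, "unattributed", "unknown");

    public static AiCallContext Current => _current.Value ?? Unattributed;

    // Sets the context for the current async flow; dispose to restore the previous one.
    public static IDisposable Push(AiCallContext context)
    {
        var previous = _current.Value;
        _current.Value = context;
        return new Scope(previous);
    }

    private sealed class Scope : IDisposable
    {
        private readonly AiCallContext? _previous;
        public Scope(AiCallContext? previous) => _previous = previous;
        public void Dispose() => _current.Value = _previous;
    }
}
```

A typical call site would wrap the request handler, for example `using (AiContext.Push(new AiCallContext(tenantId, userId, "summarizer", "prod"))) { ... }`, so every AI call made inside the block picks up the same tags. A non-zero "unattributed" slice on the dashboard then points at code paths that forgot to set context.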
Code: correct vs wrong
✅ Correct — middleware-based, per-call attribution
(see CostTrackingClient above)
❌ Wrong — averaging cost in dashboards
Average is meaningless. P99 dominates revenue exposure. Use sum + percentile distribution.
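To make the point concrete, here is a small sketch that summarizes per-call costs with sum, mean, and nearest-rank percentiles. `CostDistribution` is a hypothetical helper and the sample figures in the note below are invented for illustration.

```csharp
using System;
using System.Linq;

public static class CostDistribution
{
    public static (decimal Sum, decimal Mean, decimal P50, decimal P99) Summarize(decimal[] costs)
    {
        if (costs.Length == 0) throw new ArgumentException("no samples", nameof(costs));
        var sorted = costs.OrderBy(c => c).ToArray();

        // Nearest-rank percentile over the sorted per-call costs.
        decimal At(double p)
        {
            int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
            return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
        }

        return (sorted.Sum(), sorted.Average(), At(50), At(99));
    }
}
```

For 97 calls at $0.01 and 3 calls at $5.00, the mean is about $0.16, a figure no actual call cost. The median is $0.01, P99 is $5.00, and the sum is $15.97, of which the three expensive calls account for $15.00. Percentiles have a blind spot too: an outlier rarer than 1 in 100 will not move P99, so keep a max or a top-N query alongside.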
❌ Wrong — trusting cloud bill alone
Azure OpenAI bill arrives 1–3 days late, aggregated, with no per-feature tagging. By the time you see a $5k anomaly, it's been running 4 days. You need real-time, attributed metrics.
❌ Wrong — global rates constant in code
Pull rates from config; refresh on schedule.
❌ Wrong — same telemetry for dev and prod
Dev/staging eat into your "AI bill" view; finance asks why prod went up when it didn't. Always tag env.
✅ Correct — soft quota with alert
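// QuotaState is assumed to be a small hierarchy: Ok, Warning (with Ratio), Exceeded.
var state = await quota.CheckAsync(tenant);
if (state is QuotaState.Exceeded)
    return Results.StatusCode(429);
if (state is QuotaState.Warning warning)
    _alerts.Notify(tenant, $"AI usage at {warning.Ratio:P0} of plan limit");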
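✅ Correct: cache savings tracked
On each cache hit, emit gen_ai.cost_saved.usd with the cost the call would have incurred (see Cache attribution).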
Design patterns for this topic
Pattern 1 — "Tag-on-emit, aggregate later"
- Intent: every AI call emits with feature/tenant/user/model dimensions; downstream tooling slices freely.
Pattern 2 — "Cost ledger table"
- Intent: drill-in capability, one row per call.
- Storage: SQL or App Insights customEvents. Async-buffered write.
Pattern 3 — "Soft quota → hard quota"
- Intent: warn at 80% of plan; block at 100%.
- Mechanism: middleware checks ledger sum vs plan limit per tenant.
Pattern 4 — "Cache savings KPI"
- Intent: show finance the value of cache investment.
- Mechanism: record what the call would have cost on each hit.
Pattern 5 — "Rates from config"
- Intent: rate changes don't require deploy.
- Mechanism: App Configuration; refresh on interval; alert on unknown model.
Pattern 6 — "Forecast with P95"
- Intent: end-of-month projection that doesn't get blown up by spikes.
- Mechanism: P95 of daily spend over the last 7 days × days remaining; alert if last 24h > 1.5× the trailing 7-day average.
Pros & cons / trade-offs
| Approach | Pros | Cons |
|---|---|---|
| Metrics only | Cheap; aggregate-friendly | No per-call drill-in |
| Ledger table | Audit + drill-in | Storage cost; write throughput |
| Cloud bill only | Zero work | Lagged; un-attributed; finds anomalies too late |
| Per-tenant chargeback | Aligns incentives | Disputes; accuracy bar |
| Showback only | Visibility without billing pain | No spending pressure |
| Custom cost.usd metric | $ visible in dashboards | You own rate table maintenance |
When to use / when to avoid
- Use middleware-based tracking from day 1 of every AI feature. Cheaper to add now than retrofit.
- Use ledger when finance needs audit, when chargeback applies, or when bills are large enough to merit drill-in (>$1k/mo).
- Use quotas in multi-tenant SaaS where one tenant can DoS your AI budget.
- Avoid building dashboards before you have tagged data — you'll just see noise.
- Avoid averaging — sum + P99.
- Avoid linear forecasting — usage is bursty.
- Avoid in-memory rate tables — externalize to config.
Interview Q&A
Q1. What OTel signals does GenAI define for cost? gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (span attributes), plus the gen_ai.client.token.usage and gen_ai.client.operation.duration histograms, with attributes gen_ai.system, gen_ai.request.model, gen_ai.response.model.
Q2. Is there a standard cost.usd metric? No. Define your own custom metric (gen_ai.cost.usd); compute as tokens × rate.
Q3. How do you attribute cost to a feature/tenant/user? Tag every emitted metric and ledger row with feature/tenant/user dimensions. Without tags, attribution is impossible.
Q4. Where does cost tracking live in the IChatClient pipeline? A DelegatingChatClient middleware — captures usage from response, computes cost, emits metric, writes ledger row.
Q5. Streaming + cancellation cost gotcha? Server still generates tokens after the client disconnects unless the upstream HTTP call is actually aborted. Set timeouts; pass cancellation tokens through.
Q6. Anti-pattern: averaging cost? Yes — averages hide P99. One $50 query lost in 10,000 cheap ones disappears. Always use sum + percentile.
Q7. Showback vs chargeback? Showback = visibility, no billing. Chargeback = actually bill the tenant. Chargeback raises accuracy + dispute bar.
Q8. How do you handle rate changes? Externalize to config (App Configuration); refresh interval; alert on unknown model.
Q9. Embeddings cost surprise? Cheap per token but enormous in bulk — re-indexing huge corpora can cost thousands. Track totals.
Q10. Reasoning model billing? o1/o3-style models bill reasoning tokens as output even when invisible — answer of 200 tokens may bill 4,000.
Q11. Forecasting end of month? P95 of daily spend over the last 7 days × days remaining; alert if last 24h > 1.5× the trailing 7-day average.
Q12. KQL for top expensive features?
customMetrics | where name == "gen_ai.cost.usd"
| summarize sum(value) by tostring(customDimensions.feature)
| top 10 by sum_value desc
Q13. Cache savings — how to surface? On cache hit, emit gen_ai.cost_saved.usd = what call would have cost. Sum across period = ROI of cache.
Q14. Per-tenant quota approach? Sum tenant's monthly cost from ledger; compare to plan limit; warn at 80%, block at 100%.
Gotchas / common mistakes
- ⚠️ Forgetting to tag env → dev usage skews "prod cost" dashboards.
- ⚠️ Averaging instead of summing + P99.
- ⚠️ Hardcoded rate constants — out of date by next quarter.
- ⚠️ Streaming disconnects don't save money unless you abort the upstream call.
- ⚠️ Reasoning tokens billed but invisible: o1 looks expensive without explanation.
- ⚠️ Multi-tool agents multiply token cost; tagging only the user request hides amplification.
- ⚠️ Ledger writes blocking the request path → use async buffer.
- ⚠️ Cloud bill anomaly detection is days late; build real-time alerts.
- ⚠️ Cache hit savings not measured → can't justify cache investment.
- ⚠️ Image gen flat per image, not per token — different rate code path.
- ⚠️ No alert when tenant exceeds plan → silent overspend until month-end shock.
- ⚠️ Treating prompt-cached tokens at full rate when provider charges 10–50%.