Cost Tracking & Token Attribution

Key Points

The senior's job: a per-feature, per-user, per-tenant cost dashboard for AI usage that you can show finance with no apologies.
Unit economics is the question — "how much does feature X cost per call, per user, per month?" If you can't answer that, you can't price the feature, can't quota it, can't budget it.
The raw signal: OpenTelemetry GenAI semantic conventions emit gen_ai.usage.input_tokens and gen_ai.usage.output_tokens per request. Multiply by per-model rates → $ per call.
Attribution dimensions: model, feature/route/agent, user, tenant, environment. Tag every metric & log with these. Without them, the dashboard is a flat aggregate that tells you nothing.
Patterns: (1) tag the metric stream in OTel + aggregate in App Insights/Datadog; (2) local cost ledger table (one row per call) for drill-in; (3) per-tenant soft-quota with alerts.
Custom metric gen_ai.cost.usd — there's no standard $ metric yet, define your own.
Implement as Microsoft.Extensions.AI middleware in the IChatClient pipeline — captures every call uniformly.
Cost surprises: streaming bills output tokens even on disconnect; embeddings are cheap per token but high-volume; image gen is per-image flat (not per token).
Forecasting: never extrapolate linearly, because usage is bursty. Use the 95th percentile of the most recent week's daily spend.
Showback (visibility) vs chargeback (actual billing) — different conversations with finance.
Anti-patterns: averaging cost across requests (P99 dominates revenue exposure), hiding spend in opaque cloud billing, not separating dev/staging/prod.

Concepts (deep dive)

The unit economics question

Your CFO walks up: "Our OpenAI bill jumped 40% this month. Why?" If your only answer is "AI got more popular" you've failed. The senior's answer is a dashboard with these slices:

This month: $48,200
  by feature:
    chat-support:   $22,400  (+12%)
    summarizer:     $14,800  (+95%)   ← the spike
    embeddings:     $ 6,100  (+4%)
    image-gen:      $ 4,900  (+2%)
  by tenant (top 5):
    tenant-acme:    $ 8,200  ← largest
    tenant-bigco:   $ 6,400
    ...
  by model:
    gpt-4o:         $32,100
    text-embedding: $ 6,100
    ...

That requires every AI call to be tagged at emit time with feature, tenant, user, model, env. Tag-on-emit; aggregate later.

The OTel GenAI signal

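OpenTelemetry GenAI semantic conventions (still marked "Development" rather than stable, so names can change between versions) define:

gen_ai.usage.input_tokens        (span attribute)   ← prompt tokens
gen_ai.usage.output_tokens       (span attribute)   ← completion tokens
gen_ai.client.token.usage        (histogram; gen_ai.token.type = input | output)
gen_ai.client.operation.duration (histogram)
gen_ai.request.model
gen_ai.response.model
gen_ai.system                    ← "openai", "anthropic", ... (newer versions rename this to gen_ai.provider.name)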

Microsoft.Extensions.AI's UseOpenTelemetry() middleware emits these automatically. EnableSensitiveData defaults to false; leaving it there, or setting it explicitly with .UseOpenTelemetry(configure: o => o.EnableSensitiveData = false), gives you the usage data without prompt/response bodies, which is safe for prod.

From tokens to dollars

There's no OTel-standard cost.usd metric (token prices are vendor-specific and change). You define a custom metric:

private static readonly Meter Meter = new("MyApp.AI", "1.0");
private static readonly Counter<double> CostUsd = Meter.CreateCounter<double>(
    "gen_ai.cost.usd", unit: "USD", description: "AI usage cost in USD");

// inside middleware
var cost = (inputTokens * rates.InputPerToken) + (outputTokens * rates.OutputPerToken);
CostUsd.Add((double)cost,   // rates are decimal; Counter<double> needs an explicit cast
    new("gen_ai.system", system),
    new("gen_ai.request.model", model),
    new("feature", feature),
    new("tenant", tenant),
    new("env", env));

A small lookup table holds rates per model:

public record ModelRates(decimal InputPerToken, decimal OutputPerToken);

public static class Rates
{
    private static readonly Dictionary<string, ModelRates> _table = new()
    {
        ["gpt-4o"]              = new(0.0000025m, 0.000010m),
        ["gpt-4o-mini"]         = new(0.000000150m, 0.000000600m),
        ["claude-sonnet-4"]     = new(0.000003m,  0.000015m),
        ["text-embedding-3-small"] = new(0.00000002m, 0m),
    };
    public static ModelRates Of(string model) => _table.GetValueOrDefault(model)
        ?? new(0m, 0m); // unknown model → log + zero
}

Keep this table in config, not source — rates change. Pull from App Configuration on a refresh.
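A minimal sketch of what loading the table from configuration might look like, assuming the rates arrive as a JSON document keyed by model id and quoted per million tokens (the way vendors publish them). RateTable, FromJson, CostOf, and the JSON property names are hypothetical, chosen for illustration; the App Configuration refresh wiring is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public sealed record ModelRates(decimal InputPerToken, decimal OutputPerToken);

// Hypothetical helper: parses per-million-token prices from JSON into per-token rates.
public sealed class RateTable
{
    private readonly Dictionary<string, ModelRates> _rates;
    private readonly Action<string> _onUnknownModel;

    private RateTable(Dictionary<string, ModelRates> rates, Action<string> onUnknownModel)
    {
        _rates = rates;
        _onUnknownModel = onUnknownModel;
    }

    // Assumed shape: { "gpt-4o": { "inputPerMillion": 2.50, "outputPerMillion": 10.00 } }
    public static RateTable FromJson(string json, Action<string> onUnknownModel)
    {
        var rates = new Dictionary<string, ModelRates>(StringComparer.OrdinalIgnoreCase);
        using var doc = JsonDocument.Parse(json);
        foreach (var model in doc.RootElement.EnumerateObject())
        {
            var input = model.Value.GetProperty("inputPerMillion").GetDecimal();
            var output = model.Value.GetProperty("outputPerMillion").GetDecimal();
            rates[model.Name] = new ModelRates(input / 1_000_000m, output / 1_000_000m);
        }
        return new RateTable(rates, onUnknownModel);
    }

    public ModelRates Get(string model)
    {
        if (_rates.TryGetValue(model, out var found)) return found;
        _onUnknownModel(model);          // surface the gap instead of silently billing $0
        return new ModelRates(0m, 0m);
    }

    public decimal CostOf(string model, long inputTokens, long outputTokens)
    {
        var r = Get(model);
        return inputTokens * r.InputPerToken + outputTokens * r.OutputPerToken;
    }
}
```

With the gpt-4o prices used above, a call with 1,200 input and 350 output tokens costs 1,200 × $0.0000025 + 350 × $0.00001 = $0.0065. Storing prices per million keeps the config readable and avoids miscounting zeros.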

Pattern 1 — Tag the metric stream

Every OTel metric carries dimensions; aggregate in your APM:

chat = chat.AsBuilder()
    .Use(inner => new CostTrackingClient(inner, ratesProvider, ledger, contextAccessor)) // same arguments as the skeleton's constructor
    .UseOpenTelemetry()
    .Build();

Then in App Insights / Datadog:
- Group by feature → top expensive features.
- Group by tenant → biggest spenders.
- Group by gen_ai.request.model → model mix.

Pattern 2 — Local cost ledger

Metrics aggregate. Sometimes you need to drill into a specific call: "tenant ACME got billed $0.42 for one summary, why?" That's a row, not a metric.

CREATE TABLE AiCostLedger (
    Id              UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    TimestampUtc    DATETIME2 NOT NULL,
    TenantId        NVARCHAR(64) NOT NULL,
    UserId          NVARCHAR(64) NULL,
    Feature         NVARCHAR(128) NOT NULL,
    Model           NVARCHAR(64) NOT NULL,
    InputTokens     INT NOT NULL,
    OutputTokens    INT NOT NULL,
    CostUsd         DECIMAL(18,6) NOT NULL,
    DurationMs      INT NOT NULL,
    RequestId       NVARCHAR(64) NOT NULL,
    CacheHit        BIT NOT NULL DEFAULT 0,
    INDEX IX_TenantTime (TenantId, TimestampUtc),
    INDEX IX_FeatureTime (Feature, TimestampUtc)
);

Insert one row per call. For high-volume systems write async via a buffered channel, or send to Application Insights custom events table and skip SQL entirely.
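One way the buffered-channel write could look, as a sketch rather than a finished component: the request path only enqueues, and a background loop flushes batches. BufferedLedgerWriter, LedgerRow, and the flush delegate are hypothetical names; the flush delegate stands in for whatever bulk insert you use. It assumes dropping a row under sustained overload is preferable to stalling requests, which may not hold if the ledger backs chargeback.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed record LedgerRow(string TenantId, string Feature, string Model, decimal CostUsd);

// Hypothetical buffered writer: TryEnqueue never blocks; a single background pump flushes batches.
public sealed class BufferedLedgerWriter : IAsyncDisposable
{
    private readonly Channel<LedgerRow> _channel;
    private readonly Func<IReadOnlyList<LedgerRow>, Task> _flush;
    private readonly int _batchSize;
    private readonly Task _pump;

    public BufferedLedgerWriter(
        Func<IReadOnlyList<LedgerRow>, Task> flush, int capacity = 10_000, int batchSize = 100)
    {
        _flush = flush;
        _batchSize = batchSize;
        // Default full mode is Wait, so TryWrite returns false once the buffer is full.
        _channel = Channel.CreateBounded<LedgerRow>(
            new BoundedChannelOptions(capacity) { SingleReader = true });
        _pump = Task.Run(() => PumpAsync());
    }

    // Returns false when the buffer is full and the row was not accepted; count these.
    public bool TryEnqueue(LedgerRow row) => _channel.Writer.TryWrite(row);

    private async Task PumpAsync()
    {
        var batch = new List<LedgerRow>(_batchSize);
        while (await _channel.Reader.WaitToReadAsync())
        {
            while (batch.Count < _batchSize && _channel.Reader.TryRead(out var row))
                batch.Add(row);
            if (batch.Count == 0) continue;
            await _flush(batch.ToArray());   // an exception here faults the pump; add retry as needed
            batch.Clear();
        }
    }

    public async ValueTask DisposeAsync()
    {
        _channel.Writer.Complete();   // stop accepting rows, then drain what is buffered
        await _pump;
    }
}
```

Disposing the writer at shutdown drains whatever is still buffered, so rows are lost only when TryEnqueue returns false or the process dies abruptly.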

Pattern 3 — Per-tenant soft-quota

Once you have ledger rows, you can enforce plan-aware quotas:

public async Task<bool> CheckQuotaAsync(string tenant, CancellationToken ct)
{
    var monthlySpend = await _ledger.SumAsync(t => t.CostUsd,
        where: t => t.TenantId == tenant && t.TimestampUtc >= MonthStart, ct);
    var planLimit = await _plans.GetMonthlyLimitAsync(tenant, ct);
    if (planLimit <= 0m) return false;                                        // no budget configured: block, and avoid dividing by zero

    var ratio = monthlySpend / planLimit;
    if (ratio >= 1.0m) return false;                                          // hard block at 100%
    if (ratio >= 0.8m) _alerts.Warn(tenant, $"{ratio:P0} of monthly limit");  // soft warn at 80%
    return true;
}

Cache attribution

Caching saves money — measure that explicitly. When a UseDistributedCache() middleware returns a cached response, the cost middleware should record:

CostSavedUsd.Add(estimatedCost, new("feature", feature), new("tenant", tenant));

The estimate is the cost of the call had it not hit the cache. Useful KPI: "caching saved $4,800 last month."

KQL queries (App Insights)

// Top 10 most expensive features last 24h
customMetrics
| where name == "gen_ai.cost.usd"
| where timestamp > ago(24h)
| extend feature = tostring(customDimensions.feature)
| summarize TotalUsd = sum(value) by feature
| top 10 by TotalUsd desc

// Tenant cost trend last 30d, daily
customMetrics
| where name == "gen_ai.cost.usd" and customDimensions.tenant == "acme"
| where timestamp > ago(30d)
| summarize Usd = sum(value) by bin(timestamp, 1d)
| render timechart

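// Cost distribution by model (P50/P95/P99). customMetrics rows are pre-aggregated per export
// interval, so these are percentiles of per-interval totals; use the ledger for true per-call percentiles.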
customMetrics
| where name == "gen_ai.cost.usd"
| where timestamp > ago(7d)
| extend model = tostring(customDimensions["gen_ai.request.model"])
| summarize p50 = percentile(value, 50), p95 = percentile(value, 95), p99 = percentile(value, 99) by model

Cost surprises a senior must know

Streaming bills output tokens on disconnect. The model generates server-side regardless. Cancellation must abort the upstream HTTP call, not just iteration.
Embeddings are cheap per token, huge in bulk. Re-indexing 100B tokens at $0.02/1M = $2,000 (100M tokens is only $2). Track totals.
Image generation is per-image flat (DALL-E 3: $0.04–$0.12). Different rate code path.
Function calls cost both ways — model emits call (output), you append result, model continues (input). Multi-tool agents 2–5× single-call cost.
Multimodal input has model-specific token math (image tile sizes; audio sampling).
Reasoning models (o1, o3) bill invisible reasoning tokens as output. A 200-token answer may bill 4,000.
Cached prompts bill at 10–50% of full rate for the cached portion. Track cache hit rate and discounted spend separately.
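The cached-prompt and reasoning-token items above can be folded into the cost formula roughly as follows. This is a sketch under two assumptions: cached tokens are reported as a subset of input tokens (some providers report them as a separate count instead), and reasoning tokens are already included in the reported output count. TokenUsage, DiscountedRates, and DiscountedCost are hypothetical names.

```csharp
using System;

public sealed record TokenUsage(long InputTokens, long CachedInputTokens, long OutputTokens);
public sealed record DiscountedRates(decimal InputPerToken, decimal CachedInputPerToken, decimal OutputPerToken);

public static class DiscountedCost
{
    // Returns the billed cost and what the call would have cost at the full input rate;
    // the difference is the prompt-cache saving.
    public static (decimal Billed, decimal FullRate) Compute(TokenUsage u, DiscountedRates r)
    {
        if (u.CachedInputTokens < 0 || u.CachedInputTokens > u.InputTokens)
            throw new ArgumentOutOfRangeException(nameof(u), "cached tokens must be a subset of input tokens");

        var uncachedInput = u.InputTokens - u.CachedInputTokens;
        var output = u.OutputTokens * r.OutputPerToken;   // reported output count, reasoning tokens included
        var billed = uncachedInput * r.InputPerToken
                   + u.CachedInputTokens * r.CachedInputPerToken
                   + output;
        var fullRate = u.InputTokens * r.InputPerToken + output;
        return (billed, fullRate);
    }
}
```

For example, 10,000 input tokens of which 8,000 are cached at half price, plus 4,000 output tokens at gpt-4o-style rates, bills $0.055 against $0.065 at full rate. A reasoning-model answer that shows 200 visible tokens but reports 4,000 output tokens is costed on the 4,000.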

Forecasting

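Linear extrapolation of month-to-date spend is unreliable because usage is bursty: it projects a quiet stretch or a spike forward as if it were the norm. Better: 95th percentile of the last week's daily spend × days remaining, plus an alert if the last 24h > 1.5× the trailing 7-day daily average. For seasonal apps add weekday/weekend factors.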

forecast_remaining_month = p95_daily(last_7d) × days_left_in_month
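The formula and the spike alert can be sketched as below, assuming a nearest-rank percentile over daily totals pulled from the ledger. SpendForecast and its members are hypothetical names. With only seven daily samples the nearest-rank P95 is the week's maximum, so the forecast is deliberately conservative; hourly buckets give a finer percentile.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SpendForecast
{
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    public static decimal Percentile(IReadOnlyCollection<decimal> values, double p)
    {
        if (values.Count == 0) throw new ArgumentException("need at least one sample", nameof(values));
        var sorted = values.OrderBy(v => v).ToArray();
        var rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // forecast_remaining_month = p95_daily(last_7d) × days_left_in_month
    public static decimal RemainingMonth(IReadOnlyCollection<decimal> last7DailyTotals, int daysLeft)
        => Percentile(last7DailyTotals, 95) * daysLeft;

    // Alert condition: last 24h above 1.5× the trailing 7-day daily average.
    public static bool IsSpike(decimal last24h, IReadOnlyCollection<decimal> last7DailyTotals)
        => last24h > 1.5m * last7DailyTotals.Average();
}
```

Add month-to-date spend to RemainingMonth to get the projected month-end total.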

Showback vs chargeback

Mode	Definition	When
Showback	Show tenants their usage; no money changes hands	Internal cost visibility, FinOps culture
Chargeback	Bill tenants for their usage	Multi-tenant SaaS with metered AI
Allocation	Internal: assign to cost centers / teams	Cross-team platforms

Chargeback raises the bar on accuracy — disputes will happen. Lock the ledger rows; expose an audit endpoint per tenant.

How it works under the hood

Middleware pipeline

[Caller]
   │
   ▼
ChatClient.GetResponseAsync(messages, options, ct)
   │
   ▼
[UseFunctionInvocation]
   │
   ▼
[CostTrackingClient]   ← (a) capture context (tenant/feature/user)
   │                     (b) start stopwatch
   ▼
[UseDistributedCache]  ← if hit, return without invoking provider
   │
   ▼
[OpenTelemetry]        ← emit gen_ai.usage.* metrics
   │
   ▼
[OpenAI / Anthropic provider]
   │
   ▲
   │ response
   ▼
[CostTrackingClient]   ← (c) compute cost from usage
                         (d) emit gen_ai.cost.usd metric
                         (e) write ledger row
                         (f) check quota

CostTrackingClient skeleton

using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.AI;

public sealed class CostTrackingClient(
    IChatClient inner, IRateProvider rates,
    ICostLedger ledger, IAiContextAccessor ctx) : DelegatingChatClient(inner)
{
    private static readonly Meter _meter = new("MyApp.AI.Cost");
    private static readonly Counter<double> _costUsd =
        _meter.CreateCounter<double>("gen_ai.cost.usd", "USD");

    // Non-streaming path only; GetStreamingResponseAsync needs the same treatment.
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages, ChatOptions? options = null, CancellationToken ct = default)
    {
        var sw = Stopwatch.StartNew();
        var resp = await base.GetResponseAsync(messages, options, ct);
        sw.Stop();

        if (resp.Usage is not { } u) return resp;    // provider reported no usage: nothing to cost
        // Fall back to the response's model id when the caller did not set one.
        var model  = options?.ModelId ?? resp.ModelId ?? "unknown";
        var input  = u.InputTokenCount ?? 0;          // token counts are nullable
        var output = u.OutputTokenCount ?? 0;
        var rate   = rates.Get(model);
        var cost   = input * rate.InputPerToken + output * rate.OutputPerToken;

        var tags = new TagList
        {
            { "gen_ai.request.model", model },
            { "feature", ctx.Feature }, { "tenant", ctx.TenantId }, { "env", ctx.Environment }
        };
        _costUsd.Add((double)cost, tags);

        // Awaited inline for brevity; in production hand the row to a buffered writer.
        await ledger.WriteAsync(new CostLedgerRow(
            Guid.NewGuid(), DateTime.UtcNow, ctx.TenantId, ctx.UserId, ctx.Feature, model,
            input, output, cost,
            (int)sw.ElapsedMilliseconds, resp.ResponseId ?? "", CacheHit: false), ct);

        return resp;
    }
}

IAiContextAccessor is your own — pull tenant/feature/user from HttpContext, agent name, or activity baggage. Without context, attribution is impossible.
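One possible shape for such an accessor, sketched with AsyncLocal so the context follows the async call chain without being passed through every method. AmbientAiContext, AiCallContext, and Push are hypothetical names and differ from the IAiContextAccessor members used in the skeleton; an HttpContext-based or baggage-based accessor would work just as well.

```csharp
using System;
using System.Threading;

public sealed record AiCallContext(string TenantId, string? UserId, string Feature, string Environment);

// Hypothetical ambient accessor: Push sets the context for the current async flow,
// and disposing the returned scope restores whatever was there before.
public sealed class AmbientAiContext
{
    private static readonly AsyncLocal<AiCallContext?> Current = new();

    public AiCallContext? Context => Current.Value;

    public IDisposable Push(AiCallContext context)
    {
        var previous = Current.Value;
        Current.Value = context;
        return new Scope(previous);
    }

    private sealed class Scope(AiCallContext? previous) : IDisposable
    {
        public void Dispose() => Current.Value = previous;
    }
}
```

A request middleware or agent runner would wrap each unit of work in a using block around Push(...), so every AI call made inside it sees the same tenant and feature. Calls made with no context pushed should be logged as unattributed rather than dropped.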

Code: correct vs wrong

✅ Correct — middleware-based, per-call attribution

(see CostTrackingClient above)

❌ Wrong — averaging cost in dashboards

customMetrics | where name == "gen_ai.cost.usd" | summarize avg(value)

Average is meaningless. P99 dominates revenue exposure. Use sum + percentile distribution.
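A small worked example of why the average misleads, using made-up per-call costs (real ones would come from the ledger). CostDistribution and Summary are hypothetical names, and the percentile is nearest-rank.

```csharp
using System;
using System.Linq;

public static class CostDistribution
{
    public sealed record Summary(decimal Sum, decimal Average, decimal P50, decimal P99, decimal Max);

    public static Summary Summarize(decimal[] perCallCosts)
    {
        if (perCallCosts.Length == 0)
            throw new ArgumentException("need at least one call", nameof(perCallCosts));
        var sorted = perCallCosts.OrderBy(c => c).ToArray();
        return new Summary(
            sorted.Sum(), sorted.Average(),
            NearestRank(sorted, 50), NearestRank(sorted, 99), sorted[^1]);
    }

    // Nearest-rank percentile over an already sorted array.
    private static decimal NearestRank(decimal[] sorted, int percentile)
    {
        var rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }
}
```

For 980 calls at $0.002 and 20 calls at $0.50, the sum is $11.96 and the average about $0.012, a figure that matches no real call. P50 is $0.002 and P99 is $0.50, and the 20 expensive calls (2% of traffic) account for $10 of the $11.96.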

❌ Wrong — trusting cloud bill alone

Azure OpenAI bill arrives 1–3 days late, aggregated, with no per-feature tagging. By the time you see a $5k anomaly, it's been running 4 days. You need real-time, attributed metrics.

❌ Wrong — global rates constant in code

const decimal GPT4_RATE = 0.00003m;   // out of date in 3 months

Pull rates from config; refresh on schedule.

❌ Wrong — same telemetry for dev and prod

CostUsd.Add(cost, new("model", model)); // no env tag

Dev/staging eat into your "AI bill" view; finance asks why prod went up when it didn't. Always tag env.

✅ Correct — soft quota with alert

var state = await quota.CheckAsync(tenant);
if (state is QuotaState.Exceeded)
    return Results.StatusCode(429);                                            // hard block
if (state is QuotaState.Warning warning)
    _alerts.Notify(tenant, $"AI usage at {warning.Ratio:P0} of plan limit");   // soft warn
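The QuotaState type is not shown in these notes. One way it might be modeled so the pattern matching above works is a small closed record hierarchy. The nested record names follow the call site; Ok, Evaluate, and the warnAt parameter are hypothetical, and the thresholds assume "warn at 80%, block at 100%".

```csharp
using System;

// Hypothetical hierarchy: every state carries the spend-to-limit ratio.
public abstract record QuotaState(decimal Ratio)
{
    public sealed record Ok(decimal Ratio) : QuotaState(Ratio);
    public sealed record Warning(decimal Ratio) : QuotaState(Ratio);
    public sealed record Exceeded(decimal Ratio) : QuotaState(Ratio);

    public static QuotaState Evaluate(decimal monthlySpend, decimal planLimit, decimal warnAt = 0.8m)
    {
        // No plan or a zero budget: treat as exhausted rather than dividing by zero.
        if (planLimit <= 0m) return new Exceeded(1m);

        var ratio = monthlySpend / planLimit;
        if (ratio >= 1.0m) return new Exceeded(ratio);
        if (ratio >= warnAt) return new Warning(ratio);
        return new Ok(ratio);
    }
}
```

Keeping Evaluate free of I/O makes the thresholds easy to unit-test; the async CheckAsync would fetch the ledger sum and plan limit, then delegate to it.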

✅ Correct — cache savings tracked

if (cacheHit)
    _costSavedUsd.Add((double)estimatedCost, tags);

Design patterns for this topic

Pattern 1 — "Tag-on-emit, aggregate later"

Intent: every AI call emits with feature/tenant/user/model dimensions; downstream tooling slices freely.

Pattern 2 — "Cost ledger table"

Intent: drill-in capability, one row per call.
Storage: SQL or App Insights customEvents. Async-buffered write.

Pattern 3 — "Soft quota → hard quota"

Intent: warn at 80% of plan; block at 100%.
Mechanism: middleware checks ledger sum vs plan limit per tenant.

Pattern 4 — "Cache savings KPI"

Intent: show finance the value of cache investment.
Mechanism: record what the call would have cost on each hit.

Pattern 5 — "Rates from config"

Intent: rate changes don't require deploy.
Mechanism: App Configuration; refresh on interval; alert on unknown model.

Pattern 6 — "Forecast with P95"

Intent: end-of-month projection that doesn't get blown up by spikes.
Mechanism: P95 of last 7 days × days remaining; alert if last 24h > 1.5× the trailing 7-day daily average.

Pros & cons / trade-offs

Approach	Pros	Cons
Metrics only	Cheap; aggregate-friendly	No per-call drill-in
Ledger table	Audit + drill-in	Storage cost; write throughput
Cloud bill only	Zero work	Lagged; un-attributed; finds anomalies too late
Per-tenant chargeback	Aligns incentives	Disputes; accuracy bar
Showback only	Visibility without billing pain	No spending pressure
Custom `cost.usd` metric	$ visible in dashboards	You own rate table maintenance

When to use / when to avoid

Use middleware-based tracking from day 1 of every AI feature. Cheaper to add now than retrofit.
Use ledger when finance needs audit, when chargeback applies, or when bills are large enough to merit drill-in (>$1k/mo).
Use quotas in multi-tenant SaaS where one tenant can DoS your AI budget.
Avoid building dashboards before you have tagged data — you'll just see noise.
Avoid averaging — sum + P99.
Avoid linear forecasting — usage is bursty.
Avoid hardcoded rate tables; externalize them to config and refresh on a schedule.

Interview Q&A

Q1. What OTel signals does GenAI define for cost? Token counts as span attributes (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) and as the gen_ai.client.token.usage histogram, plus gen_ai.client.operation.duration (histogram), with attributes gen_ai.system, gen_ai.request.model, gen_ai.response.model.

Q2. Is there a standard cost.usd metric? No. Define your own custom metric (gen_ai.cost.usd); compute as tokens × rate.

Q3. How do you attribute cost to a feature/tenant/user? Tag every emitted metric and ledger row with feature/tenant/user dimensions. Without tags, attribution is impossible.

Q4. Where does cost tracking live in the IChatClient pipeline? A DelegatingChatClient middleware — captures usage from response, computes cost, emits metric, writes ledger row.

Q5. Streaming + cancellation cost gotcha? Server still generates tokens after the client disconnects unless the upstream HTTP call is actually aborted. Set timeouts; pass cancellation tokens through.

Q6. Anti-pattern: averaging cost? Yes: averages hide the tail. One $50 query among 10,000 cheap ones barely moves the mean. Always use sum + percentiles, and check the max for rare outliers.

Q7. Showback vs chargeback? Showback = visibility, no billing. Chargeback = actually bill the tenant. Chargeback raises accuracy + dispute bar.

Q8. How do you handle rate changes? Externalize to config (App Configuration); refresh interval; alert on unknown model.

Q9. Embeddings cost surprise? Cheap per token but enormous in bulk — re-indexing huge corpora can cost thousands. Track totals.

Q10. Reasoning model billing? o1/o3-style models bill reasoning tokens as output even when invisible — answer of 200 tokens may bill 4,000.

Q11. Forecasting end of month? P95 of last 7 days × days remaining; alert if last 24h > 1.5× the trailing 7-day daily average.

Q12. KQL for top expensive features?

customMetrics | where name == "gen_ai.cost.usd"
| summarize sum(value) by tostring(customDimensions.feature)
| top 10 by sum_value desc

Q13. Cache savings — how to surface? On cache hit, emit gen_ai.cost_saved.usd = what call would have cost. Sum across period = ROI of cache.

Q14. Per-tenant quota approach? Sum tenant's monthly cost from ledger; compare to plan limit; warn at 80%, block at 100%.

Gotchas / common mistakes

⚠️ Forgetting to tag env → dev usage skews "prod cost" dashboards.
⚠️ Averaging instead of summing + P99.
⚠️ Hardcoded rate constants — out of date by next quarter.
⚠️ Streaming disconnects don't save money unless you abort the upstream call.
⚠️ Reasoning tokens billed but invisible — o1 looks expensive without explanation.
⚠️ Multi-tool agents multiply token cost; tagging only the user request hides amplification.
⚠️ Ledger writes blocking the request path → use async buffer.
⚠️ Cloud bill anomaly detection is days late; build real-time alerts.
⚠️ Cache hit savings not measured → can't justify cache investment.
⚠️ Image gen flat per image, not per token — different rate code path.
⚠️ No alert when tenant exceeds plan → silent overspend until month-end shock.
⚠️ Treating prompt-cached tokens at full rate when provider charges 10–50%.