Telemetry & Caching

Key Points

UseOpenTelemetry() emits OTel GenAI semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens.
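UseLogging() emits structured logs per request.
UseDistributedCache() hashes the request into a cache key and returns the cached response on a match. Repeated identical requests make no API call.
Sensitive data: by default, prompts and responses are NOT recorded. Opt in with EnableSensitiveData = true on the OpenTelemetry middleware, or with Trace-level logging on the logging middleware.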
For cost: combine cache + token counters + alerts.

OpenTelemetry middleware

chat = chat.AsBuilder()
    .UseOpenTelemetry(sourceName: "MyApp.AI", configure: o =>
    {
        o.EnableSensitiveData = false;  // default; don't log prompts/responses
    })
    .Build();

Emits per-call:

Activity span: "chat gpt-4o-mini" (operation name + request model)
  gen_ai.system: "openai"
  gen_ai.request.model: "gpt-4o-mini"
  gen_ai.usage.input_tokens: 1234
  gen_ai.usage.output_tokens: 567
  gen_ai.response.id: "chatcmpl-..."
  gen_ai.response.model: "gpt-4o-mini-2024-..."
  gen_ai.response.finish_reasons: ["stop"]

When EnableSensitiveData=true:

Span event: "gen_ai.user.message"  { content: "..." }
Span event: "gen_ai.assistant.message" { content: "..." }

OTel pipeline

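builder.Services.AddOpenTelemetry()
    .WithTracing(t => t.AddSource("MyApp.AI").AddOtlpExporter())
    // AddMeter is needed too; without it the "MyApp.AI" meter's metrics are never exported
    .WithMetrics(m => m.AddMeter("MyApp.AI").AddOtlpExporter());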

Exports spans and metrics over OTLP to a collector, which forwards them to Datadog / App Insights / Jaeger / etc.

Logging middleware

chat = chat.AsBuilder()
    .UseLogging(loggerFactory)
    .Build();

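Logs:
- Invocation start and completion (Debug level).
- Full messages, options, and responses, including usage and finish reason (Trace level only).
- Errors and cancellations.

Default log levels capture metadata only. LoggingChatClient has no EnableSensitiveData switch; full prompts appear only when the logger has Trace enabled:

builder.Logging.AddFilter("Microsoft.Extensions.AI", LogLevel.Trace);   // debug environments only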

Distributed cache middleware

builder.Services.AddStackExchangeRedisCache(o => o.Configuration = redisConn);

chat = chat.AsBuilder()
    .UseDistributedCache(sp.GetRequiredService<IDistributedCache>())
    .Build();

Hashes (model + messages + options) → cache key. Identical request → cached response. No API call.

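new ChatOptions
{
    /* ... */
    // "cache_ttl" is an app-defined key, not one the built-in caching client is documented to read.
    // It only takes effect if a custom caching client or IDistributedCache wrapper honors it.
    // Because options are hashed, it also becomes part of the cache key.
    AdditionalProperties = new() { ["cache_ttl"] = TimeSpan.FromHours(1) }
}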
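
To make the key derivation above concrete, here is a minimal sketch of hashing a request into a deterministic key. It is not the library's actual algorithm: CacheKeySketch, its parameters, and the JSON layout are assumptions for illustration, using only System.Text.Json and SHA-256 from the base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text.Json;

// Hypothetical helper, not part of Microsoft.Extensions.AI.
public static class CacheKeySketch
{
    public static string Compute(
        string model,
        IEnumerable<(string Role, string Text)> messages,
        double? temperature)
    {
        // Fixed ordering: model, options, then each message in sequence.
        var payload = new List<object?> { model, temperature };
        foreach (var (role, text) in messages)
        {
            payload.Add(new[] { role, text });
        }

        // Same inputs serialize to the same bytes, so they hash to the same key.
        byte[] json = JsonSerializer.SerializeToUtf8Bytes(payload);
        return Convert.ToHexString(SHA256.HashData(json));
    }
}
```

Any change to the model, a message, or an option yields a different key, which is why a paraphrased prompt misses an exact-match cache.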

Cache hit rate

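// App Insights (KQL). "ai.cache.hit_rate" is a custom metric the app must emit itself.
customMetrics | where name == "ai.cache.hit_rate"

As a rule of thumb, aim for 30%+ for production chatbots. A higher hit rate means fewer paid API calls.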
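
One way to produce that metric is sketched below. It assumes the application records each lookup outcome itself; CacheStats and the counter names ai.cache.hits / ai.cache.misses are hypothetical, while Meter, Counter<T>, and CreateObservableGauge come from System.Diagnostics.Metrics.

```csharp
using System.Diagnostics.Metrics;
using System.Threading;

// Hypothetical helper; register as a singleton so the gauge is created once.
public sealed class CacheStats
{
    private readonly Counter<long> _hitCounter;
    private readonly Counter<long> _missCounter;
    private long _hits;
    private long _misses;

    public CacheStats(Meter meter)
    {
        _hitCounter = meter.CreateCounter<long>("ai.cache.hits");
        _missCounter = meter.CreateCounter<long>("ai.cache.misses");
        // Observed whenever the metrics pipeline collects.
        meter.CreateObservableGauge<double>("ai.cache.hit_rate", () => HitRate);
    }

    // Call once per cache lookup with the outcome.
    public void Record(bool hit)
    {
        if (hit) { Interlocked.Increment(ref _hits); _hitCounter.Add(1); }
        else { Interlocked.Increment(ref _misses); _missCounter.Add(1); }
    }

    // Hit rate in [0, 1]; 0 until something has been recorded.
    public double HitRate
    {
        get
        {
            long h = Interlocked.Read(ref _hits);
            long m = Interlocked.Read(ref _misses);
            return h + m == 0 ? 0 : (double)h / (h + m);
        }
    }
}
```

The gauge here is a lifetime ratio. Exporting the two counters and computing the ratio in the query is usually more useful for windowed dashboards.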

Layered cache

chat = chat.AsBuilder()
    .UseDistributedCache(redisCache)   // L2
    .Build();
// An in-memory L1 would sit in front of Redis; Microsoft.Extensions.AI's HybridCache integration was in progress at time of writing

Semantic caching (advanced)

Identical prompts cache trivially. Semantically similar (paraphrased) prompts don't, by default.

For semantic caching: embed the query, find the nearest cached entry, and reuse its response when the similarity clears a threshold.

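// ISemanticStore is a hypothetical vector-store abstraction (SearchAsync / UpsertAsync).
public class SemanticCache(
    IChatClient inner,
    IEmbeddingGenerator<string, Embedding<float>> embed,
    ISemanticStore vectorCache) : DelegatingChatClient(inner)
{
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages, ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // Keyed on the last user message only; history and options are ignored in this sketch.
        var query = messages.LastOrDefault(m => m.Role == ChatRole.User)?.Text ?? string.Empty;
        var queryEmb = await embed.GenerateAsync(query, cancellationToken: cancellationToken);
        var nearest = await vectorCache.SearchAsync(queryEmb, threshold: 0.95);
        if (nearest is { } cached) return cached.Response;

        var fresh = await base.GetResponseAsync(messages, options, cancellationToken);
        await vectorCache.UpsertAsync(queryEmb, fresh);
        return fresh;
    }
}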

Trade-off: a threshold that is too lax returns the cached answer to a different question (false positive); one that is too strict rarely hits.
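
The threshold decision can be illustrated with plain cosine similarity. This is a sketch under the assumption that the vector store compares embeddings by cosine similarity; SimilaritySketch is a hypothetical helper, and a real store would do this search server-side.

```csharp
using System;

// Hypothetical helper showing how a similarity threshold gates a cache hit.
public static class SimilaritySketch
{
    public static double Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // A zero vector has no direction; treat it as no match.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Lowering the threshold raises the hit rate and the false-positive risk together.
    public static bool IsCacheHit(
        ReadOnlySpan<float> query, ReadOnlySpan<float> cached, double threshold = 0.95)
        => Cosine(query, cached) >= threshold;
}
```

Tuning the threshold against a labeled set of paraphrase pairs is a common way to pick a value, since the right cutoff depends on the embedding model.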

Token counters

private static readonly Meter _m = new("MyApp.AI");
private static readonly Counter<long> _inTokens = _m.CreateCounter<long>("ai.tokens.input");
private static readonly Counter<long> _outTokens = _m.CreateCounter<long>("ai.tokens.output");

Track per tenant, per feature, per model:

chat = chat.AsBuilder().Use(c => new TokenCountingClient(c, /* tags */)).Build();

(Or wrap with custom DelegatingChatClient.)
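
The per-tenant, per-feature, per-model breakdown comes from tags on each measurement. A minimal sketch of the recording side, assuming the counters declared above; TokenMetrics and the tag names tenant.id and feature are hypothetical, and a TokenCountingClient would call Record with the values from resp.Usage.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical helper; only the Meter/Counter/TagList APIs are real.
public static class TokenMetrics
{
    private static readonly Meter AppMeter = new("MyApp.AI");
    private static readonly Counter<long> InTokens = AppMeter.CreateCounter<long>("ai.tokens.input");
    private static readonly Counter<long> OutTokens = AppMeter.CreateCounter<long>("ai.tokens.output");

    public static void Record(long input, long output, string tenant, string feature, string model)
    {
        // Tags become dimensions you can group and filter by in the metrics backend.
        var tags = new TagList
        {
            { "tenant.id", tenant },
            { "feature", feature },
            { "gen_ai.request.model", model },
        };
        InTokens.Add(input, tags);
        OutTokens.Add(output, tags);
    }
}
```

Keep tag values low-cardinality (tenant, feature, model). Per-user or per-request values belong on spans, not metrics.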

Cost alerts

Per-tenant daily budget:

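// IBudgetService and BudgetExceededException are app-defined; tenantId is supplied when the client is built per tenant.
public class BudgetGuardClient(IChatClient inner, IBudgetService b, string tenantId) : DelegatingChatClient(inner)
{
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages, ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // Checked before the call; usage is added afterwards, so one request can overshoot the budget.
        if (await b.IsExceeded(tenantId)) throw new BudgetExceededException();
        var resp = await base.GetResponseAsync(messages, options, cancellationToken);
        await b.AddAsync(tenantId, resp.Usage?.InputTokenCount ?? 0, resp.Usage?.OutputTokenCount ?? 0);
        return resp;
    }
    // Streaming calls need the same guard in GetStreamingResponseAsync.
}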
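
The budget service itself is left undefined above. As a sketch only, an in-memory version could look like the following; the IBudgetService shape is inferred from the two calls in the guard, and InMemoryBudgetService is hypothetical. A multi-instance deployment would need a shared store such as Redis instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Assumed interface shape, inferred from how the guard client calls it.
public interface IBudgetService
{
    Task<bool> IsExceeded(string tenantId);
    Task AddAsync(string tenantId, long inputTokens, long outputTokens);
}

// Hypothetical single-process implementation with a daily token limit.
public sealed class InMemoryBudgetService(long dailyTokenLimit) : IBudgetService
{
    // Keyed by (tenant, UTC date), so a new day starts from zero.
    private readonly ConcurrentDictionary<(string Tenant, DateOnly Day), long> _used = new();

    private static DateOnly Today => DateOnly.FromDateTime(DateTime.UtcNow);

    public Task<bool> IsExceeded(string tenantId)
    {
        _used.TryGetValue((tenantId, Today), out long total);
        return Task.FromResult(total >= dailyTokenLimit);
    }

    public Task AddAsync(string tenantId, long inputTokens, long outputTokens)
    {
        long delta = inputTokens + outputTokens;
        _used.AddOrUpdate((tenantId, Today), delta, (_, total) => total + delta);
        return Task.CompletedTask;
    }
}
```

Counting tokens rather than currency keeps the sketch simple. Converting to cost requires per-model input and output prices, which are not covered here.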

Per-request observability

[HttpPost("/chat")]
public async Task<IActionResult> Chat(string q)
{
    using var activity = _activitySource.StartActivity("chat-request");
    activity?.SetTag("user.id", User.FindFirstValue("sub"));
    activity?.SetTag("tenant.id", _tenant.Id);

    var resp = await _chat.GetResponseAsync(q);   // OTel middleware adds GenAI tags

    activity?.SetTag("output.length", resp.Text.Length);
    return Ok(resp.Text);
}

Sensitive data handling

Don't log prompts/responses by default: they may contain PII or customer data.

For audit / debugging: log to a secure store with retention; redact PII.

chat = chat.AsBuilder()
    .Use(c => new RedactingClient(c))   // outermost: strips emails, SSN, etc. before logging and the model see the request
    .UseLogging(loggerFactory)          // message content is written only when Trace level is enabled
    .Build();
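
RedactingClient is not a library type. The core of such a client is a text scrubber, sketched below with two regular expressions. Redactor and its patterns are illustrative assumptions: they cover simple email addresses and US SSN formats only, and are not a complete PII detector.

```csharp
using System.Text.RegularExpressions;

// Hypothetical scrubber a RedactingClient could apply to each message's text.
public static class Redactor
{
    private static readonly Regex Email =
        new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);

    // Matches the NNN-NN-NNNN form only.
    private static readonly Regex Ssn =
        new(@"\b\d{3}-\d{2}-\d{4}\b", RegexOptions.Compiled);

    public static string Redact(string text)
    {
        text = Email.Replace(text, "[EMAIL]");
        return Ssn.Replace(text, "[SSN]");
    }
}
```

For example, Redactor.Redact("Mail bob@example.com, SSN 123-45-6789.") returns "Mail [EMAIL], SSN [SSN].". Pattern matching misses names, addresses, and free-form identifiers, so treat it as one layer next to access controls and retention limits on the log store.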

Prompt versioning

Tag prompts with a version to support A/B tests and rollback:

activity?.SetTag("prompt.version", "v3");

Senior considerations

OTel always: production AI without telemetry = blind.
Cache always: even 5% hit rate saves money.
EnableSensitiveData = false by default; opt-in for debug only.
Per-tenant cost tracking for B2B.
Alerts on cost spikes — abuse / bug detection.