Telemetry & Caching
Key Points
- UseOpenTelemetry() emits OTel GenAI semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens.
- UseLogging(): structured logs per request.
- UseDistributedCache() hashes prompt → cached response. Massive cost saver.
- Sensitive data: by default, prompts and responses are NOT logged. Opt in with EnableSensitiveData = true.
- For cost: combine cache + token counters + alerts.
OpenTelemetry middleware
chat = chat.AsBuilder()
.UseOpenTelemetry(sourceName: "MyApp.AI", configure: o =>
{
o.EnableSensitiveData = false; // default; don't log prompts/responses
})
.Build();
Emits per-call:
Activity span: "chat gpt-4o-mini" (operation name + model)
gen_ai.system: "openai"
gen_ai.request.model: "gpt-4o-mini"
gen_ai.usage.input_tokens: 1234
gen_ai.usage.output_tokens: 567
gen_ai.response.id: "chatcmpl-..."
gen_ai.response.model: "gpt-4o-mini-2024-..."
gen_ai.response.finish_reasons: ["stop"]
When EnableSensitiveData=true:
Span event: "gen_ai.user.message" { content: "..." }
Span event: "gen_ai.assistant.message" { content: "..." }
OTel pipeline
builder.Services.AddOpenTelemetry()
.WithTracing(t => t.AddSource("MyApp.AI").AddOtlpExporter())
.WithMetrics(m => m.AddMeter("MyApp.AI").AddOtlpExporter()); // without AddMeter, no metrics from this source are exported
Sends spans and metrics to an OTLP collector → Datadog / App Insights / Jaeger / etc.
Logging middleware
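Logs:
- Request start (model, message count).
- Response (tokens, finish reason).
- Errors.
By default only metadata is logged. Full prompts and responses are written only when the logger's level is set to Trace, which should stay off in production.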
Distributed cache middleware
builder.Services.AddStackExchangeRedisCache(o => o.Configuration = redisConn);
chat = chat.AsBuilder()
.UseDistributedCache(sp.GetRequiredService<IDistributedCache>())
.Build();
Hashes (model + messages + options) → cache key. Identical request → cached response. No API call.
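To make the "hash → cache key" idea concrete, here is a minimal sketch of one way such a key could be derived. It is not the library's actual key derivation; CacheKeySketch, its parameters, and the choice of SHA-256 over a JSON payload are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text.Json;

public static class CacheKeySketch
{
    // Hypothetical helper: serializes the request parts in a fixed order
    // and hashes the bytes, so identical requests map to the same key.
    public static string Compute(
        string model,
        IEnumerable<(string Role, string Text)> messages,
        double? temperature)
    {
        var payload = new
        {
            model,
            messages = messages.Select(m => new { role = m.Role, text = m.Text }).ToArray(),
            temperature
        };
        byte[] json = JsonSerializer.SerializeToUtf8Bytes(payload);
        return Convert.ToHexString(SHA256.HashData(json)); // 64 hex characters
    }
}
```

Any change to the model, a message, or an option produces a different key, which is why a single extra space in a prompt is enough to turn a hit into a miss.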
new ChatOptions
{
/* ... */
AdditionalProperties = new() { ["cache_ttl"] = TimeSpan.FromHours(1) } // app-defined key: takes effect only if your caching layer reads it
}
Cache hit rate
Aim for 30%+ for production chatbots. Higher = greater savings.
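Hit rate is hits divided by total lookups. A minimal sketch of a thread-safe tracker follows; CacheStats is a hypothetical helper, not part of Microsoft.Extensions.AI, and it assumes you call RecordHit/RecordMiss from your own caching wrapper.

```csharp
using System.Threading;

public sealed class CacheStats
{
    private long _hits;
    private long _misses;

    public void RecordHit() => Interlocked.Increment(ref _hits);
    public void RecordMiss() => Interlocked.Increment(ref _misses);

    // Fraction of lookups served from cache; 0 when nothing has been recorded yet.
    public double HitRate
    {
        get
        {
            long hits = Interlocked.Read(ref _hits);
            long total = hits + Interlocked.Read(ref _misses);
            return total == 0 ? 0 : (double)hits / total;
        }
    }
}
```

At a 30% hit rate roughly three in ten requests never reach the provider, so spend falls by about the same fraction if cached and uncached requests are similar in size.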
Layered cache
chat = chat.AsBuilder()
.UseDistributedCache(redisCache) // L2
.Build();
// Microsoft.Extensions.AI's HybridCache integration in-progress at time of writing
Semantic caching (advanced)
Identical prompts cache trivially. Semantically similar (paraphrased) prompts don't, by default.
For semantic caching: embed the prompt, find the nearest cached prompt above a similarity threshold → return its cached response.
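// Sketch: IVectorCache is a hypothetical similarity store; only the latest user message is embedded.
public sealed class SemanticCache(
IChatClient inner,
IEmbeddingGenerator<string, Embedding<float>> embed,
IVectorCache vectorCache) : DelegatingChatClient(inner)
{
public override async Task<ChatResponse> GetResponseAsync(
IEnumerable<ChatMessage> messages, ChatOptions? options = null, CancellationToken cancellationToken = default)
{
var query = messages.LastOrDefault(m => m.Role == ChatRole.User)?.Text ?? string.Empty;
var queryEmb = await embed.GenerateAsync(query, cancellationToken: cancellationToken);
var nearest = await vectorCache.SearchAsync(queryEmb, threshold: 0.95);
if (nearest is { } cached) return cached.Response; // similar prompt seen before: skip the API call
var fresh = await base.GetResponseAsync(messages, options, cancellationToken);
await vectorCache.UpsertAsync(queryEmb, fresh);
return fresh;
}
}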
Trade-off: a threshold that is too lax returns cached answers for prompts that only look similar (false positives).
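The threshold is usually applied to cosine similarity between the query embedding and a cached embedding. The sketch below shows that comparison in isolation; SimilaritySketch and IsCacheHit are hypothetical names, and a real vector store would normally perform this search for you.

```csharp
using System;

public static class SimilaritySketch
{
    // Cosine similarity: 1.0 for identical direction, 0.0 for orthogonal vectors.
    public static double Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // a zero vector has no direction
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Treats the cached entry as a hit only when similarity reaches the threshold.
    public static bool IsCacheHit(ReadOnlySpan<float> query, ReadOnlySpan<float> cached, double threshold = 0.95)
        => Cosine(query, cached) >= threshold;
}
```

Lowering the threshold raises the hit rate but also the chance of answering a different question than the one asked, so it is worth tuning against real paraphrase pairs.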
Token counters
private static readonly Meter _m = new("MyApp.AI");
private static readonly Counter<long> _inTokens = _m.CreateCounter<long>("ai.tokens.input");
private static readonly Counter<long> _outTokens = _m.CreateCounter<long>("ai.tokens.output");
Track per tenant, per feature, per model by attaching tags to each measurement.
(Or wrap with a custom DelegatingChatClient that records usage after every call.)
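One way to record tagged measurements is sketched below, using only System.Diagnostics.Metrics. The TokenMetrics class, the Record method, and the "tenant.id" and "feature" tag names are assumptions for illustration; the token counts would come from the response's usage data.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class TokenMetrics
{
    private static readonly Meter AppMeter = new("MyApp.AI");
    private static readonly Counter<long> InTokens = AppMeter.CreateCounter<long>("ai.tokens.input");
    private static readonly Counter<long> OutTokens = AppMeter.CreateCounter<long>("ai.tokens.output");

    // Adds the token counts with tags so dashboards can slice by tenant, feature, and model.
    public static void Record(long inputTokens, long outputTokens, string tenantId, string feature, string model)
    {
        var tags = new TagList
        {
            { "tenant.id", tenantId },            // hypothetical tag name
            { "feature", feature },               // hypothetical tag name
            { "gen_ai.request.model", model },
        };
        InTokens.Add(inputTokens, tags);
        OutTokens.Add(outputTokens, tags);
    }
}
```

A call such as TokenMetrics.Record(1234, 567, "acme", "support-chat", "gpt-4o-mini") would sit right after the chat call. Tenant IDs as tag values can create many time series, so check your metrics backend's cardinality limits.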
Cost alerts
Per-tenant daily budget:
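// tenant: hypothetical per-request tenant accessor (same role as _tenant.Id in the controller example).
public class BudgetGuardClient(IChatClient inner, IBudgetService b, ITenantContext tenant) : DelegatingChatClient(inner)
{
public override async Task<ChatResponse> GetResponseAsync(
IEnumerable<ChatMessage> messages, ChatOptions? options = null, CancellationToken cancellationToken = default)
{
var tenantId = tenant.Id;
if (await b.IsExceeded(tenantId)) throw new BudgetExceededException(); // reject before spending tokens
var resp = await base.GetResponseAsync(messages, options, cancellationToken);
await b.AddAsync(tenantId, resp.Usage?.InputTokenCount ?? 0, resp.Usage?.OutputTokenCount ?? 0);
return resp;
}
}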
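The notes do not define IBudgetService, so the following is only a sketch of the bookkeeping such a service might do: a synchronous, in-memory tracker keyed by tenant and day. InMemoryDailyBudget and its members are hypothetical; a production version would more likely be async and backed by a shared store such as Redis.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class InMemoryDailyBudget(long dailyTokenLimit)
{
    // Tokens used so far, per tenant per calendar day.
    private readonly ConcurrentDictionary<(string Tenant, DateOnly Day), long> _used = new();

    public bool IsExceeded(string tenantId, DateOnly today)
        => _used.TryGetValue((tenantId, today), out var used) && used >= dailyTokenLimit;

    public void Add(string tenantId, DateOnly today, long inputTokens, long outputTokens)
        => _used.AddOrUpdate(
            (tenantId, today),
            inputTokens + outputTokens,
            (_, current) => current + inputTokens + outputTokens);
}
```

Keying by day means the budget resets naturally at the date boundary without a cleanup job, although old entries would still need to be evicted eventually.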
Per-request observability
[HttpPost("/chat")]
public async Task<IActionResult> Chat(string q)
{
using var activity = _activitySource.StartActivity("chat-request");
activity?.SetTag("user.id", User.FindFirstValue("sub"));
activity?.SetTag("tenant.id", _tenant.Id);
var resp = await _chat.GetResponseAsync(q); // OTel middleware adds GenAI tags
activity?.SetTag("output.length", resp.Text.Length);
return Ok(resp.Text);
}
Sensitive data handling
Don't log prompts/responses by default — they may contain PII, customer data.
For audit / debugging: log to a secure store with retention; redact PII.
chat = chat.AsBuilder()
.Use(c => new RedactingClient(c)) // strips emails, SSN, etc.
.UseLogging(loggerFactory) // message content is written only at LogLevel.Trace; raise the level for this logger only
.Build();
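RedactingClient is not shown in these notes. The text transformation at its core could look like the sketch below, which assumes two simple patterns (email addresses and US SSNs in the 123-45-6789 form). PiiRedactor is a hypothetical name, and regexes like these catch only the obvious cases.

```csharp
using System.Text.RegularExpressions;

public static class PiiRedactor
{
    private static readonly Regex Email =
        new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);
    private static readonly Regex Ssn =
        new(@"\b\d{3}-\d{2}-\d{4}\b", RegexOptions.Compiled);

    // Replaces matches with fixed placeholders so logs keep their shape but lose the values.
    public static string Redact(string text)
    {
        text = Email.Replace(text, "[EMAIL]");
        return Ssn.Replace(text, "[SSN]");
    }
}
```

A redacting client would apply Redact to each message's text before passing the request to the inner client, so everything further down the pipeline, including logging, sees only the placeholders.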
Prompt versioning
Tag prompts with a version so A/B comparisons and rollbacks can be traced back to the exact prompt text.
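A minimal sketch of version tagging on the current trace span follows. PromptTemplate, PromptTagging, and the "prompt.id" / "prompt.version" attribute names are assumptions for illustration, not part of the GenAI semantic conventions.

```csharp
using System.Diagnostics;

// Hypothetical prompt record: the version changes whenever the text changes.
public sealed record PromptTemplate(string Id, string Version, string Text);

public static class PromptTagging
{
    // Stamps the span so traces can be filtered or compared by prompt version.
    public static void Tag(Activity? activity, PromptTemplate prompt)
    {
        activity?.SetTag("prompt.id", prompt.Id);
        activity?.SetTag("prompt.version", prompt.Version);
    }
}
```

Calling PromptTagging.Tag(Activity.Current, prompt) just before the chat call lets you compare latency, token usage, and error rates between versions, and rolling back becomes a matter of selecting the previous PromptTemplate.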
Senior considerations
- OTel always: production AI without telemetry = blind.
- Cache always: even 5% hit rate saves money.
- EnableSensitiveData = false by default; opt-in for debug only.
- Per-tenant cost tracking for B2B.
- Alerts on cost spikes — abuse / bug detection.