The standard advice for adding observability to a production system is "instrument early." For AI agents, teams usually do the opposite - they ship first, instrument later, and discover they're missing the data they need right when they need it most.
The reason isn't laziness. It's that naive trace capture is slow, and slow observability creates a real dilemma: capture everything and add latency, or skip it and fly blind.
We've spent a lot of time on this problem. Here's the architecture we landed on.
What you actually need to capture
A useful trace for an AI agent isn't just the final input/output pair. It's the full execution context:
- The initial input and any retrieved context (RAG chunks, tool outputs, memory)
- Every intermediate step in a multi-turn or multi-agent flow
- The final output and any structured fields extracted from it
- Timing data at each step
- A session or request ID that links everything together
- The model version and sampling parameters
This is more data than most logging pipelines are designed to handle, especially at high throughput.
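To make the list above concrete, here is a minimal sketch of what one trace record could look like. The class and field names (`TraceRecord`, `TraceStep`, `extracted_fields`, and so on) are illustrative assumptions, not a schema the architecture prescribes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TraceStep:
    """One intermediate step (LLM call, tool call, retrieval) in a flow."""
    name: str
    input: Any
    output: Any
    started_at: float   # epoch seconds
    duration_ms: float  # timing data for this step


@dataclass
class TraceRecord:
    """Hypothetical schema covering the fields listed above."""
    session_id: str                      # links everything together
    model_version: str
    sampling_params: Dict[str, Any]
    initial_input: str
    retrieved_context: List[str] = field(default_factory=list)  # RAG chunks, tool outputs, memory
    steps: List[TraceStep] = field(default_factory=list)
    final_output: Optional[str] = None
    extracted_fields: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() recurses into the nested TraceStep entries.
        return json.dumps(asdict(self), default=str)


record = TraceRecord(
    session_id="req-123",
    model_version="example-model-v1",   # placeholder model name
    sampling_params={"temperature": 0.2},
    initial_input="Summarize the ticket",
)
record.steps.append(TraceStep("llm_call", "Summarize the ticket", "Done.", 0.0, 12.5))
serialized = record.to_json()
```

Even this small record serializes to a few hundred bytes; with retrieved context attached, payloads grow quickly.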
The synchronous trap
The simplest implementation is synchronous: after each LLM call, serialize the trace data and write it to storage before returning. This is easy to reason about and guarantees you capture every call.
The problem is latency. Serializing a trace with full context can add 10–50ms depending on payload size. For an agent making 3–5 LLM calls per request, that's 30–250ms of pure overhead before you've done any storage I/O. In workflows where p99 latency matters, this is unacceptable.
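A minimal sketch of the synchronous pattern shows where the time goes. The wrapper name `trace_synchronously`, the `write` callback, and the `overhead_ms` attribute are all hypothetical, and the "storage" here is just a list standing in for a real write.

```python
import json
import time
from typing import Any, Callable, List


def trace_synchronously(
    llm_call: Callable[..., str],
    write: Callable[[str], None],
) -> Callable[..., str]:
    """Wrap llm_call so each trace is serialized and written before returning."""

    def wrapper(prompt: str, **params: Any) -> str:
        output = llm_call(prompt, **params)
        started = time.perf_counter()
        payload = json.dumps({"input": prompt, "params": params, "output": output})
        write(payload)  # the caller waits here until the write returns
        wrapper.overhead_ms += (time.perf_counter() - started) * 1000.0
        return output

    wrapper.overhead_ms = 0.0  # cumulative in-path tracing time, in ms
    return wrapper


stored: List[str] = []
fake_llm = lambda prompt, **params: prompt.upper()  # stand-in for a real model call
traced = trace_synchronously(fake_llm, stored.append)

for _ in range(5):  # an agent making five LLM calls in one request
    traced("summarize this", temperature=0.2)

print(f"{len(stored)} traces, {traced.overhead_ms:.3f} ms spent tracing in the request path")
```

Every millisecond accumulated in `overhead_ms` is paid by the request, once per call, which is why the cost scales with the number of LLM calls in a flow.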
The async pipeline
The architecture that works is a two-stage async pipeline:
Stage 1: In-process buffer. Each LLM call writes its trace data to an in-memory ring buffer. The write is non-blocking - you append to the buffer and return immediately. The buffer holds the last N entries (we use N=1000 by default, configurable based on memory budget), so if writes outpace the flush, the oldest unflushed entries are overwritten rather than blocking the caller.
Stage 2: Background flush. A background thread drains the buffer to persistent storage on a configurable interval (default 500ms) or when the buffer hits a threshold fill level. The flush is batched to minimize I/O operations.
The overhead of the in-process buffer write is under 1ms in our benchmarks, usually 200–400µs. The background flush is completely decoupled from the request path.
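The two stages can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the production implementation: `TraceBuffer`, `sink`, and the 0.8 fill threshold are hypothetical, and the sketch uses a short-lived lock around a `deque` rather than a truly lock-free ring buffer.

```python
import threading
from collections import deque
from typing import Any, Callable, Deque, Dict, List


class TraceBuffer:
    """Bounded in-memory buffer drained by a background thread."""

    def __init__(
        self,
        sink: Callable[[List[Dict[str, Any]]], None],  # batched write to storage
        capacity: int = 1000,            # N from the text
        flush_interval_s: float = 0.5,   # 500ms default from the text
        flush_threshold: float = 0.8,    # assumed fill level for an early flush
    ) -> None:
        self._sink = sink
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._buf: Deque[Dict[str, Any]] = deque(maxlen=capacity)
        self._lock = threading.Lock()
        self._wake = threading.Event()
        self._stopped = threading.Event()
        self._threshold = max(1, int(capacity * flush_threshold))
        self._interval = flush_interval_s
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def record(self, entry: Dict[str, Any]) -> None:
        """Request-path write: append and return without touching storage."""
        with self._lock:
            self._buf.append(entry)
            if len(self._buf) >= self._threshold:
                self._wake.set()  # ask the background thread to flush early

    def flush(self) -> None:
        """Take the pending batch under the lock, then write it outside the lock."""
        with self._lock:
            batch = list(self._buf)
            self._buf.clear()
        if batch:
            self._sink(batch)

    def _run(self) -> None:
        while not self._stopped.is_set():
            self._wake.wait(timeout=self._interval)  # interval elapsed or threshold hit
            self._wake.clear()
            self.flush()

    def close(self) -> None:
        """Stop the background thread and flush whatever is left."""
        self._stopped.set()
        self._wake.set()
        self._thread.join()
        self.flush()


batches: List[List[Dict[str, Any]]] = []
buffer = TraceBuffer(sink=batches.append, capacity=10, flush_interval_s=0.05)
for i in range(5):
    buffer.record({"call": i})
buffer.close()
```

The storage write happens outside the lock, so a slow sink delays the next flush but never the `record()` call on the request path.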
Handling failures
The tradeoff with async is that you can lose data if the process crashes before a flush completes. For most observability use cases this is acceptable - you need statistically representative sampling, not perfect capture. For audit or compliance use cases, you want synchronous capture: the request waits until a separate writer thread has committed the trace to a write-ahead log (WAL).
We handle this with a capture mode flag:
- **sampled** (default): async buffer, ~0.3ms overhead, up to 500ms data loss window on crash
- **complete**: synchronous write to WAL, ~8ms overhead, zero data loss
- **critical**: synchronous write with confirmation, ~20ms overhead, for high-stakes flows where you need guaranteed capture
Most production deployments run sampled mode for routine calls and complete mode for calls that touch sensitive data or external systems.
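One way to wire up the mode flag is sketched below. The names `CaptureMode`, `TraceWriter`, and `choose_mode` are hypothetical, and the split between the two synchronous modes is an assumption: here "complete" appends to the WAL and flushes to the OS, while "critical" additionally calls `os.fsync` as its confirmation. The actual confirmation mechanism may differ.

```python
import enum
import json
import os
from typing import Any, Callable, Dict


class CaptureMode(enum.Enum):
    SAMPLED = "sampled"    # async buffer, may lose the last flush window on crash
    COMPLETE = "complete"  # synchronous append to the WAL
    CRITICAL = "critical"  # synchronous append plus confirmation


class TraceWriter:
    def __init__(self, enqueue: Callable[[Dict[str, Any]], None], wal_path: str) -> None:
        self._enqueue = enqueue  # any non-blocking enqueue function
        self._wal = open(wal_path, "a", encoding="utf-8")

    def capture(self, entry: Dict[str, Any], mode: CaptureMode = CaptureMode.SAMPLED) -> None:
        if mode is CaptureMode.SAMPLED:
            self._enqueue(entry)
            return
        self._wal.write(json.dumps(entry) + "\n")
        self._wal.flush()  # hand the bytes to the OS before returning
        if mode is CaptureMode.CRITICAL:
            # Assumption: "confirmation" means forcing the write to disk.
            os.fsync(self._wal.fileno())

    def close(self) -> None:
        self._wal.close()


def choose_mode(touches_sensitive_data: bool, calls_external_system: bool) -> CaptureMode:
    """Routing rule from the text: complete mode for sensitive or external calls."""
    if touches_sensitive_data or calls_external_system:
        return CaptureMode.COMPLETE
    return CaptureMode.SAMPLED
```

Choosing the mode per call rather than per process keeps the cheap path as the default while letting individual call sites opt into stronger guarantees.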
Trace correlation across services
For multi-service architectures, the trace needs to follow the request across service boundaries. We use a propagation header (similar to W3C Trace Context) that gets attached to every outbound call and extracted on every inbound call. The trace assembler correlates spans by the shared trace ID.
This is the same pattern used by distributed tracing systems like Jaeger and Zipkin - we just extend it to capture LLM-specific fields (tokens, model, sampling params) that standard tracing doesn't include.
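The propagation and correlation steps might look like the sketch below. The header name `x-agent-trace` and the helper names are hypothetical; the value layout (`version-traceid-spanid-flags` in lowercase hex) is borrowed from the W3C `traceparent` header, and the LLM-specific span fields are illustrative.

```python
import re
import secrets
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

HEADER = "x-agent-trace"  # hypothetical header name, not taken from the source
_FORMAT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$")


def inject(headers: Dict[str, str], trace_id: Optional[str] = None) -> Tuple[str, str]:
    """Attach trace context to an outbound call; start a new trace if none exists."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)                # 16 hex characters
    headers[HEADER] = f"00-{trace_id}-{span_id}-01"
    return trace_id, span_id


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id) from an inbound call, or None if absent or malformed."""
    match = _FORMAT.match(headers.get(HEADER, ""))
    return (match.group(1), match.group(2)) if match else None


def llm_span(trace_id: str, parent_span_id: Optional[str], model: str,
             prompt_tokens: int, completion_tokens: int,
             sampling_params: Dict[str, Any]) -> Dict[str, Any]:
    """A span carrying the LLM-specific fields that standard tracing omits."""
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent_span_id,
        "model": model,
        "tokens": {"prompt": prompt_tokens, "completion": completion_tokens},
        "sampling_params": sampling_params,
    }


def assemble(spans: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group spans from all services by their shared trace ID."""
    by_trace: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)
    return dict(by_trace)


# Service A starts a trace and calls service B.
outbound: Dict[str, str] = {}
trace_id, caller_span_id = inject(outbound)

# Service B reads the header and records its LLM call under the same trace.
context = extract(outbound)
span = llm_span(context[0], context[1], "example-model-v1", 420, 96, {"temperature": 0.2})
traces = assemble([span])
```

Rejecting malformed headers in `extract` means a bad upstream value starts a fresh trace instead of corrupting an existing one.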
What you get
At 10,000 calls/day with this architecture, you have a near-complete record of every LLM interaction (fully complete for calls captured in complete or critical mode) with less than 5ms average overhead. That trace data is what feeds your eval sets, your failure replay environments, and your fine-tuning pipelines.
Observability isn't expensive if you build it right. It's one of the few investments in your AI stack that pays for itself immediately.