Observability: tracing and time travel

Maps to: Observability: tracing, time travel.

Scope

Execution trees for production debugging, cost and error analytics, online evaluation, and forked replay from historical checkpoints.
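
As a rough illustration of how an execution tree supports cost and error analytics, the sketch below rebuilds a tree from flat span records and rolls cost and error counts up to the root. The record shape (`span_id`, `parent_id`, `cost_usd`, `error`) and the sample values are assumptions made for this example, not the schema of any particular tracing backend.

```python
from collections import defaultdict

# Hypothetical flat span records, as a trace store might return them.
spans = [
    {"span_id": "a", "parent_id": None, "name": "agent_run", "cost_usd": 0.0, "error": False},
    {"span_id": "b", "parent_id": "a", "name": "llm_call", "cost_usd": 0.012, "error": False},
    {"span_id": "c", "parent_id": "a", "name": "tool:search", "cost_usd": 0.001, "error": True},
    {"span_id": "d", "parent_id": "c", "name": "llm_call", "cost_usd": 0.004, "error": False},
]


def build_children(spans):
    """Index spans by parent id so the tree can be walked top-down."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)
    return children


def rollup(span, children):
    """Return (total_cost_usd, error_count) for a span and all its descendants."""
    cost, errors = span["cost_usd"], int(span["error"])
    for child in children[span["span_id"]]:
        child_cost, child_errors = rollup(child, children)
        cost += child_cost
        errors += child_errors
    return cost, errors


children = build_children(spans)
root = children[None][0]  # the span with no parent is the root of the tree
total_cost, total_errors = rollup(root, children)
print(f"{root['name']}: ${total_cost:.3f}, {total_errors} error(s)")
```

The same rollup can be run on any subtree, which is one way to attribute cost to a single tool call and the model calls nested beneath it.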

Design questions

Required span metadata (user, tenant, harness version, model, tool, cost).
Sampling versus full capture for high-volume workloads.
Who can access traces, how long traces are retained, and which fields are redacted in stored spans.
Time-travel fork semantics: what re-executes versus what is copied state.
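
One possible shape for the first three questions is sketched below: a span record carrying the required metadata, a deterministic per-trace sampling decision, and a redaction pass applied before storage. The `Span` fields, `keep_trace`, `redact`, and the email pattern are all hypothetical names and choices for this sketch. A real deployment would need a redaction policy that covers far more than email addresses.

```python
import hashlib
import re
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Span:
    # Required metadata from the list above; field names are assumptions.
    trace_id: str
    span_id: str
    user_id: str
    tenant_id: str
    harness_version: str
    model: str
    tool: Optional[str]
    cost_usd: float
    payload: dict = field(default_factory=dict)


def keep_trace(trace_id: str, rate: float) -> bool:
    """Hash the trace id into [0, 1) so every span of a trace gets the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Deliberately simple pattern, used only to illustrate the redaction step.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(span: Span) -> dict:
    """Return a storable record with the user id hashed and emails masked."""
    record = asdict(span)
    record["user_id"] = hashlib.sha256(span.user_id.encode()).hexdigest()[:12]
    record["payload"] = {
        key: EMAIL.sub("[REDACTED_EMAIL]", value) if isinstance(value, str) else value
        for key, value in span.payload.items()
    }
    return record


span = Span(
    trace_id="trace-123", span_id="s1", user_id="u1", tenant_id="acme",
    harness_version="1.4.0", model="example-model", tool="search",
    cost_usd=0.002, payload={"prompt": "mail bob@example.com now"},
)
if keep_trace(span.trace_id, rate=1.0):  # rate=1.0 means full capture
    stored = redact(span)
    print(stored["payload"]["prompt"])
```

Sampling on a hash of the trace id, rather than per span, keeps each sampled execution tree complete. Lowering `rate` trades storage for coverage without leaving partial traces behind.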

Tradeoffs

Full traces enable fast incident response but increase storage and compliance scope.
Online LLM judges catch regressions early but add cost and introduce judge bias.
Time travel accelerates debugging, but a replay can diverge from the original production run unless sources of randomness are recorded or pinned.
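
The time-travel tradeoff can be made concrete with a toy harness. In this sketch, forking copies the state saved at a checkpoint and re-executes only the steps after it. Restoring the captured random-number-generator state is what keeps the replay aligned with the original run. `Checkpoint`, `run`, `fork`, and the step functions are hypothetical stand-ins for model and tool calls, not a real framework API.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int         # index of the next step to execute
    state: dict       # copied on fork, never re-executed
    rng_state: tuple  # captured so a replay can pin randomness


def run(steps, state, rng, start=0, checkpoints=None):
    """Execute steps[start:], saving a checkpoint before each step if requested."""
    for i in range(start, len(steps)):
        if checkpoints is not None:
            checkpoints.append(Checkpoint(i, copy.deepcopy(state), rng.getstate()))
        steps[i](state, rng)
    return state


def fork(checkpoint, steps, pin_randomness=True, new_seed=None):
    """Copy checkpointed state, then re-execute everything from that step onward."""
    state = copy.deepcopy(checkpoint.state)
    rng = random.Random(new_seed)
    if pin_randomness:
        rng.setstate(checkpoint.rng_state)
    return run(steps, state, rng, start=checkpoint.step)


# Stand-ins for model and tool calls; only pick_tool is nondeterministic.
def plan(state, rng):
    state["path"].append("plan")


def pick_tool(state, rng):
    state["path"].append(rng.choice(["search", "calculator", "browser"]))


def answer(state, rng):
    state["path"].append("answer")


steps = [plan, pick_tool, answer]
checkpoints = []
original = run(steps, {"path": []}, random.Random(7), checkpoints=checkpoints)

# Fork just before pick_tool: "plan" is copied state, the rest re-executes.
replayed = fork(checkpoints[1], steps)
unpinned = fork(checkpoints[1], steps, pin_randomness=False)  # may pick another tool
print(original["path"], replayed["path"], unpinned["path"])
```

With randomness pinned, the replay follows the original tool path. Without it, the fork is still valid but may not reproduce the production behavior being debugged.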

Evaluation hooks

Reproduce a reported failure from its trace ID alone.
Online evaluation runs against a canary harness change before full rollout.
Fork from a checkpoint and compare the tool path under an alternate prompt.
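
Two of these hooks reduce to small checks, sketched below under stated assumptions. `first_divergence` compares the tool path of an original trace against a forked run, and `canary_passes` gates a rollout on online judge scores. Both function names, the sample scores, and the `max_drop` threshold are illustrative choices, not values from the source.

```python
from itertools import zip_longest
from statistics import mean


def first_divergence(original, forked):
    """Return the index of the first differing tool call, or None if the paths match."""
    for i, (a, b) in enumerate(zip_longest(original, forked)):
        if a != b:
            return i
    return None


def canary_passes(baseline_scores, canary_scores, max_drop=0.05):
    """Allow rollout only if the canary's mean judge score has not dropped too far."""
    return mean(canary_scores) >= mean(baseline_scores) - max_drop


# Tool paths from the original trace and from a fork run with an alternate prompt.
original_path = ["search", "fetch_page", "answer"]
forked_path = ["search", "calculator", "answer"]
print(first_divergence(original_path, forked_path))

# Hypothetical online judge scores for the current harness and the canary.
print(canary_passes([0.9, 0.8], [0.88, 0.8]))
```

Reporting the first divergent step, rather than only a pass or fail result, points a reviewer at the exact span to open in the trace.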

Reference notes

See LangChain runtime article (improvement loop and time travel figures).