Observability: tracing and time travel
Maps to: Observability: tracing, time travel.
Scope
Execution trees for production debugging, cost and error analytics, online evaluation, and forked replay from historical checkpoints.
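As a rough illustration of how an execution tree supports cost analytics, the sketch below rebuilds parent/child links from flat span rows and rolls cost up to each subtree. The row layout `(span_id, parent_id, name, cost_usd)` and the function name `subtree_costs` are assumptions made for this example, not a format defined by any particular tracing backend.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical flat span row: (span_id, parent_id, name, cost_usd).
SpanRow = Tuple[str, Optional[str], str, float]


def subtree_costs(spans: List[SpanRow]) -> Dict[str, float]:
    """Return the total cost of each span including all of its descendants."""
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    own_cost: Dict[str, float] = {}
    for span_id, parent_id, _name, cost in spans:
        own_cost[span_id] = cost
        children[parent_id].append(span_id)

    totals: Dict[str, float] = {}

    def total(span_id: str) -> float:
        # Memoized so each span is summed once.
        if span_id not in totals:
            totals[span_id] = own_cost[span_id] + sum(
                total(child) for child in children[span_id]
            )
        return totals[span_id]

    for span_id in own_cost:
        total(span_id)
    return totals


spans = [
    ("root", None, "agent_run", 0.0),
    ("llm1", "root", "model_call", 0.25),
    ("tool1", "root", "search_tool", 0.5),
    ("llm2", "tool1", "model_call", 0.25),
]
costs = subtree_costs(spans)
```

Here `costs["root"]` covers the whole run, while `costs["tool1"]` isolates the tool call and the model call nested under it.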
Design questions
- Which span metadata is required (user, tenant, harness version, model, tool, cost)?
- Sampling or full capture for high-volume workloads?
- Who can access traces and for how long, and what must be redacted in stored spans?
- Time-travel fork semantics: what re-executes and what is copied state?
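One way to make the first three questions concrete is a span record that carries the listed metadata, a per-trace sampling decision, and a redaction pass applied before storage. This is a minimal sketch under stated assumptions: the `Span` fields, `keep_trace`, `redact`, and the email-only redaction rule are all hypothetical names and choices for illustration, not an existing library's API.

```python
import hashlib
import re
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical span record carrying the metadata listed above."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    user: str
    tenant: str
    harness_version: str
    model: str
    tool: Optional[str]
    cost_usd: float
    payload: str = ""


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling keyed on trace id.

    Every span of a trace gets the same decision, so sampled traces
    are stored whole rather than with missing spans.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # 0 .. 2**64 - 1
    return bucket < int(sample_rate * 2**64)


# Assumed redaction rule: mask anything that looks like an email address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(span: Span) -> Span:
    """Return a copy of the span with emails masked in the payload."""
    return replace(span, payload=EMAIL.sub("[REDACTED_EMAIL]", span.payload))


span = Span(
    trace_id="trace-123", span_id="s1", parent_id=None,
    user="u-42", tenant="acme", harness_version="1.4.2",
    model="example-model", tool=None, cost_usd=0.01,
    payload="contact a@b.com now",
)
if keep_trace(span.trace_id, sample_rate=1.0):
    stored = redact(span)
```

Hashing the trace id rather than drawing a random number per span is a design choice: it keeps the decision reproducible across services that see the same trace. A real deployment would likely need broader redaction rules than a single regular expression.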
Tradeoffs
- Full traces enable fast incident response but increase storage and compliance scope.
- Online LLM judges catch regressions early but add cost and judge bias.
- Time travel accelerates debugging, but a replay can diverge from the original production run unless sources of randomness are controlled.
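The divergence risk in the last tradeoff can be reduced for randomness the harness itself owns by recording a seed in each checkpoint and deriving the per-step generator from it. The sketch below assumes a hypothetical `Checkpoint` with an `rng_seed` field and a hypothetical `choose_candidate` step. It does not cover model-side sampling or nondeterministic tool results, which are outside its scope.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint; rng_seed is recorded during the original run."""
    step: int
    rng_seed: int


def step_rng(cp: Checkpoint) -> random.Random:
    # Derive an independent generator per step so that replaying step N
    # does not depend on how many random draws earlier steps made.
    return random.Random(cp.rng_seed * 1_000_003 + cp.step)


def choose_candidate(cp: Checkpoint, candidates: List[str]) -> str:
    """A harness-side random choice that replays identically from a checkpoint."""
    return step_rng(cp).choice(candidates)


cp = Checkpoint(step=3, rng_seed=2024)
original = choose_candidate(cp, ["search", "fetch", "summarize"])
replayed = choose_candidate(cp, ["search", "fetch", "summarize"])
```

With the seed stored alongside the checkpoint, `original` and `replayed` are the same choice; without it, a replay would draw from a fresh generator and could take a different path.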
Evaluation hooks
- Reproduce a reported failure from its trace ID alone.
- Online evaluation fires on a canary harness change before full rollout.
- Fork from a checkpoint and compare the tool path under an alternate prompt.
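The fork-and-compare hook can be sketched as follows. The state already in the checkpoint is copied, not re-run, and only the remaining steps re-execute under the alternate prompt. `Checkpoint`, `fork_and_run`, `first_divergence`, and `toy_planner` are hypothetical names for this illustration, and the planner is a stand-in for whatever decides the next tool in a real harness.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical checkpoint: the prompt plus tools already executed."""
    prompt: str
    tool_calls: List[str] = field(default_factory=list)


# A planner returns the next tool name, or None when the run is finished.
Planner = Callable[[str, List[str]], Optional[str]]


def fork_and_run(cp: Checkpoint, planner: Planner,
                 prompt: Optional[str] = None, max_steps: int = 10) -> List[str]:
    """Copy checkpoint state, optionally swap the prompt, re-execute the rest."""
    forked = copy.deepcopy(cp)  # copied state: the original is never mutated
    if prompt is not None:
        forked.prompt = prompt
    for _ in range(max_steps):
        tool = planner(forked.prompt, forked.tool_calls)
        if tool is None:
            break
        forked.tool_calls.append(tool)  # re-executed portion of the run
    return forked.tool_calls


def first_divergence(a: List[str], b: List[str]) -> Optional[int]:
    """Index of the first differing tool call, or None if the paths match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


def toy_planner(prompt: str, history: List[str]) -> Optional[str]:
    # Stand-in planner: asking for citations adds a "fetch" step.
    plan = ["search", "fetch", "summarize"] if "cite" in prompt else ["search", "summarize"]
    return plan[len(history)] if len(history) < len(plan) else None


cp = Checkpoint(prompt="answer the question", tool_calls=["search"])
baseline = fork_and_run(cp, toy_planner)
variant = fork_and_run(cp, toy_planner, prompt="answer the question and cite sources")
diverged_at = first_divergence(baseline, variant)
```

In this toy run both forks share the copied `search` call and diverge at the second tool, which is the kind of comparison the hook is meant to surface.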
Reference notes
See LangChain runtime article (improvement loop and time travel figures).