Reliability: durable execution

Maps to: Reliability: durable execution (checklist).

Scope

Runs that span many model and tool steps must survive worker loss, deploys, and transient upstream failures without discarding completed work or duplicating side effects.

Design questions

  • What is the unit of checkpointing (graph super-step, tool call, custom boundary)?
  • Which tool calls are idempotent, compensatable, or strictly once-only?
  • How are leases, heartbeats, and stale-run recovery defined?
  • What retry policy applies per step versus per run?
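
To make the lease and retry questions concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the names `Lease`, `LeaseTable`, and `RetryBudget` are hypothetical, the clock is passed in explicitly as `now`, and a real system would keep leases in a durable store with atomic compare-and-set.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Lease:
    """Hypothetical lease record: a worker owns a run until the lease expires."""
    owner: str
    expires_at: float


@dataclass
class LeaseTable:
    """In-memory stand-in for a durable lease store (assumption)."""
    ttl_seconds: float
    leases: dict = field(default_factory=dict)  # run_id -> Lease

    def acquire(self, run_id: str, worker: str, now: float) -> bool:
        current = self.leases.get(run_id)
        # Claimable if unowned, already ours, or the owner stopped heartbeating.
        if current is None or current.owner == worker or current.expires_at <= now:
            self.leases[run_id] = Lease(worker, now + self.ttl_seconds)
            return True
        return False

    def heartbeat(self, run_id: str, worker: str, now: float) -> bool:
        current = self.leases.get(run_id)
        if current is None or current.owner != worker or current.expires_at <= now:
            return False  # lease lost: the worker should stop issuing side effects
        current.expires_at = now + self.ttl_seconds
        return True


@dataclass
class RetryBudget:
    """Per-step attempt cap combined with a retry budget shared by the whole run."""
    max_attempts_per_step: int
    max_retries_per_run: int
    run_retries_used: int = 0

    def call(self, step: Callable[[], object]) -> object:
        last_error = None
        for attempt in range(1, self.max_attempts_per_step + 1):
            try:
                return step()
            except Exception as exc:  # a real policy would filter retryable errors
                last_error = exc
                if attempt == self.max_attempts_per_step:
                    break  # step-level cap reached
                if self.run_retries_used >= self.max_retries_per_run:
                    break  # run-level budget exhausted
                self.run_retries_used += 1
        raise RuntimeError("retry budget exhausted") from last_error
```

In this sketch a stale run is simply one whose lease has passed `expires_at`; any worker may then claim it, and the previous owner learns it lost the lease on its next failed heartbeat. The shared `run_retries_used` counter is one way to stop a single flaky step from consuming unbounded retries across a long run.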

Tradeoffs

  • Finer checkpoints increase storage and latency; coarser checkpoints increase rework on resume.
  • At-least-once delivery of tool side effects may require outbox patterns or explicit idempotency keys.
  • Synchronous “wait until done” APIs hide durability requirements until the first long run.
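
The idempotency-key tradeoff can be sketched as follows. This is a hedged illustration, not a prescribed design: `idempotency_key` and `IdempotentTool` are hypothetical names, and the `results` dict stands in for whatever durable store or downstream deduplication the real tool provides.

```python
import hashlib
import json


def idempotency_key(run_id: str, step_index: int, tool: str, args: dict) -> str:
    """Derive a stable key from the logical call, not from the attempt number."""
    payload = json.dumps([run_id, step_index, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IdempotentTool:
    """Wraps a side-effecting function; a replay with the same key returns
    the recorded result instead of executing the function again."""

    def __init__(self, fn):
        self.fn = fn
        self.results = {}    # assumption: stand-in for a durable result store
        self.executions = 0  # counts real executions, useful in tests

    def call(self, key: str, **kwargs):
        if key in self.results:
            return self.results[key]
        result = self.fn(**kwargs)
        self.executions += 1
        self.results[key] = result
        return result


# Usage: a retried "send_email" step executes the side effect only once.
sent = []
tool = IdempotentTool(lambda to, body: sent.append((to, body)) or len(sent))
key = idempotency_key("run-1", 3, "send_email", {"to": "a@example.com", "body": "hi"})
first = tool.call(key, to="a@example.com", body="hi")
retry = tool.call(key, to="a@example.com", body="hi")
```

Note the gap this sketch leaves open: if the worker dies after `fn` runs but before the result is recorded, a retry will execute the side effect again. Closing that gap requires the downstream service to deduplicate on the key, or an outbox written in the same transaction as the checkpoint.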

Evaluation hooks

  • Inject worker kill mid-run; assert resume from last completed step without full restart.
  • Deploy new harness version during active run; define expected behavior (continue on old graph vs migrate).
  • Measure duplicate side effects under retry storms on flaky tools.
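
The first hook, killing a worker mid-run and asserting resume, might look like the sketch below. It assumes a step-level checkpoint held in a plain dict; `WorkerKilled`, `run_steps`, and the `kill_before` parameter are hypothetical fault-injection helpers, not part of any particular framework.

```python
class WorkerKilled(Exception):
    """Simulated worker loss (hypothetical fault-injection signal)."""


def run_steps(steps, checkpoint, executed, kill_before=None):
    """Run steps in order, starting from the checkpointed position.

    checkpoint: dict with "next_step" and "state"; stands in for a durable store.
    executed:   list recording every step index actually executed.
    kill_before: if set, raise WorkerKilled just before that step index.
    """
    for index in range(checkpoint["next_step"], len(steps)):
        if kill_before == index:
            raise WorkerKilled(f"killed before step {index}")
        checkpoint["state"] = steps[index](checkpoint["state"])
        executed.append(index)
        checkpoint["next_step"] = index + 1  # commit once the step has completed
    return checkpoint["state"]


def test_resume_after_kill():
    steps = [lambda s: s + 1, lambda s: s * 10, lambda s: s - 3]
    checkpoint = {"next_step": 0, "state": 0}
    executed = []

    # First worker dies before the third step.
    try:
        run_steps(steps, checkpoint, executed, kill_before=2)
    except WorkerKilled:
        pass
    assert executed == [0, 1]

    # A second worker resumes from the checkpoint without redoing steps 0 and 1.
    result = run_steps(steps, checkpoint, executed)
    assert executed == [0, 1, 2]
    assert result == 7
    return result
```

In this sketch the state and the step cursor are updated by two separate assignments; a durable implementation would write them atomically so that a kill between the two cannot leave the checkpoint inconsistent. The same `executed` log can be reused to count duplicate side effects under injected retries.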

Reference notes

See the LangChain runtime article (durable execution, crash recovery figure).