Operations

Status: Draft for discussion. Cross-cuts 00-outline, 01-production-requirements-runtime, 02-llm-app-design-concepts, and runtime capabilities.

Tracing and the improvement loop

Production agents choose their execution paths at runtime, so code review alone cannot predict failure modes. Tracing should capture model decisions, tool inputs and outputs, subagent boundaries, middleware actions, and cost. The default operating model for mature teams is a loop: trace, diagnose, change the harness or policy, redeploy, and evaluate online.

Design targets: searchable metadata (user, tenant, error class, latency, spend), retention aligned with compliance, and a direct link between trace IDs and user-reported incidents.
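To make the metadata targets concrete, here is a minimal sketch of a trace summary record and two lookups over it. The `TraceRecord` type, its field names, and the `incident_ids` link are hypothetical illustrations, not the schema of any tracing backend used in this workspace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TraceRecord:
    """Hypothetical trace summary; every field name here is an assumption."""
    trace_id: str
    user_id: str
    tenant_id: str
    latency_ms: float
    spend_usd: float
    error_class: Optional[str] = None
    # Incident tickets that reference this trace (assumed linkage).
    incident_ids: List[str] = field(default_factory=list)


def find_traces(records, **filters):
    """Return records whose attributes equal every given filter value."""
    return [
        r for r in records
        if all(getattr(r, key) == value for key, value in filters.items())
    ]


def traces_for_incident(records, incident_id):
    """Return the traces linked to a user-reported incident."""
    return [r for r in records if incident_id in r.incident_ids]
```

A real store would index these fields rather than scan a list; the point is only that each design target maps to a queryable attribute.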

Evaluation and A/B testing

Offline JSONL or scenario suites catch regressions before release. Online judges or custom scorers on sampled production traffic catch drift after release. Experiments should define primary metrics (task success, cost per success, time to resolution) and guardrail metrics (unsafe tool calls, PII leakage, human escalation rate).
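As a sketch of how primary and guardrail metrics might be computed from a JSONL results file, the function below assumes one JSON object per line with the hypothetical keys `success`, `cost_usd`, `unsafe_tool_calls`, and `escalated`. The record shape is an assumption for illustration, not the format of any existing harness.

```python
import json


def summarize(jsonl_text):
    """Aggregate primary and guardrail metrics from JSONL run records."""
    runs = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    total = len(runs)
    successes = sum(1 for r in runs if r["success"])
    total_cost = sum(r["cost_usd"] for r in runs)
    unsafe_runs = sum(1 for r in runs if r["unsafe_tool_calls"] > 0)
    escalations = sum(1 for r in runs if r["escalated"])
    return {
        # Primary metrics
        "task_success_rate": successes / total if total else 0.0,
        "cost_per_success": total_cost / successes if successes else float("inf"),
        # Guardrail metrics
        "unsafe_run_rate": unsafe_runs / total if total else 0.0,
        "escalation_rate": escalations / total if total else 0.0,
    }


def passes_guardrails(summary, limits):
    """True if every guardrail metric named in `limits` is at or below its limit."""
    return all(summary[name] <= limit for name, limit in limits.items())
```

Keeping guardrails as a separate pass/fail check means an experiment that improves a primary metric can still be rejected when it raises the unsafe-call or escalation rate.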

Open question: whether eval harnesses in this workspace (for example stn-agent-eval) own runtime SLO tests, harness quality tests, or both; see Transmute-Data fit.

Streaming and client contracts

Operations owns SLOs for time-to-first-token, stream stall detection, and reconnect behavior documented in Real-time interaction. Runbooks should cover client bugs that look like agent failures, such as a dropped SSE connection or a missing Last-Event-ID header on reconnect.
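Time-to-first-token and stall detection can both be derived from event timestamps. The sketch below is one possible shape; the `StreamMonitor` class and its 10-second default threshold are assumptions, not values taken from Real-time interaction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamMonitor:
    """Tracks event timing for one stream; timestamps are seconds."""
    started_at: float
    stall_after_s: float = 10.0  # assumed threshold, tune per SLO
    first_token_at: Optional[float] = None
    last_event_at: Optional[float] = None

    def on_event(self, now):
        """Record that a stream event (token or heartbeat) arrived at `now`."""
        if self.first_token_at is None:
            self.first_token_at = now
        self.last_event_at = now

    def time_to_first_token(self):
        """Seconds from start to first event, or None if nothing arrived yet."""
        if self.first_token_at is None:
            return None
        return self.first_token_at - self.started_at

    def is_stalled(self, now):
        """True if no event has arrived within the stall window."""
        reference = self.last_event_at
        if reference is None:
            reference = self.started_at
        return now - reference > self.stall_after_s
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the logic deterministic and easy to test against recorded traces.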

Disaster recovery and versioning

Long-running work needs an explicit deploy policy for in-flight runs: finish on the prior graph version, migrate checkpoints, or cancel with a user-visible status. Back up checkpoint and store data, and test the restore path. Version harness artifacts (prompts, skills, tool manifests) so traces remain interpretable after a change.
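The three deploy options and artifact versioning can be sketched as follows. The selection rule in `choose_policy` (drain short runs, migrate when a migration exists, otherwise cancel) and the content-hash scheme in `artifact_version` are illustrative assumptions, not decisions this document has made.

```python
import hashlib
import json
from enum import Enum


class DeployPolicy(Enum):
    FINISH_ON_PRIOR = "finish on prior graph version"
    MIGRATE = "migrate checkpoint"
    CANCEL = "cancel with user-visible status"


def choose_policy(remaining_steps, drain_limit_steps, has_migration):
    """Pick a policy for one in-flight run at deploy time (assumed rule)."""
    if remaining_steps <= drain_limit_steps:
        # Short runs are cheap to drain on the old version.
        return DeployPolicy.FINISH_ON_PRIOR
    if has_migration:
        return DeployPolicy.MIGRATE
    return DeployPolicy.CANCEL


def artifact_version(artifacts):
    """Stable short hash over harness artifacts, for stamping onto traces.

    `artifacts` is a JSON-serializable dict, e.g. prompt text and tool manifests.
    """
    canonical = json.dumps(artifacts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Recording the artifact hash on every trace lets a later reader tell which prompt and tool manifest produced a given run, even after those artifacts have changed.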

Cost and capacity

Monitor tokens, tool call volume, sandbox minutes, and scheduled job fan-out. Pair alerts on budget-burn anomalies with harness limits (tool call ceilings, model routing) so that a runaway agent is both detected and contained. Capacity planning includes queue depth and worker count for durable execution backlogs.
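A minimal sketch of the alert-plus-limit pairing: a burn-rate check over hourly spend and a per-run tool call ceiling. The function and class names, the 3x factor, and the minimum-history rule are assumptions chosen for illustration.

```python
from statistics import mean


def burn_rate_anomaly(hourly_spend_usd, factor=3.0, min_history=3):
    """True if the latest hour exceeds `factor` times the mean of earlier hours.

    Returns False until at least `min_history` earlier hours are available.
    """
    if len(hourly_spend_usd) <= min_history:
        return False
    *history, latest = hourly_spend_usd
    baseline = mean(history)
    return baseline > 0 and latest > factor * baseline


class ToolCallCeiling:
    """Hard per-run limit on tool calls, enforced inside the harness."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.calls = 0

    def allow(self):
        """Count and permit a call, or refuse once the ceiling is reached."""
        if self.calls >= self.max_calls:
            return False
        self.calls += 1
        return True
```

The alert tells operators that spend is drifting; the ceiling bounds the damage a single run can do before anyone responds.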