Harness versus runtime
Status: Draft for discussion. Terminology is intentionally neutral; vendor-specific names appear only in mapping notes.
Definitions
Harness is everything you shape around the model so the agent can succeed in a domain: system and developer prompts, tool schemas, skills, routing rules, and the reason-act-observe loop that turns user intent into completed work. The harness changes when you improve instructions, add tools, or tune how the model plans and recovers from mistakes.
Runtime is everything that keeps that loop alive in production without reimplementing platform mechanics in application code: durable execution, memory stores, authentication and isolation, streaming, observability, sandboxes, integration endpoints, and schedulers. The runtime changes when you harden reliability, tenancy, or operations, not when you rewrite a single prompt.
Product surface (chat UI, APIs, webhooks, cron triggers) sits above both: it submits work, displays partial output, and receives human decisions. A thin client is not a substitute for runtime guarantees when runs are long, concurrent, or failure-prone.
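One way to keep the two definitions honest is to hold harness and runtime settings in separate objects, so a change can be attributed to one layer or the other. The following is a minimal sketch under that assumption; `HarnessConfig`, `RuntimeConfig`, `changed_layers`, and every field name are hypothetical and do not come from any particular framework.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical: changes when instructions, tools, or planning change."""
    system_prompt: str
    tool_names: Tuple[str, ...]
    max_steps: int = 8


@dataclass(frozen=True)
class RuntimeConfig:
    """Hypothetical: changes when reliability, tenancy, or operations change."""
    checkpoint_every_step: bool = True
    stream_tokens: bool = True
    tenant_isolation: str = "per-user"


def changed_layers(
    old_harness: HarnessConfig,
    new_harness: HarnessConfig,
    old_runtime: RuntimeConfig,
    new_runtime: RuntimeConfig,
) -> List[str]:
    """Report which layer a proposed change touches."""
    layers = []
    if old_harness != new_harness:
        layers.append("harness")
    if old_runtime != new_runtime:
        layers.append("runtime")
    return layers


base_harness = HarnessConfig(
    system_prompt="You are a support agent.",
    tool_names=("search_docs",),
)
base_runtime = RuntimeConfig()

# Rewriting a prompt is a harness change only.
prompt_edit = replace(base_harness, system_prompt="You are a billing agent.")
# Tightening isolation is a runtime change only.
isolation_edit = replace(base_runtime, tenant_isolation="per-org")
```

A check like `changed_layers` could run in review tooling to flag changes that touch both layers at once, which is often a sign that the boundary has been blurred.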
Why the split matters
Prototype agents often collapse harness and runtime into one process. That works until runs outlast a single HTTP request, users send overlapping messages, deploys interrupt in-flight work, or multiple tenants share one deployment. The production requirements listed in "Production requirements and runtime capabilities" mostly land on the runtime side; quality and domain fit mostly land on the harness side.
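The "deploys interrupt in-flight work" case can be illustrated with a small checkpointing sketch. It assumes an in-memory store and a run expressed as an ordered list of steps; `CheckpointStore` and `run_steps` are hypothetical names, and a real runtime would persist to durable storage rather than a dict.

```python
from typing import Callable, Dict, List

# A step receives the results of all earlier steps and returns its own result.
Step = Callable[[List[str]], str]


class CheckpointStore:
    """Hypothetical in-memory stand-in for durable checkpoint storage."""

    def __init__(self) -> None:
        self._saved: Dict[str, List[str]] = {}

    def load(self, run_id: str) -> List[str]:
        return list(self._saved.get(run_id, []))

    def save(self, run_id: str, results: List[str]) -> None:
        self._saved[run_id] = list(results)


def run_steps(run_id: str, steps: List[Step], store: CheckpointStore) -> List[str]:
    """Run steps in order, saving after each one and skipping completed steps on resume."""
    results = store.load(run_id)
    for step in steps[len(results):]:
        results.append(step(results))
        # Checkpoint only after the step succeeded, so a crash re-runs at most one step.
        store.save(run_id, results)
    return results
```

If a process dies during the third step, calling `run_steps` again with the same `run_id` resumes at the third step instead of repeating the first two. None of this logic lives in a prompt, which is the point of the split.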
Confusing the two leads to predictable failures: putting retry and checkpoint logic only in prompts; encoding authorization in tool descriptions instead of request context; or treating trace storage as optional logging rather than the feedback loop for harness changes.
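The authorization failure mode is easy to show in code. In the sketch below, the model supplies only a thread identifier, while the caller's identity arrives in a request context that the runtime injects; the tool description is never trusted to enforce access. `RequestContext`, `ThreadStore`, and `read_thread_tool` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical: set by the runtime from the authenticated request, never by the model."""
    user_id: str


class ThreadStore:
    """Hypothetical store that scopes every read to the caller's own threads."""

    def __init__(self) -> None:
        # owner user_id -> thread_id -> messages
        self._threads: Dict[str, Dict[str, List[str]]] = {}

    def add(self, owner: str, thread_id: str, message: str) -> None:
        self._threads.setdefault(owner, {}).setdefault(thread_id, []).append(message)

    def read(self, ctx: RequestContext, thread_id: str) -> List[str]:
        owned = self._threads.get(ctx.user_id, {})
        if thread_id not in owned:
            # Same error whether the thread is missing or belongs to someone else.
            raise PermissionError(f"thread {thread_id!r} is not available to this user")
        return list(owned[thread_id])


def read_thread_tool(store: ThreadStore, ctx: RequestContext, thread_id: str) -> List[str]:
    """Tool body: thread_id comes from the model, ctx comes from the runtime."""
    return store.read(ctx, thread_id)
```

With this shape, a prompt injection that persuades the model to request another user's thread still fails, because the check does not depend on anything the model can write.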
Mapping to the production checklist
| Concern | Primary owner | Runtime capability (see deep dives) |
|---|---|---|
| Correct tool choice and task decomposition | Harness | n/a |
| Surviving crash or deploy mid-run | Runtime | Reliability |
| Remembering this thread versus past conversations | Runtime (with harness read/write patterns) | Memory |
| PII redaction and spend ceilings | Harness policy expressed in runtime hooks | Guardrails |
| User A cannot read user B’s threads | Runtime | Multi-tenancy |
| Approve email before send | Harness flow + runtime pause/resume | Human oversight |
| Token streaming and overlapping user messages | Runtime (+ UI contract) | Real-time interaction |
| Debug a bad tool loop in production | Runtime traces + harness iteration | Observability |
| Run shell commands safely | Runtime isolation + harness tool visibility | Code execution |
| Connect to GitHub, Slack, other agents | Runtime protocols + harness tool wiring | Integrations |
| Nightly research or alert sweeps | Runtime scheduler + harness job definition | Scheduled jobs |
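Rows with shared ownership, such as "Approve email before send", are the easiest to get wrong. A rough sketch of that row follows: the harness decides that sending needs approval, and the runtime holds the pending action until a human decision arrives. `ApprovalRuntime`, `PendingAction`, and their methods are hypothetical, and the in-memory dict stands in for durable storage.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PendingAction:
    """Hypothetical record of a side effect waiting for a human decision."""
    run_id: str
    tool: str
    payload: str


class ApprovalRuntime:
    """Hypothetical runtime piece: parks an action, then resumes or rejects it."""

    def __init__(self) -> None:
        self._pending: Dict[str, PendingAction] = {}
        self.sent: List[str] = []  # stands in for the real email side effect

    def request_send_email(self, run_id: str, body: str) -> str:
        # Called from the harness flow: nothing is sent yet, the run is paused.
        self._pending[run_id] = PendingAction(run_id, "send_email", body)
        return "paused"

    def resume(self, run_id: str, approved: bool) -> str:
        # Called from the product surface once a human has decided.
        action = self._pending.pop(run_id)
        if not approved:
            return "rejected"
        self.sent.append(action.payload)
        return "sent"
```

The harness owns the rule that email requires approval; the runtime owns the guarantee that a paused run survives until `resume` is called, possibly hours later and on a different process.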
Reference alignment
The LangChain article "The runtime behind production deep agents" uses the same harness/runtime split and maps production requirements to LangSmith Deployment and Agent Server primitives. Treat that article as one concrete vendor articulation, not as the only valid implementation.
Open design questions
- Whether “platform” should subsume both harness packaging (config, skills, MCP manifests) and runtime hosting in one mental model.
- Where eval harnesses and offline suites live: harness quality gates, runtime regression tests, or both.
- How much harness logic must be replayable from checkpoints versus re-executed on resume.