Harness versus runtime¶

Status: Draft for discussion. Terminology is intentionally neutral; vendor-specific names appear only in mapping notes.

Definitions¶

Harness is everything you shape around the model so the agent can succeed in a domain: system and developer prompts, tool schemas, skills, routing rules, and the reason-act-observe loop that turns user intent into completed work. The harness changes when you improve instructions, add tools, or tune how the model plans and recovers from mistakes.

Runtime is everything that keeps that loop alive in production without reimplementing platform mechanics in application code: durable execution, memory stores, authentication and isolation, streaming, observability, sandboxes, integration endpoints, and schedulers. The runtime changes when you harden reliability, tenancy, or operations, not when you rewrite a single prompt.

Product surface (chat UI, APIs, webhooks, cron triggers) sits above both: it submits work, displays partial output, and receives human decisions. A thin client is not a substitute for runtime guarantees when runs are long, concurrent, or failure-prone.

Why the split matters¶

Prototype agents often collapse harness and runtime into one process. That works until runs exceed a single HTTP request, users overlap messages, deploys interrupt in-flight work, or multiple tenants share one deployment. Production requirements in Production requirements and runtime capabilities mostly land on the runtime side; quality and domain fit mostly land on the harness side.

Confusing the two leads to predictable failures: putting retry and checkpoint logic only in prompts; encoding authorization in tool descriptions instead of request context; or treating trace storage as optional logging rather than the feedback loop for harness changes.

Mapping to the production checklist¶

Concern	Primary owner	Runtime capability (see deep dives)
Correct tool choice and task decomposition	Harness	n/a
Surviving crash or deploy mid-run	Runtime	Reliability
Remembering this thread versus past conversations	Runtime (with harness read/write patterns)	Memory
PII redaction and spend ceilings	Harness policy expressed in runtime hooks	Guardrails
User A cannot read user B’s threads	Runtime	Multi-tenancy
Approve email before send	Harness flow + runtime pause/resume	Human oversight
Token streaming and overlapping user messages	Runtime (+ UI contract)	Real-time interaction
Debug a bad tool loop in production	Runtime traces + harness iteration	Observability
Run shell commands safely	Runtime isolation + harness tool visibility	Code execution
Connect to GitHub, Slack, other agents	Runtime protocols + harness tool wiring	Integrations
Nightly research or alert sweeps	Runtime scheduler + harness job definition	Scheduled jobs

Reference alignment¶

The LangChain article The runtime behind production deep agents uses the same harness/runtime split and maps production requirements to LangSmith Deployment and Agent Server primitives. Treat that document as a concrete vendor articulation, not as the only valid implementation.

Open design questions¶

Whether “platform” should subsume both harness packaging (config, skills, MCP manifests) and runtime hosting in one mental model.
Where eval harnesses and offline suites live: harness quality gates, runtime regression tests, or both.
How much harness logic must be replayable from checkpoints versus re-executed on resume.