Harness versus runtime

Status: Draft for discussion. Terminology is intentionally neutral; vendor-specific names appear only in mapping notes.

Definitions

Harness is everything you shape around the model so the agent can succeed in a domain: system and developer prompts, tool schemas, skills, routing rules, and the reason-act-observe loop that turns user intent into completed work. The harness changes when you improve instructions, add tools, or tune how the model plans and recovers from mistakes.
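As a rough illustration of what sits inside the harness, the sketch below bundles a system prompt, a tool table, and a step budget, and drives a reason-act-observe loop. All names here (`Harness`, `run_loop`, `scripted_model`, the `add` tool) are hypothetical, and the "model" is a scripted stand-in rather than a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Harness:
    """Everything shaped around the model: prompt, tools, and loop limits."""
    system_prompt: str
    tools: dict[str, Callable[[str], str]]
    max_steps: int = 5

def run_loop(harness: Harness,
             model: Callable[[list[str]], tuple[str, str]],
             task: str) -> str:
    """Reason-act-observe: the model picks an action, the harness runs the
    matching tool, and the observation is appended to the transcript."""
    transcript = [harness.system_prompt, f"user: {task}"]
    for _ in range(harness.max_steps):
        action, arg = model(transcript)                       # reason
        if action == "finish":
            return arg
        tool = harness.tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"  # act
        transcript.append(f"{action}({arg}) -> {observation}")          # observe
    return "step budget exhausted"

def scripted_model(transcript: list[str]) -> tuple[str, str]:
    """Hypothetical stand-in for a model call: add once, then finish."""
    if not any("add(" in line for line in transcript):
        return ("add", "2,3")
    return ("finish", transcript[-1].split("-> ")[1])

def add(arg: str) -> str:
    a, b = arg.split(",")
    return str(int(a) + int(b))

harness = Harness(system_prompt="You are a calculator agent.", tools={"add": add})
result = run_loop(harness, scripted_model, "what is 2 + 3?")
```

Changing the prompt, the tool table, or the step budget is a harness change; nothing in this sketch addresses what happens if the process dies mid-loop, which is the runtime's concern.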

Runtime is everything that keeps that loop alive in production without reimplementing platform mechanics in application code: durable execution, memory stores, authentication and isolation, streaming, observability, sandboxes, integration endpoints, and schedulers. The runtime changes when you harden reliability, tenancy, or operations, not when you rewrite a single prompt.
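To make "durable execution" concrete, here is a minimal sketch of one possible approach: checkpoint after every step so a restarted process resumes at the first unfinished step. `CheckpointStore` and `run_durably` are hypothetical names, and the JSON file stands in for whatever database a real runtime would use.

```python
import json
import tempfile
from pathlib import Path

class CheckpointStore:
    """Hypothetical file-backed checkpoint store (assumption: one file per run)."""
    def __init__(self, path: Path):
        self.path = path

    def load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"next_step": 0, "outputs": []}

    def save(self, state: dict) -> None:
        # Write to a temp file, then rename, so a crash never leaves a half-written file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(self.path)

def run_durably(steps, store: CheckpointStore) -> list:
    """Run steps in order, skipping any that a previous attempt already completed."""
    state = store.load()
    for i in range(state["next_step"], len(steps)):
        state["outputs"].append(steps[i]())
        state["next_step"] = i + 1
        store.save(state)
    return state["outputs"]

# Simulate a crash in the second step, then a restart.
calls = []
crash = {"armed": True}

def step_a():
    calls.append("a")
    return "A"

def step_b():
    if crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("simulated crash")
    calls.append("b")
    return "B"

store = CheckpointStore(Path(tempfile.mkdtemp()) / "run.json")
try:
    run_durably([step_a, step_b], store)
except RuntimeError:
    pass  # the process "dies" here
outputs = run_durably([step_a, step_b], store)  # restart resumes at step_b
```

The steps themselves know nothing about checkpoints; the retry and resume behavior lives entirely in the wrapper, which is the point of placing it on the runtime side.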

Product surface (chat UI, APIs, webhooks, cron triggers) sits above both: it submits work, displays partial output, and receives human decisions. A thin client is not a substitute for runtime guarantees when runs are long, concurrent, or failure-prone.

Why the split matters

Prototype agents often collapse harness and runtime into one process. That works until runs exceed a single HTTP request, users send overlapping messages, deploys interrupt in-flight work, or multiple tenants share one deployment. The requirements listed in "Production requirements and runtime capabilities" mostly land on the runtime side; quality and domain fit mostly land on the harness side.

Confusing the two leads to predictable failures: putting retry and checkpoint logic only in prompts; encoding authorization in tool descriptions instead of request context; or treating trace storage as optional logging rather than the feedback loop for harness changes.
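The authorization case can be sketched as follows, under the assumption that the runtime builds a request context from the authenticated caller and binds it to tools before the harness sees them. `RequestContext`, `THREADS`, `read_thread`, and `bind_tools` are all illustrative names, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class RequestContext:
    """Set by the runtime from the authenticated request, never by the model."""
    user_id: str

# Hypothetical in-memory thread store keyed by thread id.
THREADS = {
    "t1": {"owner": "alice", "text": "alice's notes"},
    "t2": {"owner": "bob", "text": "bob's notes"},
}

def read_thread(ctx: RequestContext, thread_id: str) -> str:
    thread = THREADS.get(thread_id)
    # Same error for "missing" and "not yours" so thread ids are not leaked.
    if thread is None or thread["owner"] != ctx.user_id:
        raise PermissionError(f"thread {thread_id} not accessible")
    return thread["text"]

def bind_tools(ctx: RequestContext) -> dict[str, Callable[[str], str]]:
    """Expose tools to the harness with the context already bound.
    The model only ever supplies thread_id."""
    return {"read_thread": lambda thread_id: read_thread(ctx, thread_id)}

alice_tools = bind_tools(RequestContext(user_id="alice"))
own = alice_tools["read_thread"]("t1")
try:
    alice_tools["read_thread"]("t2")
    denied = False
except PermissionError:
    denied = True
```

Because the check runs in code against a context the model cannot set, no wording in the tool description can widen or bypass it.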

Mapping to the production checklist

| Concern | Primary owner | Runtime capability (see deep dives) |
|---|---|---|
| Correct tool choice and task decomposition | Harness | n/a |
| Surviving crash or deploy mid-run | Runtime | Reliability |
| Remembering this thread versus past conversations | Runtime (with harness read/write patterns) | Memory |
| PII redaction and spend ceilings | Harness policy expressed in runtime hooks | Guardrails |
| User A cannot read user B’s threads | Runtime | Multi-tenancy |
| Approve email before send | Harness flow + runtime pause/resume | Human oversight |
| Token streaming and overlapping user messages | Runtime (+ UI contract) | Real-time interaction |
| Debug a bad tool loop in production | Runtime traces + harness iteration | Observability |
| Run shell commands safely | Runtime isolation + harness tool visibility | Code execution |
| Connect to GitHub, Slack, other agents | Runtime protocols + harness tool wiring | Integrations |
| Nightly research or alert sweeps | Runtime scheduler + harness job definition | Scheduled jobs |
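Rows with a shared owner are the easiest to blur. As one illustration, the "approve email before send" row might split as sketched below: the harness flow decides that sending requires a decision, and the runtime parks the run and resumes it later. `Run`, `request_approval`, and `resume` are hypothetical names; a real runtime would persist the parked run rather than hold it in memory.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    draft: str
    status: str = "running"  # running | awaiting_approval | done | rejected
    sent: list[str] = field(default_factory=list)

def request_approval(run: Run) -> Run:
    """Harness flow reached the send step; the runtime pauses instead of sending."""
    run.status = "awaiting_approval"
    return run

def resume(run: Run, approved: bool) -> Run:
    """Runtime resumes the parked run once a human decision arrives."""
    if run.status != "awaiting_approval":
        raise ValueError("run is not waiting for a decision")
    if approved:
        run.sent.append(run.draft)  # stand-in for the real send
        run.status = "done"
    else:
        run.status = "rejected"
    return run

run = request_approval(Run(draft="Hello, team"))
paused_status = run.status
run = resume(run, approved=True)
```

The pause may last minutes or days, which is why the waiting state belongs to the runtime rather than to a prompt or an open HTTP request.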

Reference alignment

The LangChain article "The runtime behind production deep agents" uses the same harness/runtime split and maps production requirements to LangSmith Deployment and Agent Server primitives. Treat that article as one concrete vendor articulation, not as the only valid implementation.

Open design questions

  • Whether “platform” should subsume both harness packaging (config, skills, MCP manifests) and runtime hosting in one mental model.
  • Where eval harnesses and offline suites live: harness quality gates, runtime regression tests, or both.
  • How much harness logic must be replayable from checkpoints versus re-executed on resume.