ESSAY

Agentic harness engineering.

The substrate nobody writes about. Why your agent project keeps eating quarters.

This is for engineering and product teams that have already shipped at least one agent to production and hit the wall: context overflows, no audit trail, retries causing duplicate actions. If you're still in the prototype stage, the five-layers section is still useful, but the antipatterns section is where this essay will hit hardest.

What a harness actually is.

An agent is a model with instructions, tools, and a task. A harness is the infrastructure that manages the agent's execution environment — its context, its memory, its tool access, its planning loop, its evaluation mechanism, and its transport layer.

Think of the harness as the operating system for the agent. The agent is the application. Without an operating system, the application can't run reliably at scale. Without a harness, the agent can run a demo but can't be trusted with anything that matters.

The distinction matters because most teams conflate the two. They build a clever agent prompt, wire it to a few tool calls, and call it done. Then they hit production and discover that context windows overflow, the agent doesn't remember what it did yesterday, there's no audit trail, retries corrupt state, and adding a new tool requires a rewrite.

These are not agent problems. They are harness problems.

The five layers.

Every production agent system needs all five of these layers. You can build them in any order. You cannot skip them.

01

Tool Registry

A catalog of every tool the agent can call, with schemas, permissions, and rate limits.

Without a registry, tools are hardcoded into the agent prompt. Adding a new tool means touching the agent. Removing a deprecated tool means a prompt rewrite. At scale, this breaks. The registry decouples the agent from its tool surface.

Signs you don't have this yet

  • Agent prompts are getting long because tool definitions are inline
  • Adding a new API requires changing the system prompt
  • You have no idea which tools are actually being called in production
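
A minimal sketch of what such a registry could look like, assuming an in-process Python harness. Every name here (`ToolSpec`, `ToolRegistry`, `lookup_order`, the role strings) is hypothetical, and the argument check is a stand-in for real schema validation.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolSpec:
    """One registry entry. All field names are illustrative assumptions."""
    name: str
    handler: Callable[..., Any]
    required_args: dict[str, type]     # minimal stand-in for a real schema
    allowed_roles: set[str]            # which agent roles may call the tool
    max_calls_per_minute: int
    _calls: list[float] = field(default_factory=list, repr=False)


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def describe(self, role: str) -> list[str]:
        """Tool names visible to a role; render these into the prompt at run time."""
        return sorted(n for n, s in self._tools.items() if role in s.allowed_roles)

    def call(self, role: str, name: str, **kwargs: Any) -> Any:
        spec = self._tools[name]
        if role not in spec.allowed_roles:
            raise PermissionError(f"{role} may not call {name}")
        for arg, typ in spec.required_args.items():
            if not isinstance(kwargs.get(arg), typ):
                raise TypeError(f"{name}: argument {arg!r} must be {typ.__name__}")
        # Sliding one-minute window for the rate limit.
        now = time.monotonic()
        spec._calls[:] = [t for t in spec._calls if now - t < 60.0]
        if len(spec._calls) >= spec.max_calls_per_minute:
            raise RuntimeError(f"{name}: rate limit exceeded")
        spec._calls.append(now)
        return spec.handler(**kwargs)


registry = ToolRegistry()
registry.register(ToolSpec(
    name="lookup_order",
    handler=lambda order_id: {"order_id": order_id, "status": "shipped"},
    required_args={"order_id": str},
    allowed_roles={"support_agent"},
    max_calls_per_minute=2,
))
```

Because the agent only sees what `describe` returns for its role, adding or retiring a tool becomes a registry change rather than a prompt rewrite, and every call passes through one place where it can be counted.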

02

Memory

Persistent storage of context across sessions — not just the current conversation window.

LLM context windows are expensive and temporary. An agent that can't remember what it did last Tuesday is not a production system — it's a chatbot. Memory includes episodic (what happened), semantic (what's known), and procedural (how to do things) layers. Most teams build only episodic memory and wonder why their agent keeps re-learning the same things.

Signs you don't have this yet

  • Users have to re-explain context in every session
  • The agent produces conflicting results because it doesn't know its own prior decisions
  • Context window costs are high because everything is in-window instead of retrieved
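
To make the three layers concrete, here is a rough sketch of a memory interface. It is an in-memory stand-in under my own assumptions: the class name, method names, and example facts are all hypothetical, and a real system would put a database behind each layer.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """In-memory stand-in; names and structure are illustrative assumptions."""
    episodic: list[dict] = field(default_factory=list)               # what happened
    semantic: dict[str, str] = field(default_factory=dict)           # what is known
    procedural: dict[str, list[str]] = field(default_factory=dict)   # how to do things

    def record_event(self, session: str, action: str, outcome: str) -> None:
        self.episodic.append({"session": session, "action": action, "outcome": outcome})

    def learn_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

    def learn_procedure(self, task: str, steps: list[str]) -> None:
        self.procedural[task] = steps

    def recall(self, task: str, fact_keys: list[str], last_n: int = 3) -> dict:
        """Assemble only what the next prompt needs instead of the full history."""
        return {
            "recent_events": self.episodic[-last_n:],
            "facts": {k: self.semantic[k] for k in fact_keys if k in self.semantic},
            "procedure": self.procedural.get(task, []),
        }


memory = AgentMemory()
memory.record_event("tuesday", "refund_order", "approved")
memory.learn_fact("refund_limit", "USD 200 without manager approval")
memory.learn_procedure("refund", ["verify order", "check limit", "issue refund"])
context = memory.recall("refund", ["refund_limit"])
```

The point of `recall` is that a new session starts from retrieved events, facts, and procedures rather than from whatever happens to fit in the window.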

03

Planner

The loop that decides what to do next, given the current state and the goal.

Simple tool-call chains are not a planner. A planner is the component that can decompose a complex goal into subtasks, handle failure, re-route when a tool returns unexpected results, and terminate when the goal is achieved. Without a planner, your agent is a one-shot responder wearing a tool-call costume.

Signs you don't have this yet

  • The agent fails on multi-step tasks that require state tracking
  • Failures are silent — the agent returns a wrong answer instead of reporting an error
  • The agent can't recover from a failed tool call without human intervention
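
The control-flow half of a planner can be sketched in a few lines. This assumes the goal has already been decomposed into steps (the decomposition itself, usually a model call, is left out), and `Step`, `run_plan`, and the example steps are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]                          # takes state, returns updates
    fallback: Optional[Callable[[dict], dict]] = None    # alternate route on failure


def run_plan(steps: list[Step], state: dict, max_attempts: int = 2) -> dict:
    """Execute steps in order; return an explicit status instead of failing silently."""
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                state.update(step.run(state))
                break
            except Exception as exc:
                state.setdefault("errors", []).append(f"{step.name}#{attempt}: {exc}")
        else:
            # Every attempt failed: re-route if a fallback exists, else stop and report.
            if step.fallback is None:
                return {"status": "failed", "failed_step": step.name, **state}
            state.update(step.fallback(state))
    return {"status": "done", **state}


def flaky_fetch(state: dict) -> dict:
    raise TimeoutError("primary API timed out")


plan = [
    Step("fetch", flaky_fetch, fallback=lambda s: {"data": "cached copy"}),
    Step("summarize", lambda s: {"summary": s["data"].upper()}),
]
result = run_plan(plan, {"goal": "summarize order history"})
```

Here the fetch step fails twice, the planner re-routes to the fallback, and the run still terminates with a status and an error list a human can read. A step with no fallback ends the run with `"failed"` rather than a wrong answer.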

04

Evaluator

The mechanism that assesses whether the agent's output is good enough to use.

Shipping an agent without an evaluator is shipping a system you can't measure. The evaluator runs after each agent action (or at the end of a sequence) and scores the output against defined criteria. This is what allows you to catch regressions, route to a better model when quality drops, and prove to stakeholders that the system is working.

Signs you don't have this yet

  • You have no way to know if the agent's quality has regressed after a model update
  • There are no quality metrics in your monitoring dashboard
  • You evaluate agent output by reading it yourself
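
As an illustration, an evaluator can start as a weighted checklist. The criteria, weights, and threshold below are made up for the example; in practice the checks might be rule-based, model-graded, or both.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]
    weight: float


def evaluate(output: str, criteria: list[Criterion], threshold: float = 0.8) -> dict:
    """Weighted pass rate over the criteria, plus an accept/reject verdict."""
    results = {c.name: c.check(output) for c in criteria}
    total = sum(c.weight for c in criteria)
    score = sum(c.weight for c in criteria if results[c.name]) / total
    return {"score": score, "accepted": score >= threshold, "results": results}


# Hypothetical criteria for a support reply about order A1.
criteria = [
    Criterion("non_empty", lambda out: bool(out.strip()), weight=1.0),
    Criterion("concise", lambda out: len(out) <= 200, weight=1.0),
    Criterion("mentions_order", lambda out: "A1" in out, weight=2.0),
]
good = evaluate("Order A1 has shipped.", criteria)
bad = evaluate("It has shipped.", criteria)
```

Logging the `score` for every run gives you the quality metric for the dashboard, and the per-criterion `results` tell you which check regressed after a model update.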

05

Transport

The layer that moves messages, events, and artifacts between the agent, the tools, the memory system, and the human interface.

Transport is often the last thing teams think about and the first thing that breaks in production. It includes the queue that prevents duplicate executions, the delivery guarantees that ensure each action takes effect exactly once (in practice, at-least-once delivery combined with idempotent handlers), the event log that makes the agent's behavior auditable, and the human-in-the-loop channel that lets a person approve or override.

Signs you don't have this yet

  • You have no audit trail of what the agent did and when
  • Retries cause duplicate actions
  • There's no mechanism to pause the agent and ask a human before proceeding
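
The three symptoms above map onto three small mechanisms: an idempotency key, an append-only event log, and an approval gate. The sketch below shows them together in memory. The `Transport` class, the action names, and the caller-supplied keys are all assumptions, and a production version would persist the result and the log entry atomically.

```python
import time
from typing import Any, Callable


class Transport:
    def __init__(self, needs_approval: set[str]) -> None:
        self.needs_approval = needs_approval
        self.event_log: list[dict] = []       # append-only audit trail
        self._results: dict[str, Any] = {}    # idempotency key -> stored result

    def _log(self, kind: str, key: str, action: str) -> None:
        self.event_log.append({"ts": time.time(), "kind": kind, "key": key, "action": action})

    def execute(self, key: str, action: str, payload: dict,
                handler: Callable[[dict], Any], approved: bool = False) -> dict:
        if key in self._results:              # retry of work that already completed
            self._log("duplicate_skipped", key, action)
            return {"status": "done", "result": self._results[key]}
        if action in self.needs_approval and not approved:
            self._log("awaiting_approval", key, action)
            return {"status": "pending_approval", "result": None}
        result = handler(payload)
        self._results[key] = result
        self._log("executed", key, action)
        return {"status": "done", "result": result}


sent: list[dict] = []
transport = Transport(needs_approval={"refund"})
email = {"to": "a@example.com"}
transport.execute("msg-1", "send_email", email, sent.append)
transport.execute("msg-1", "send_email", email, sent.append)  # retry: handler is not re-run
pending = transport.execute("ref-1", "refund", {"order": "A1"}, lambda p: "refunded")
approved = transport.execute("ref-1", "refund", {"order": "A1"}, lambda p: "refunded",
                             approved=True)
```

The retry with the same key returns the stored result instead of sending a second email, the refund waits until a person approves it, and `event_log` records all four events in order.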

Common antipatterns.

These are the patterns I see repeatedly across agent builds. Each one is a debt that compounds.

Replay-less loops

The agent has no record of what it did in a previous run. If something goes wrong, you can't replay the execution, diagnose the failure point, or resume from where it stopped. Every failure is a full restart. This is the single most expensive antipattern in production agent systems.
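
One way to get resumability, sketched under simple assumptions: journal each step's result to a JSON Lines file keyed by run ID, and skip journaled steps on a rerun. The function name and file layout are hypothetical, and step results must be JSON-serializable for this to work.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoints(run_id: str,
                         steps: list[tuple[str, Callable[[], object]]],
                         log_dir: Path) -> dict:
    """Run named steps, journaling each result; a rerun resumes after the last success."""
    log_path = log_dir / f"{run_id}.jsonl"
    done: dict[str, object] = {}
    if log_path.exists():
        # Rebuild prior progress from the journal.
        for line in log_path.read_text().splitlines():
            record = json.loads(line)
            done[record["step"]] = record["result"]
    with log_path.open("a") as log:
        for name, fn in steps:
            if name in done:
                continue                  # taken from the journal, not re-executed
            result = fn()                 # may raise; earlier steps stay journaled
            log.write(json.dumps({"step": name, "result": result}) + "\n")
            log.flush()
            done[name] = result
    return done
```

If the second of three steps raises, calling the function again with the same `run_id` skips the first step and retries from the second. The journal file is also the record you read to find the failure point.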

Naive context windows

Everything goes into the context window: conversation history, tool results, user documents, agent reasoning. Context window costs explode. Latency climbs. The model's attention gets diluted across a massive prompt. The fix: retrieval-augmented context. Keep what the agent needs for the current step in-window; retrieve everything else on demand.
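
A toy version of that fix: pin the essentials, then fill the remaining budget with the most relevant snippets. The keyword-overlap scoring and word-count token estimate are deliberate simplifications I am assuming for the example; a real system would use embeddings or a search index and the model's own tokenizer.

```python
def build_context(query: str, pinned: list[str], candidates: list[str],
                  budget_tokens: int) -> list[str]:
    """Keep pinned items in-window; fill the remaining budget with the best matches."""
    def tokens(text: str) -> int:
        return len(text.split())          # crude proxy for a real tokenizer

    query_terms = set(query.lower().split())

    def relevance(text: str) -> int:
        return len(query_terms & set(text.lower().split()))

    context = list(pinned)
    remaining = budget_tokens - sum(tokens(p) for p in pinned)
    for snippet in sorted(candidates, key=relevance, reverse=True):
        if relevance(snippet) == 0:
            break                         # nothing relevant is left
        if tokens(snippet) <= remaining:
            context.append(snippet)
            remaining -= tokens(snippet)
    return context


pinned = ["You are a support agent."]
candidates = [
    "Order A1 shipped on Monday via courier.",
    "The office cafeteria menu changes weekly.",
    "Refund policy: order refunds allowed within 30 days.",
]
context = build_context("where is order a1", pinned, candidates, budget_tokens=14)
```

With a 14-token budget, only the pinned instruction and the shipping snippet make it in-window. The refund policy is relevant but does not fit, and the cafeteria menu is never considered.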

No audit trail

You can't tell what the agent did, what tools it called, what decisions it made, or why. This is not just a debugging problem — it's a trust problem. Enterprise buyers will not adopt an agent system with no audit trail. Regulated industries cannot adopt it. Build the audit log on day one.

Vendor lock-in on the harness layer

Using a vendor's agent framework (LangChain, LlamaIndex, CrewAI, etc.) for the harness layer means you inherit their abstractions, their limitations, and their upgrade cycle. These frameworks are useful for prototyping. They are not production harnesses. The harness should be code you own and can change without a framework migration.

One model for everything

Using the most capable (most expensive) model for every task is a cost and latency mistake. Classification tasks don't need GPT-4. Summaries don't need Claude Opus. A routing layer that picks the cheapest model that can hit the bar for each task type is not premature optimization — it's basic production discipline.
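
A routing layer can be as simple as a table lookup. In this sketch the model names, prices, and quality numbers are invented for illustration; the quality figures stand in for pass rates you would measure with your own evaluations.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str                    # hypothetical identifiers, not real products
    cost_per_1k_tokens: float    # made-up prices for illustration
    quality: dict[str, float]    # assumed pass rate per task type, from your own evals


MODELS = [
    ModelOption("small", 0.1, {"classify": 0.97, "summarize": 0.85, "plan": 0.60}),
    ModelOption("medium", 1.0, {"classify": 0.98, "summarize": 0.94, "plan": 0.82}),
    ModelOption("large", 10.0, {"classify": 0.99, "summarize": 0.96, "plan": 0.95}),
]


def route(task_type: str, quality_bar: float) -> ModelOption:
    """Return the cheapest model whose measured quality clears the bar."""
    eligible = [m for m in MODELS if m.quality.get(task_type, 0.0) >= quality_bar]
    if not eligible:
        raise LookupError(f"no model meets {quality_bar} for {task_type}")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


classifier = route("classify", 0.95)
summarizer = route("summarize", 0.90)
planner_model = route("plan", 0.90)
```

With these numbers, classification goes to the small model, summaries to the medium one, and only planning pays for the large model.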

Build or buy.

The default advice is “don't build what you can buy.” That advice applies to the agent — not the harness.

Buy the model. You should not be training foundation models. OpenAI, Anthropic, Google, and Mistral have invested billions in training runs you can't replicate. Use their APIs.

Buy commodity harness components selectively. Memory storage (Postgres, vector databases) is infrastructure: buy it. Message queues are infrastructure: buy them. Monitoring dashboards: buy them.

Build the harness logic. The tool registry schema, the planning loop, the evaluation criteria, the routing logic, the audit format — these are business logic. They encode your specific requirements, your risk tolerance, your compliance constraints. You cannot outsource this to a vendor and expect it to fit your context.

The rule I use: if it's business logic, build it. If it's infrastructure, buy it. The harness layer is almost entirely business logic.

Reference architecture.

This is the architecture I use and recommend for production agent systems. It's not the only valid design — but it's the one that handles the failure modes I've seen most often.

Key design decisions:

  • The Transport Layer is between the human interface and the planner — not bolted on at the end. Every action goes through the queue. Every event is logged.
  • The Evaluator is a first-class component, not a post-processing step. It runs inline in the planning loop.
  • Model routing lives inside the evaluator, not in the planner. The evaluator decides whether the output is good enough, and whether to retry with a better model.
  • The Tool Registry is separate from the agent. Agents call the registry; the registry handles auth, rate limiting, and schema validation.
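
The second and third decisions can be sketched as a single loop: generate, evaluate inline, and escalate to a better model only when the score misses the bar. The function name, the stub models, and the toy evaluator are assumptions for illustration, not a fixed interface.

```python
from typing import Callable


def generate_with_escalation(task: str,
                             models: list[tuple[str, Callable[[str], str]]],
                             evaluator: Callable[[str], float],
                             threshold: float) -> dict:
    """Try models cheapest-first; the evaluator decides when to stop or escalate."""
    attempts = []
    for name, generate in models:
        output = generate(task)
        score = evaluator(output)
        attempts.append({"model": name, "score": score})
        if score >= threshold:
            return {"output": output, "model": name, "attempts": attempts}
    # No model cleared the bar: surface that explicitly, e.g. for human review.
    return {"output": None, "model": None, "attempts": attempts}


# Stub models and evaluator standing in for real API calls and real criteria.
models = [
    ("small", lambda task: "It shipped."),
    ("large", lambda task: "Order A1 shipped on Monday."),
]
result = generate_with_escalation(
    "status of order A1", models,
    evaluator=lambda out: 1.0 if "A1" in out else 0.0,
    threshold=0.8,
)
```

The planner never picks a model here. It hands the task over, and the evaluator's score determines whether the cheap answer is accepted or the next model gets a turn.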

The harness is the moat.

Models are commoditizing. GPT-4 from 18 months ago is now matched by open-source models you can run locally. The capability gap between model providers is narrowing, not widening.

The harness is where the durable advantage lives. The team with the best tool registry, the most robust memory architecture, the most trustworthy evaluator, and the cleanest transport layer will outcompete teams with a fancier model prompt — every time, in production.

Most teams are not thinking about this yet. They will be in 18 months.

The question is whether you want to build the harness right the first time, or re-architect it after the debt has compounded.

Build the harness right the first time.

I run workshops for engineering and product teams that are ready to move past prototype patterns. Two days on-site: reference architecture walk-through, decision log, build plan — customized to your stack and the harness gaps your team has already hit.