Agent Harness

Summary

The agent harness is the complete software infrastructure wrapping an LLM that transforms it from a stateless text generator into a capable agent: orchestration loop, tools, memory, context management, state persistence, error handling, and guardrails. The term was formalized in early 2026, with Anthropic's Claude Code documentation describing the SDK as "the agent harness that powers Claude Code" and OpenAI's Codex team equating "agent" and "harness" to mean the non-model infrastructure. The canonical formula, from LangChain's Vivek Trivedy: "If you're not the model, you're the harness."

Details

The Core Distinction

The "agent" is the emergent behavior — the goal-directed, tool-using, self-correcting entity the user interacts with. The harness is the machinery producing that behavior. When someone says "I built an agent," they built a harness and pointed it at a model.

Beren Millidge's 2023 essay "Scaffolded LLMs as Natural Language Computers" made this precise: a raw LLM is a CPU with no RAM, no disk, and no I/O. The context window is RAM (fast but limited), external databases are disk (large but slow), tool integrations are device drivers, and the harness is the operating system. Millidge's claim: "We have reinvented the Von Neumann architecture" — because it is a natural abstraction for any computing system. — [source: anatomy-of-agent-harness]

Evidence that harness design matters more than model choice: LangChain changed only the infrastructure wrapping their LLM (same model, same weights) and jumped from outside the top 30 to rank 5 on TerminalBench 2.0. A separate project hit a 76.4% pass rate by having an LLM optimize the infrastructure itself, surpassing hand-designed systems. — [source: anatomy-of-agent-harness]

Three Levels of Engineering

Three concentric levels of engineering surround the model:

  1. Prompt engineering — crafts the instructions the model receives
  2. Context engineering — manages what the model sees and when
  3. Harness engineering — encompasses both, plus the entire application infrastructure: tool orchestration, state persistence, error recovery, verification loops, safety enforcement, and lifecycle management

The harness is not a wrapper around a prompt. It is the complete system that makes autonomous agent behavior possible. See concepts/harness-engineering for the OpenAI Codex team's detailed treatment of harness engineering in practice.

The 11 Components of a Production Harness

Synthesized across Anthropic, OpenAI, LangChain, and the broader practitioner community:

1. Orchestration Loop

The heartbeat: implements the Thought-Action-Observation (TAO) cycle, also called the ReAct loop. Assemble prompt → call LLM → parse output → execute tool calls → feed results back → repeat until done. Anthropic describes their runtime as a "dumb loop" where all intelligence lives in the model. — [source: anatomy-of-agent-harness]
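A minimal sketch of that loop, assuming a stubbed model call so the example terminates deterministically (in a real harness, call_llm would hit a model API and TOOLS would be the registered tool handlers):

```python
def call_llm(messages):
    # Stub for a real model API call: request a tool once, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "add", "args": {"a": 2, "b": 3}}]}
    return {"content": "The sum is 5.", "tool_calls": []}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_message, max_turns=10):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):              # turn cap guards runaway loops
        reply = call_llm(messages)          # assemble prompt, call model
        calls = reply.get("tool_calls", [])
        if not calls:                       # no tool calls: final answer
            return reply["content"]
        for call in calls:                  # execute tools, feed results back
            result = TOOLS[call["name"]](**call["args"])
            messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("max turns exceeded")

print(run_agent("What is 2 + 3?"))  # -> The sum is 5.
```

All of the "intelligence" sits in call_llm; the loop itself only routes messages, which is what "dumb loop" means in practice.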

2. Tools

Defined as schemas (name, description, parameter types) injected into the LLM's context. The tool layer handles registration, schema validation, argument extraction, sandboxed execution, result capture, and formatting. Claude Code provides tools across six categories: file operations, search, execution, web access, code intelligence, and subagent spawning. OpenAI's Agents SDK supports function tools, hosted tools (WebSearch, CodeInterpreter, FileSearch), and MCP server tools. — [source: anatomy-of-agent-harness]
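A sketch of the registration/validation/execution path, with illustrative names (this mirrors the responsibilities listed above, not any specific SDK's API):

```python
TOOL_REGISTRY = {}

def register_tool(name, description, parameters):
    """Decorator storing a JSON-schema-style tool definition plus its handler."""
    def wrap(fn):
        TOOL_REGISTRY[name] = {
            "schema": {"name": name, "description": description,
                       "parameters": parameters},
            "handler": fn,
        }
        return fn
    return wrap

@register_tool("read_file", "Read a file's contents",
               {"path": {"type": "string", "required": True}})
def read_file(path):
    with open(path) as f:
        return f.read()

def execute_tool(name, args):
    tool = TOOL_REGISTRY[name]
    # Argument validation against the schema before execution.
    for param, spec in tool["schema"]["parameters"].items():
        if spec.get("required") and param not in args:
            return {"error": f"missing required argument: {param}"}
    try:
        return {"result": tool["handler"](**args)}
    except Exception as e:   # failures become results, not crashes (see component 8)
        return {"error": str(e)}
```

The schemas (not the handlers) are what gets injected into the model's context; the handlers run only on the harness side.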

3. Memory

Operates at multiple timescales. Short-term: conversation history within a single session. Long-term: persists across sessions — Anthropic uses CLAUDE.md and auto-generated MEMORY.md files; LangGraph uses namespace-organized JSON Stores; OpenAI supports Sessions backed by SQLite or Redis. Claude Code implements a three-tier hierarchy: lightweight index (~150 chars/entry, always loaded), detailed topic files on demand, and raw transcripts via search only. Critical principle: the agent treats its own memory as a "hint" and verifies against actual state before acting. — [source: anatomy-of-agent-harness]
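A sketch of the three-tier idea and the memory-as-hint check. The file layout, directory name, and 150-character cap here are illustrative, modeled on the description above:

```python
import os

MEMORY_DIR = "memory_topics"
os.makedirs(MEMORY_DIR, exist_ok=True)
with open(os.path.join(MEMORY_DIR, "build-system.md"), "w") as f:
    f.write("Project uses CMake; outputs land in build/.\nDetails: ...\n")

def load_index():
    """Tier 1: one short line per topic, cheap enough to always load."""
    entries = []
    for fname in sorted(os.listdir(MEMORY_DIR)):
        with open(os.path.join(MEMORY_DIR, fname)) as f:
            first_line = f.readline().strip()
        entries.append(f"{fname}: {first_line[:150]}")  # ~150 chars/entry
    return "\n".join(entries)

def load_topic(topic):
    """Tier 2: full detail file, loaded only when the agent asks for it."""
    with open(os.path.join(MEMORY_DIR, topic)) as f:
        return f.read()

def verified(hint_path):
    """Memory is a hint: confirm the referenced file still exists before acting."""
    return os.path.exists(hint_path)
```

Tier 3 (raw transcripts reachable only via search) would sit behind a search tool rather than ever being loaded wholesale.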

4. Context Management

Where many agents fail silently. Model performance degrades 30%+ when key content falls in mid-window positions (Chroma research, corroborated by Stanford's "Lost in the Middle"). Production strategies:

  • Compaction — summarizing conversation history when approaching limits (Claude Code preserves architectural decisions and unresolved bugs, discards redundant tool outputs)
  • Observation masking — JetBrains' Junie hides old tool outputs while keeping tool calls visible
  • Just-in-time retrieval — maintaining lightweight identifiers and loading data dynamically (Claude Code uses grep, glob, head, tail rather than loading full files)
  • Sub-agent delegation — each subagent explores extensively but returns only 1,000–2,000 token condensed summaries

Goal: find the smallest possible set of high-signal tokens that maximize likelihood of the desired outcome. — [source: anatomy-of-agent-harness]
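The first two strategies can be sketched together: a compaction trigger that fires near the window limit, masks old tool outputs, and summarizes the remaining head. The 80% trigger, 4-chars-per-token estimate, and summarize stub are all illustrative stand-ins:

```python
def estimate_tokens(messages):
    return sum(len(m["content"]) // 4 for m in messages)  # rough heuristic

def summarize(messages):
    # Stub for an LLM summarization call that would preserve key decisions.
    return "[summary of %d earlier messages]" % len(messages)

def compact(messages, limit=8000, keep_recent=5):
    if estimate_tokens(messages) < int(limit * 0.8):   # trigger near the limit
        return messages
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    # Observation masking: drop bulky old tool outputs from the head.
    head = [m for m in head if m["role"] != "tool"]
    return [{"role": "system", "content": summarize(head)}] + tail
```

Recent turns survive verbatim; everything older is either masked or collapsed into the summary message.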

5. Prompt Construction

Assembles what the model sees at each step. Hierarchical: system prompt, tool definitions, memory files, conversation history, current user message. OpenAI's Codex uses a strict priority stack: server-controlled system message (highest), tool definitions, developer instructions, user instructions (cascading AGENTS.md, 32 KiB limit), then conversation history. — [source: anatomy-of-agent-harness]
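A sketch of that layering as a priority stack. The 32 KiB instruction cap comes from the text above; the function shape and layer names are illustrative:

```python
def assemble_prompt(system, tool_defs, instructions, history, user_msg,
                    instruction_limit=32 * 1024):
    instructions = instructions[:instruction_limit]   # e.g. cascading AGENTS.md cap
    layers = [
        ("system", system),               # highest priority, server-controlled
        ("tools", tool_defs),             # tool definitions
        ("instructions", instructions),   # developer + user instruction files
    ]
    messages = [{"role": role, "content": text} for role, text in layers if text]
    messages += history                   # prior conversation turns
    messages.append({"role": "user", "content": user_msg})
    return messages
```

Putting the highest-priority layers first (and the live user message last) also matches the "important context at beginning and end" guidance from the Lost-in-the-Middle findings.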

6. Output Parsing

Modern harnesses rely on native tool calling (structured tool_calls objects rather than free-text parsing). The harness checks: tool calls present? Execute and loop. No tool calls? Final answer. For structured outputs, both OpenAI and LangChain support schema-constrained responses via Pydantic models. — [source: anatomy-of-agent-harness]
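The classification step reduces to a three-way branch. The response dict shape here loosely mirrors native tool-calling APIs but is not any specific SDK's:

```python
def classify_output(response):
    if response.get("handoff"):
        return ("handoff", response["handoff"])     # switch control to another agent
    if response.get("tool_calls"):
        return ("execute", response["tool_calls"])  # run tools, then loop again
    return ("final", response.get("content", ""))   # plain text ends the turn
```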

7. State Management

LangGraph models state as typed dictionaries flowing through graph nodes with reducers merging updates; checkpointing at super-step boundaries enables resume and time-travel debugging. OpenAI offers four mutually exclusive strategies: application memory, SDK sessions, Conversations API, or lightweight previous_response_id chaining. Claude Code uses git commits as checkpoints and progress files as structured scratchpads. — [source: anatomy-of-agent-harness]
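A sketch of the reducer idea in plain Python (illustrative names, not LangGraph's actual API): each state key gets a merge rule, and every super-step's result is checkpointed:

```python
import operator

REDUCERS = {
    "messages": operator.add,            # append-style keys accumulate
    "status": lambda old, new: new,      # scalar keys are overwritten
}

def apply_update(state, update):
    merged = dict(state)                 # never mutate the prior checkpoint
    for key, value in update.items():
        reducer = REDUCERS.get(key, lambda old, new: new)
        merged[key] = reducer(merged.get(key, type(value)()), value)
    return merged

checkpoints = []   # one snapshot per super-step enables resume and time-travel

def run_step(state, update):
    state = apply_update(state, update)
    checkpoints.append(state)
    return state
```

Because each checkpoint is a fresh dict, rewinding is just re-running from an earlier entry in checkpoints.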

8. Error Handling

A 10-step process with 99% per-step success still has only ~90.4% end-to-end success — errors compound fast. LangGraph distinguishes four error types: transient (retry with backoff), LLM-recoverable (return error as ToolMessage so the model adjusts), user-fixable (interrupt for human input), unexpected (bubble up for debugging). Anthropic catches failures within tool handlers and returns them as error results. Stripe caps retries at two. — [source: anatomy-of-agent-harness]
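The compounding arithmetic and the four-way classification can be sketched directly; the mapping from exception types to categories is illustrative:

```python
def end_to_end(p, steps):
    return p ** steps

# 10 steps at 99% each compound to roughly 90.4% overall.
print(round(end_to_end(0.99, 10), 3))  # -> 0.904

def handle_tool_error(exc, attempt, max_retries=2):  # Stripe-style retry cap
    if isinstance(exc, TimeoutError) and attempt < max_retries:
        return ("retry", 2 ** attempt)               # transient: backoff seconds
    if isinstance(exc, ValueError):
        return ("return_to_model", f"tool error: {exc}")  # LLM-recoverable
    if isinstance(exc, PermissionError):
        return ("ask_user", str(exc))                # user-fixable: interrupt
    return ("raise", str(exc))                       # unexpected: bubble up
```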

9. Guardrails and Safety

OpenAI's SDK implements three levels: input guardrails (on first agent), output guardrails (on final output), tool guardrails (on every invocation). A "tripwire" mechanism halts the agent immediately when triggered. Anthropic separates permission enforcement from model reasoning architecturally: the model decides what to attempt; the tool system decides what's allowed. Claude Code gates ~40 discrete tool capabilities independently across three stages: trust establishment at project load, permission check before each tool call, explicit user confirmation for high-risk operations. — [source: anatomy-of-agent-harness]
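A sketch of the tripwire pattern: guardrail checks run at defined points, and any trigger halts the run by raising. The specific checks here are toy examples:

```python
class TripwireTriggered(Exception):
    """Raised to halt the agent immediately when any guardrail fires."""

def no_secrets(text):
    return "BEGIN PRIVATE KEY" not in text

def no_destructive_shell(tool_name, args):
    return not (tool_name == "shell" and "rm -rf" in args.get("cmd", ""))

def check_input(text):                    # input guardrail: before the first agent
    if not no_secrets(text):
        raise TripwireTriggered("input guardrail: secret detected")

def check_tool(tool_name, args):          # tool guardrail: before every invocation
    if not no_destructive_shell(tool_name, args):
        raise TripwireTriggered("tool guardrail: destructive command")
```

Note the architectural point from above: these checks live entirely in the tool/harness layer, so the model can propose anything while the harness decides what actually runs.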

10. Verification Loops

What separates demos from production. Anthropic recommends three approaches: rules-based feedback (tests, linters, type checkers), visual feedback (screenshots via Playwright for UI tasks), and LLM-as-judge (separate subagent evaluates output). Boris Cherny (creator of Claude Code) noted that giving the model a way to verify its work improves quality by 2–3×. — [source: anatomy-of-agent-harness]
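A sketch of the rules-based variant: act, run a deterministic verifier, feed failures back, retry. Here the "verifier" just compiles the code and attempt_fix is a toy stand-in for a model revision call:

```python
def run_checks(code):
    """Deterministic verifier: stand-in for tests/linters/type checkers."""
    try:
        compile(code, "<agent-output>", "exec")
        return (True, "")
    except SyntaxError as e:
        return (False, str(e))

def attempt_fix(code, feedback):
    """Stub for a model call that revises code given verifier feedback."""
    return code.replace("retrun", "return")   # toy: fixes one known typo

def act_and_verify(code, max_attempts=3):
    feedback = ""
    for _ in range(max_attempts):
        ok, feedback = run_checks(code)
        if ok:
            return code
        code = attempt_fix(code, feedback)    # feedback loop, bounded attempts
    raise RuntimeError("verification still failing: " + feedback)
```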

11. Subagent Orchestration

Claude Code supports three execution models: Fork (byte-identical copy of parent context), Teammate (separate terminal pane with file-based mailbox), and Worktree (own git worktree, isolated branch per agent). OpenAI's SDK supports agents-as-tools (specialist handles bounded subtask) and handoffs (specialist takes full control). LangGraph implements subagents as nested state graphs. — [source: anatomy-of-agent-harness]

The TAO Loop in Motion

A single cycle traces through seven steps:

  1. Prompt Assembly — construct full input (important context at beginning and end per "Lost in the Middle")
  2. LLM Inference — send to model API; generate text, tool calls, or both
  3. Output Classification — text only → end; tool calls → execute; handoff → switch agent
  4. Tool Execution — validate arguments, check permissions, execute in sandbox, capture results (read-only concurrent; mutating serial)
  5. Result Packaging — format as LLM-readable messages; errors returned as error results for self-correction
  6. Context Update — append to history; trigger compaction if approaching window limit
  7. Loop — return to step 1

Termination conditions are layered: no tool calls, max turn limit, token budget exhausted, guardrail tripwire, user interrupt, or safety refusal. Simple questions: 1–2 turns. Complex refactoring: dozens of tool calls across many turns.
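Those layered conditions reduce to a single check run each turn; thresholds here are illustrative defaults:

```python
def should_stop(turn, tokens_used, reply, *, max_turns=50,
                token_budget=200_000, tripwire=False, user_interrupt=False):
    """Return a stop reason, or None to keep looping."""
    if not reply.get("tool_calls"):
        return "final_answer"        # model produced text only
    if turn >= max_turns:
        return "turn_limit"
    if tokens_used >= token_budget:
        return "budget_exhausted"
    if tripwire:
        return "guardrail"
    if user_interrupt:
        return "interrupted"
    return None
```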

For long-running tasks spanning multiple context windows, Anthropic developed a two-phase "Ralph Loop": an Initializer Agent sets up the environment (init script, progress file, feature list, initial git commit), then a Coding Agent in every subsequent session reads git logs and progress files to orient itself, picks the highest-priority incomplete feature, works on it, commits, and writes summaries. The filesystem provides continuity. — [source: anatomy-of-agent-harness]
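A sketch of the filesystem-continuity part of that pattern: an initializer writes a progress file, and each later session re-orients from disk and picks the highest-priority open item. The JSON layout and function names are illustrative, not Anthropic's implementation:

```python
import json

PROGRESS = "progress.json"

def initialize(features):
    """Initializer Agent: persist the feature list before any coding session."""
    state = {"features": [{"name": f, "priority": i, "done": False}
                          for i, f in enumerate(features)]}
    with open(PROGRESS, "w") as f:
        json.dump(state, f)

def next_feature():
    """Coding Agent: re-orient from disk, pick the highest-priority open item."""
    with open(PROGRESS) as f:
        state = json.load(f)
    open_items = [x for x in state["features"] if not x["done"]]
    return min(open_items, key=lambda x: x["priority"])["name"] if open_items else None

def mark_done(name):
    with open(PROGRESS) as f:
        state = json.load(f)
    for item in state["features"]:
        if item["name"] == name:
            item["done"] = True
    with open(PROGRESS, "w") as f:
        json.dump(state, f)
```

In the full pattern, git commits and session summaries play the same role as this file: durable state that outlives any single context window.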

Framework Implementations

Anthropic's Claude Agent SDK — exposes the harness through a single query() function returning an async iterator. The runtime is a "dumb loop." Claude Code uses a Gather-Act-Verify cycle: gather context (search, read) → take action (edit, run) → verify (test, check) → repeat. — [source: anatomy-of-agent-harness]

OpenAI's Agents SDK — implements the harness through the Runner class (async, sync, streamed modes). Code-first: workflow logic in native Python, not graph DSLs. Codex extends with a three-layer architecture: Codex Core (agent + runtime), App Server (bidirectional JSON-RPC API), client surfaces (CLI, VS Code, web app). All surfaces share the same harness. — [source: anatomy-of-agent-harness]

LangGraph — models the harness as an explicit state graph. Two nodes (llm_call and tool_node) connected by a conditional edge. Evolved from LangChain's deprecated AgentExecutor (v0.2). LangChain's Deep Agents use the term "agent harness" explicitly: built-in tools, planning (write_todos), file systems for context, subagent spawning, persistent memory. — [source: anatomy-of-agent-harness]

CrewAI — role-based multi-agent: Agent (harness around LLM, defined by role/goal/backstory/tools), Task (unit of work), Crew (collection). Flows layer adds a "deterministic backbone with intelligence where it matters." — [source: anatomy-of-agent-harness]

AutoGen (→ Microsoft Agent Framework) — conversation-driven orchestration. Three-layer architecture (Core, AgentChat, Extensions); five patterns: sequential, concurrent (fan-out/fan-in), group chat, handoff, magentic (manager agent with dynamic task ledger). — [source: anatomy-of-agent-harness]

The Scaffolding Metaphor

Scaffolding is temporary infrastructure that enables builders to reach places they couldn't otherwise. It doesn't do the construction. The key insight: scaffolding is removed when the building is complete. As models improve, harness complexity should decrease. Manus was rebuilt five times in six months, each rewrite removing complexity — complex tool definitions became general shell execution; "management agents" became simple structured handoffs.

Co-evolution principle: models are now post-trained with specific harnesses in the loop. Claude Code's model learned to use the specific harness it was trained with. Changing tool implementations can degrade performance because of this tight coupling.

The "future-proofing test": if performance scales up with more powerful models without adding harness complexity, the design is sound. — [source: anatomy-of-agent-harness]

Seven Decisions That Define Every Harness

  1. Single-agent vs. multi-agent — both Anthropic and OpenAI say: maximize a single agent first. Multi-agent adds overhead. Split only when tool overload exceeds ~10 overlapping tools or clearly separate task domains exist
  2. ReAct vs. plan-and-execute — ReAct interleaves reasoning and action (flexible, higher per-step cost). Plan-and-execute separates them. LLMCompiler reports 3.6× speedup over sequential ReAct
  3. Context window management — five strategies: time-based clearing, conversation summarization, observation masking, structured note-taking, sub-agent delegation. ACON research showed 26–54% token reduction while preserving 95%+ accuracy
  4. Verification loop design — computational (tests, linters: deterministic) vs. inferential (LLM-as-judge: catches semantic issues but adds latency). Thoughtworks (via Martin Fowler's site) frames these as "guides" (feedforward) vs. "sensors" (feedback)
  5. Permission and safety architecture — permissive (fast, risky) vs. restrictive (safe, slow); depends on deployment context
  6. Tool scoping strategy — more tools often means worse performance. Vercel removed 80% of tools from v0 and got better results. Claude Code achieves 95% context reduction via lazy loading. Principle: minimum tool set for the current step
  7. Harness thickness — how much logic lives in the harness vs. the model. Anthropic bets on thin harnesses and model improvement. Graph-based frameworks bet on explicit control. Anthropic regularly deletes planning steps as new model versions internalize that capability

[source: anatomy-of-agent-harness]

Key Claims & Data Points

  • LangChain jumped from outside top 30 to rank 5 on TerminalBench 2.0 by changing only the harness (same model, same weights) — [source: anatomy-of-agent-harness]
  • Separate project hit 76.4% pass rate by having an LLM optimize its own infrastructure — [source: anatomy-of-agent-harness]
  • Model performance degrades 30%+ when key content falls in mid-window positions ("Lost in the Middle") — [source: anatomy-of-agent-harness]
  • A 10-step process at 99% per-step success yields only ~90.4% end-to-end success — [source: anatomy-of-agent-harness]
  • Verification loops improve quality 2–3× (Boris Cherny, Claude Code creator) — [source: anatomy-of-agent-harness]
  • LLMCompiler reports 3.6× speedup over sequential ReAct — [source: anatomy-of-agent-harness]
  • ACON research: 26–54% token reduction preserving 95%+ accuracy — [source: anatomy-of-agent-harness]
  • Vercel removed 80% of tools from v0 and got better results — [source: anatomy-of-agent-harness]
  • Claude Code achieves 95% context reduction via lazy loading — [source: anatomy-of-agent-harness]
  • Claude Code gates ~40 discrete tool capabilities independently — [source: anatomy-of-agent-harness]
  • Manus rebuilt five times in six months, each time removing harness complexity — [source: anatomy-of-agent-harness]

Open Questions

  • How does the co-evolution principle affect portability — can a harness designed for Claude work well with GPT or Gemini, or is tight model-harness coupling the norm? (raised by: concepts/agent-harness, 2026-04-16)
  • What is the empirical cost/benefit of multi-agent vs. single-agent for real production systems — where does the ~10 tool threshold come from? (raised by: concepts/agent-harness, 2026-04-16)
  • How does ACON's 26–54% token reduction technique work in detail, and is it implementable in existing frameworks? (raised by: concepts/agent-harness, 2026-04-16)
  • What does the "Ralph Loop" look like in non-coding domains — can the Initializer/Coding Agent pattern generalize to research, writing, or data analysis? (raised by: concepts/agent-harness, 2026-04-16)

Sources

  • The Anatomy of an Agent Harness — @akshay_pachaar thread (Apr 6, 2026); cross-industry synthesis of agent harness components across Anthropic, OpenAI, LangChain, CrewAI, and AutoGen