Harness Engineering¶
Summary¶
Harness engineering is the OpenAI Codex team's term for the human engineering role in an agent-first codebase (see also concepts/agent-harness for the general concept): designing environments, specifying intent, and building feedback loops that let agents do reliable work, rather than writing code directly. In a five-month experiment starting August 2025, three engineers (later seven) used Codex to build and ship an internal product with zero manually written code, resulting in ~1 million lines of code and ~1,500 PRs. The primary lessons concern context management, enforced architecture, and continuous entropy cleanup rather than prompting technique.
Details¶
The Experiment¶
- Start: August 2025, empty git repository
- Constraint: 0 lines of manually written code — everything (application logic, tests, CI, docs, observability, tooling) written by Codex
- Scale: ~1M lines of code; ~1,500 PRs opened and merged; 3 engineers → 7 engineers
- Throughput: Average 3.5 PRs per engineer per day — and throughput increased as the team grew
- Estimate: Built in ~1/10th the time equivalent hand-written code would have taken
- Users: Internal daily users plus external alpha testers; the product ships, breaks, and gets fixed
Redefining Engineering Work¶
Human engineers no longer write code. Their work becomes:
- ==Prioritization — deciding what to build==
- ==Translating feedback — converting user/reviewer feedback into acceptance criteria for agents==
- ==Environment building — giving agents the tools, abstractions, and structure to make progress==
- ==Outcome validation — verifying that what agents produced is correct==
When something failed, the fix was almost never "try harder." Engineers asked: "What capability is missing, and how do we make it legible and enforceable for the agent?" — then had Codex write the fix.
Context Management: The Table of Contents Approach¶
Context management was the team's biggest lesson in making agents effective at large, complex tasks.
What failed: A single large AGENTS.md ("one big instruction manual"):
- Context is scarce — a giant file crowds out the task, code, and relevant docs
- When everything is "important," nothing is — agents pattern-match locally instead of navigating
- It rots instantly — a monolithic manual becomes stale; agents can't tell what's still true
- Hard to verify — a single blob doesn't lend itself to automated freshness/cross-link checks
==What works: AGENTS.md as a table of contents (~100 lines), pointing to a structured docs/ directory as the system of record:==
AGENTS.md ← short map (~100 lines)
ARCHITECTURE.md
docs/
├── design-docs/ ← design decisions, core beliefs
├── exec-plans/ ← active, completed, tech-debt-tracker
├── generated/ ← db-schema.md (auto-generated)
├── product-specs/ ← indexed product specs
├── references/ ← llms.txt files for key dependencies
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
This enables progressive disclosure: agents start with a small, stable entry point and navigate to deeper sources rather than being overwhelmed upfront.
Freshness is enforced mechanically: dedicated linters and CI jobs validate the knowledge base is up to date, cross-linked, and structured correctly. A recurring "doc-gardening" agent scans for stale documentation and opens fix-up PRs.
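The post doesn't show these linters. A minimal sketch of one such CI check, verifying that every relative link in AGENTS.md resolves to a real file (the file name, regex, and error wording are illustrative assumptions):

```typescript
// docs-lint.ts: hypothetical cross-link check for the docs knowledge base.
import { existsSync, readFileSync } from "node:fs";
import { dirname, resolve } from "node:path";

// Match relative markdown links like [plans](docs/PLANS.md); skip http(s) URLs.
const LINK_RE = /\[[^\]]*\]\((?!https?:)([^)#\s]+)/g;

function brokenLinks(file: string): string[] {
  const errors: string[] = [];
  for (const match of readFileSync(file, "utf8").matchAll(LINK_RE)) {
    const target = resolve(dirname(file), match[1]);
    if (!existsSync(target)) {
      // The message doubles as a remediation instruction for the agent.
      errors.push(`${file}: link to ${match[1]} is broken; create the file or fix the index.`);
    }
  }
  return errors;
}

const problems = brokenLinks("AGENTS.md");
if (problems.length > 0) {
  console.error(problems.join("\n"));
  process.exit(1); // fail CI so the docs get fixed before merge
}
```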
Repository as System of Record¶
The key principle: Anything not in the repository doesn't exist from the agent's point of view.
- Knowledge in Slack threads, Google Docs, or people's heads is inaccessible to agents
- That architectural alignment discussion? If it isn't in the repo, it might as well not have happened
- ==Pushed toward repo-local, versioned artifacts: code, markdown, schemas, executable plans==
Implications:
- Prefer "boring" technologies — composable, stable APIs, well-represented in training data
- In some cases, cheaper to have agents reimplement functionality than work around opaque upstream behavior (e.g., implemented a custom map-with-concurrency helper rather than using p-limit, because it could be fully instrumented and tested)
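A minimal sketch of such a helper, assuming a typical signature (the post mentions the helper but doesn't show its implementation):

```typescript
// mapWithConcurrency: run `fn` over `items` with at most `limit` in flight,
// preserving input order in the results. Name and signature are assumed.
export async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker claims the next unprocessed index until none remain.
  // Claiming is safe: JavaScript is single-threaded between awaits.
  const worker = async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  };

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}
```

Because the helper lives in the repo, every await point can be instrumented and tested directly, which is the tradeoff the team cites for reimplementing rather than depending on p-limit.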
Enforcing Architecture and Taste¶
Documentation alone doesn't keep a fully agent-generated codebase coherent.
Architectural invariants over micromanaged implementations: The codebase uses a rigid domain/layer model with strictly validated dependency directions. Code within a business domain can only depend "forward" through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). These constraints are enforced mechanically via custom linters (also Codex-generated).
Taste invariants: Statically enforced rules for structured logging, naming conventions, file size limits, platform reliability requirements. Error messages in linters are written to inject remediation instructions into agent context.
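These linters aren't published. A hypothetical sketch of the dependency-direction check, with a `src/<domain>/<layer>/` path convention assumed and the error message written as remediation the agent can act on:

```typescript
// layer-lint.ts: hypothetical check that imports only point "backward"
// toward more foundational layers (Types -> ... -> UI).
const LAYERS = ["types", "config", "repo", "service", "runtime", "ui"] as const;
type Layer = (typeof LAYERS)[number];

// Assumes a src/<domain>/<layer>/... path convention (not documented in the post).
function layerOf(path: string): Layer | undefined {
  const segment = path.split("/")[2];
  return LAYERS.find((layer) => layer === segment);
}

// Returns an error string if `fromFile` imports a layer later in the chain.
function checkImport(fromFile: string, toFile: string): string | null {
  const from = layerOf(fromFile);
  const to = layerOf(toFile);
  if (!from || !to) return null; // outside the layered domain model
  if (LAYERS.indexOf(to) <= LAYERS.indexOf(from)) return null; // allowed
  // Remediation instructions land directly in the agent's context.
  return (
    `${fromFile} (${from} layer) may not import ${toFile} (${to} layer). ` +
    `Dependencies flow ${LAYERS.join(" -> ")}; move the shared logic into ` +
    `the ${from} layer or below, then re-run this linter.`
  );
}
```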
Effect: "In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once."
The resulting code doesn't always match human stylistic preferences — that's acceptable as long as it's correct, maintainable, and legible to future agent runs.
Increasing Application Legibility¶
As code throughput increased, the bottleneck became human QA capacity. Solution: make the application itself legible to the agent:
- UI legibility: Made the app bootable per git worktree; wired Chrome DevTools Protocol into the agent runtime; created skills for DOM snapshots, screenshots, and navigation. Agents can now reproduce bugs, validate fixes, and reason about UI directly.
- Observability legibility: Logs, metrics, and traces exposed via a local ephemeral observability stack per worktree. Agents can query logs with LogQL and metrics with PromQL. Prompts like "ensure service startup completes in under 800ms" become tractable.
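The post gives no query examples. A hypothetical sketch of how an agent skill might turn that 800ms prompt into a concrete pass/fail check against the local Prometheus endpoint (the metric name and port are assumptions):

```typescript
// Hypothetical: check p99 startup latency against the per-worktree
// observability stack. Metric name and port are illustrative assumptions.
const PROM_URL = "http://localhost:9090";

async function startupP99Ms(): Promise<number> {
  const query =
    "histogram_quantile(0.99, sum(rate(service_startup_duration_seconds_bucket[5m])) by (le))";
  const res = await fetch(`${PROM_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = await res.json();
  // Prometheus instant-query results arrive as [timestamp, "value"] pairs.
  return parseFloat(body.data.result[0]?.value[1] ?? "NaN") * 1000;
}

// "Ensure service startup completes in under 800ms" becomes a concrete check.
const p99 = await startupP99Ms();
if (!(p99 < 800)) {
  throw new Error(`startup p99 is ${p99.toFixed(0)}ms; budget is 800ms`);
}
```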
Single Codex runs regularly work on a task for 6+ hours (while humans sleep).
Agent Autonomy Milestones¶
The repository recently crossed a threshold where a single prompt can drive a complete feature cycle end-to-end:
- ==Validate current codebase state==
- ==Reproduce a reported bug==
- ==Record a video demonstrating the failure==
- ==Implement a fix==
- ==Validate the fix by driving the application==
- ==Record a second video demonstrating resolution==
- ==Open a pull request==
- ==Respond to agent and human feedback==
- ==Detect and remediate build failures==
- ==Escalate to human only when judgment is required==
- ==Merge the change==
Note: this behavior depends on the specific structure and tooling of this repository — it should not be assumed to generalize without similar investment.
Entropy and Garbage Collection¶
Full agent autonomy introduces drift: agents replicate patterns that exist in the repo, including uneven or suboptimal ones.
Failed approach: Manual cleanup — the team spent every Friday (20% of the week) cleaning "AI slop." Didn't scale.
==Working approach: "Golden principles" + automated continuous cleanup:==
- ==Golden principles: opinionated mechanical rules (e.g., prefer shared utility packages over hand-rolled helpers; validate data at boundaries rather than probing "YOLO-style"; see the sketch after this list)==
- ==Background Codex tasks scan for deviations, update quality grades, and open targeted refactoring PRs on a regular cadence==
- ==Most cleanup PRs can be reviewed in under a minute and auto-merged==
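As an illustration of the boundary-validation principle, a sketch using zod (the post names no specific library; the schema and names are invented for the example):

```typescript
// Hypothetical boundary validation: parse untrusted input once, at the edge,
// instead of probing fields "YOLO-style" throughout the core.
import { z } from "zod";

const TaskEvent = z.object({
  id: z.string(),
  kind: z.enum(["created", "completed"]),
  durationMs: z.number().nonnegative().optional(),
});
type TaskEvent = z.infer<typeof TaskEvent>;

export function handleWebhook(payload: unknown): TaskEvent {
  // Throws on malformed input here, so everything past the boundary can
  // trust the shape without defensive checks.
  return TaskEvent.parse(payload);
}
```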
Framed as garbage collection: technical debt is a high-interest loan — pay it continuously in small increments rather than in painful bursts.
Merge Philosophy¶
High throughput changes conventional norms:
- Minimal blocking merge gates
- Short-lived pull requests
- Test flakes addressed with follow-up runs rather than blocking progress
- "In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive"
This would be irresponsible in a low-throughput environment. At this scale, it's often the right tradeoff.
Key Claims & Data Points¶
- 0 manually written lines in a ~1M line codebase shipped to real users — [source: harness_engineering]
- ~1,500 PRs merged by 3–7 engineers; avg 3.5 PRs/engineer/day — [source: harness_engineering]
- Built in ~1/10th the time of equivalent hand-written code — [source: harness_engineering]
- Single large AGENTS.md fails; table-of-contents AGENTS.md (~100 lines) + structured docs/ is the working pattern — [source: harness_engineering]
- 20% of engineering time (Fridays) spent on manual AI slop cleanup before automated garbage collection — [source: harness_engineering]
- Single Codex runs regularly run for 6+ hours unattended — [source: harness_engineering]
Open Questions¶
- What is the repository's domain/layer architecture rule and how are cross-cutting concerns handled beyond the high-level description? (raised by: concepts/harness-engineering, 2026-04-03)
- How does architectural coherence evolve over years in a fully agent-generated system — does drift compound despite garbage collection? (raised by: concepts/harness-engineering, 2026-04-03)
- What are the "golden principles" in full — can they be generalized for other codebases? (raised by: concepts/harness-engineering, 2026-04-03)
- How does the Aardvark agent (mentioned) relate to Codex and what role does it play in the development loop? (raised by: concepts/harness-engineering, 2026-04-03)
Related Articles¶
- concepts/agent-harness
- concepts/ai-inflection-point
- concepts/agentic-engineering
- concepts/agentic-workflows
- concepts/claude-code
Sources¶
- Harness engineering: leveraging Codex in an agent-first world — OpenAI blog post by Ryan Lopopolo, Mar 2026; account of building a million-line product with zero manually written code
- The Anatomy of an Agent Harness — @akshay_pachaar thread (Apr 6, 2026); cross-industry synthesis of the agent harness concept