Compiler Paradigm for Agent Output Verification

The compiler analogy: we don't review compiler output because we've invested decades in the apparatus around compilation — type systems, testing, monitoring. The same apparatus is what's missing for coding agents.

Summary

Hugo Venturini of SkipLabs argues that resistance to "lights-out codebases" isn't irrational: it's the correct response to a genuine gap. For coding agents, we haven't yet built the upstream (formal specification layers), verification (AI-checks-AI CI), and downstream (production instrumentation) infrastructure equivalent to what makes trusting compiler output reasonable. The compiler didn't eliminate the need for correctness; it relocated enforcement. That relocation is the project in front of us for coding agents.

Details

The Core Analogy

Venturini draws a direct parallel between compilers and coding agents:

  • Compilers are trusted not blindly but because of their surrounding apparatus: type systems constrain output, tests validate behavior, reproducible builds enable verification, and fuzzing, sanitizers, and formal verification provide deeper checks
  • Coding agents produce output that we still treat as "junior developer output" — requiring human eyeballs before it's "real"
  • Code reviews worked when volume was manageable, making them the primary quality gate. They're structurally impossible at agent throughput (e.g., 417 PRs/day)
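To make the throughput point concrete, here is a back-of-the-envelope sketch. Only the 417 PRs/day figure comes from the text; the 20 minutes per review and 6 focused review hours per person per day are illustrative assumptions, not numbers from Venturini's post.

```python
import math

def reviewer_hours(prs_per_day: int, minutes_per_review: float) -> float:
    """Total human review time needed per day, in hours."""
    return prs_per_day * minutes_per_review / 60

def reviewers_needed(prs_per_day: int, minutes_per_review: float,
                     focused_hours_per_reviewer: float) -> int:
    """Full-time reviewers required to keep up, rounded up to whole people."""
    hours = reviewer_hours(prs_per_day, minutes_per_review)
    return math.ceil(hours / focused_hours_per_reviewer)

# 417 PRs/day is from the text; 20 min/review and 6 h/day are assumptions.
print(reviewer_hours(417, 20))       # 139.0
print(reviewers_needed(417, 20, 6))  # 24
```

Under these assumed numbers, human review alone would consume about two dozen engineers doing nothing else, which is the sense in which review stops scaling as the primary gate.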

What's Actually Missing

Upstream gap: No formal specification layers that agents verifiably execute against. Prompts and scaffolding are not type systems. TDD, contract testing, and design-by-specification exist but aren't standard when agents are the author.
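As a minimal sketch of what a machine-checkable specification could look like, the snippet below expresses a contract as a precondition and postcondition and checks candidate implementations against it. The Contract class, the violations helper, and the dedupe_sorted task are all hypothetical illustrations, not something the source describes.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Contract:
    """Hypothetical executable spec: what must hold, independent of who wrote the code."""
    name: str
    precondition: Callable[[List[int]], bool]
    postcondition: Callable[[List[int], List[int]], bool]

def violations(impl: Callable[[List[int]], List[int]],
               contract: Contract,
               cases: List[List[int]]) -> List[List[int]]:
    """Return every input on which impl breaks the contract."""
    failures = []
    for case in cases:
        if not contract.precondition(case):
            continue  # input is outside the contract; nothing is promised
        result = impl(list(case))  # copy so impl cannot mutate the case
        if not contract.postcondition(case, result):
            failures.append(case)
    return failures

# Spec: output is the distinct input values in ascending order.
DEDUPE_SORTED = Contract(
    name="dedupe_sorted",
    precondition=lambda xs: True,
    postcondition=lambda xs, out: out == sorted(set(xs)),
)

def candidate_ok(xs: List[int]) -> List[int]:
    return sorted(set(xs))

def candidate_buggy(xs: List[int]) -> List[int]:
    # Plausible-looking output: removes duplicates but keeps input order.
    return list(dict.fromkeys(xs))

CASES = [[], [1], [3, 1, 3, 2], [2, 2, 2]]
print(violations(candidate_ok, DEDUPE_SORTED, CASES))     # []
print(violations(candidate_buggy, DEDUPE_SORTED, CASES))  # [[3, 1, 3, 2]]
```

The point of the design is that the contract, not a prompt, is the artifact a candidate is judged against, so the same check applies whether a human or an agent wrote the implementation.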

Verification gap: Compilers are deterministic — same input always produces same output. Agents are not. A test suite must be substantially more comprehensive to catch plausible-but-subtle errors at 50x the rate the author produces them. AI-checks-AI and dedicated security agents are the right direction but nascent.
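One way to widen a test suite beyond hand-picked examples is randomized checking against a trusted oracle. The sketch below is an assumption-laden illustration: candidate_median stands in for agent output with a subtle bug, and find_counterexample is a hypothetical helper. It uses only the Python standard library; a dedicated property-based testing tool would be a natural substitute.

```python
import random
import statistics
from typing import Callable, List, Optional

def candidate_median(xs: List[int]) -> float:
    """Stand-in for agent output: right for odd lengths, subtly wrong for even ones."""
    ordered = sorted(xs)
    return ordered[len(ordered) // 2]

def find_counterexample(candidate: Callable[[List[int]], float],
                        oracle: Callable[[List[int]], float],
                        trials: int = 500,
                        seed: int = 0) -> Optional[List[int]]:
    """Search random inputs for one where candidate and oracle disagree."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 6))]
        if candidate(xs) != oracle(xs):
            return xs
    return None

# A single hand-written example passes, so a thin test suite would accept the bug.
assert candidate_median([1, 2, 3]) == 2

# Randomized search against statistics.median exposes an even-length input.
counterexample = find_counterexample(candidate_median, statistics.median)
print(counterexample is not None)
```

Seeding the generator restores some of the determinism that the agent itself lacks: the same candidate always faces the same inputs, so a failing check can be replayed.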

Downstream gap: Production monitoring (canary deploys, feature flags, observability) is mature for traditional code but hasn't been habitually applied to agent-generated changes.
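A canary gate for an agent-generated change could be sketched as follows. The Window type, the canary_decision function, and both thresholds (minimum sample size, allowed excess error rate) are hypothetical choices for illustration; real deployments would tune them and usually track more signals than error rate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Window:
    """Request and error counts observed over one monitoring window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

def canary_decision(baseline: Window, canary: Window,
                    min_requests: int = 500,
                    max_excess: float = 0.005) -> str:
    """Decide whether to promote, roll back, or keep observing a canary.

    Assumed policy: wait for enough traffic, then roll back if the canary's
    error rate exceeds the baseline's by more than max_excess.
    """
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate > baseline.error_rate + max_excess:
        return "rollback"
    return "promote"

baseline = Window(requests=10_000, errors=20)                # 0.2% errors
print(canary_decision(baseline, Window(100, 0)))             # wait
print(canary_decision(baseline, Window(1_000, 30)))          # rollback
print(canary_decision(baseline, Window(1_000, 3)))           # promote
```

Nothing here is specific to agents, which is the text's point: the mechanism is mature, and the missing habit is routing every agent-authored change through it.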

Hardware Analogy

Hardware chip companies already use black-box components verified by acceptance tests rather than human review. Chip verification is a discipline with tooling, formal methods, and teams whose sole job is designing test harnesses. Software needs its equivalent — potentially harder because software has "far less formal specification culture than hardware."
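The hardware style of acceptance testing can be illustrated with a toy example: a 1-bit full adder treated as a black box and checked exhaustively against its arithmetic definition. The component and harness names are hypothetical, and real chip verification is far more elaborate than this sketch.

```python
from itertools import product
from typing import Callable, List, Tuple

Bits = Tuple[int, int, int]
Adder = Callable[[int, int, int], Tuple[int, int]]

def black_box_full_adder(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """Opaque component under test; the harness never inspects this body."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out

def faulty_full_adder(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """Deliberately broken variant: ignores carries generated by carry_in."""
    return a ^ b ^ carry_in, a & b

def acceptance_failures(component: Adder) -> List[Bits]:
    """Run all 8 input vectors and return those where the output is wrong."""
    failures = []
    for a, b, c in product((0, 1), repeat=3):
        s = a + b + c
        expected = (s % 2, s // 2)  # (sum bit, carry bit) from plain arithmetic
        if component(a, b, c) != expected:
            failures.append((a, b, c))
    return failures

print(acceptance_failures(black_box_full_adder))  # []
print(acceptance_failures(faulty_full_adder))     # [(0, 1, 1), (1, 0, 1)]
```

Exhaustive checking is feasible here only because the input space is tiny and the specification is exact; software components rarely offer either, which matches the text's remark that the software equivalent is potentially harder.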

Key Claims

  • The volume of agent-generated PRs isn't the core problem; it's a symptom of treating agent output like junior developer output (see concepts/harness-engineering)
  • Trust in compilers is earned through decades of apparatus investment, not blind faith
  • The compiler analogy is clarifying because it shows where quality control moves in a mature pipeline: upstream (specification) and downstream (verification), with the middle step of human review of the output eliminated
  • Hardware companies already do what software is just beginning to need: black-box components with formal test harnesses
  • The real question isn't "can we trust agent output?" but "have we built the infrastructure that makes that trust reasonable?"

Counter-Arguments & Limits

This framework has been challenged by entities/lars-faye (primary thesis: concepts/agentic-coding-trap), who argues that the entire apparatus-building premise presupposes the existence of humans skilled enough to build and maintain it — and that agentic workflows actively degrade that skill set. The critique: if the very act of adopting agentic coding erodes the engineers' ability to judge code quality, then the apparatus they build may be built on degraded judgment.

Additionally, the "skilled orchestrator problem" means that the apparatus itself may be over-engineered: teams without deep domain expertise may build verification layers that don't catch the right classes of bugs, creating false confidence.

Open Questions

  • What does a "formal specification layer for agents" look like concretely — is it a new language, an extension of existing DSLs, or a formalization of existing patterns like design-by-specification?
  • How do existing formal specification cultures (hardware, Rust) inform what could work for software agent specifications?
  • What's the minimum viable verification layer for agents: AI-checks-AI as CI, or something more structural?
  • How does Venturini's take differ from the "No More Code Reviews" argument in Philip Su's original post (entities/philip-su)?
  • Does the apparatus-building approach solve the problem, or just defer it — given concepts/agentic-coding-trap's claim that the apparatus builders themselves are degrading?

Sources