Harness engineering: building reliable AI coding agents
Harness engineering is building a complete working environment around an AI coding agent—instructions, state tracking, verification, scope constraints, and session lifecycle—to make agents reliable on real codebases, not just benchmarks.
Developers using AI coding agents can dramatically improve reliability and success rates on real codebases by implementing a structured harness—instructions, state tracking, verification, scope constraints, and session lifecycle—rather than relying on model strength alone.
Harness engineering is building a complete working environment around an AI coding agent so it produces reliable results. It is not about writing better prompts; it is about designing the system the model operates inside. The article, distilled from the Learn Harness Engineering course by WalkingLabs, synthesizes harness engineering theory and practice from OpenAI, Anthropic, and industry practitioners.
Every effective harness has exactly five subsystems:
- Instructions: tell the agent what to do, in what order, and what to read first, via files like AGENTS.md and CLAUDE.md.
- State: track what's done, in progress, and next, persisted to disk via claude-progress.md, feature_list.json, and git log.
- Verification: only passing tests count as evidence; the agent cannot declare victory without proof from tests, lint, type-check, and smoke runs.
- Scope: constrain the agent to one feature at a time, with no overreach.
- Session Lifecycle: initialize at start, clean up at end, and leave a clean restart path via init.sh and handoff notes.
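The State, Verification, and Scope subsystems can be sketched together in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names (`next_feature`, `verify`, `complete_feature`) and the `feature_list.json` schema (a JSON array of objects with `id` and `status` fields) are assumptions, since the article does not specify a format.

```python
import json
import subprocess
import sys
from pathlib import Path


def next_feature(features):
    """Scope: return the one in-progress feature, else the first pending one."""
    for wanted in ("in_progress", "pending"):
        for feature in features:
            if feature["status"] == wanted:
                return feature
    return None


def verify(commands, cwd="."):
    """Verification: every command must exit 0; stop at the first failure."""
    for cmd in commands:
        result = subprocess.run(cmd, cwd=cwd)
        if result.returncode != 0:
            return False
    return True


def complete_feature(state_file, commands, cwd="."):
    """Mark the current feature done only if verification passes.

    The state file layout is hypothetical: a JSON array of
    {"id": ..., "status": "pending" | "in_progress" | "done"} objects.
    """
    path = Path(state_file)
    features = json.loads(path.read_text())
    feature = next_feature(features)
    if feature is None:
        return None  # nothing left to do
    passed = verify(commands, cwd)
    feature["status"] = "done" if passed else "in_progress"
    # State: persist to disk so the next session can pick up where this one stopped.
    path.write_text(json.dumps(features, indent=2))
    return feature
```

In a real project the `commands` list would hold the project's own test, lint, and type-check invocations (for example `[["pytest", "-q"]]`, assuming pytest is in use). The point of the design is that the status only flips to `done` as a side effect of passing checks, never because the agent says so.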
Real-world evidence demonstrates the impact. SWE-bench Verified top agents achieve only ~50–60% pass rates on curated tasks with clear descriptions, and even lower rates on real repositories. OpenAI's million-line experiment involved 3 engineers + Codex producing ~1,500 PRs over 5 months (~3.5 PRs/person/day), but only after investing heavily in harness infrastructure. A team added AGENTS.md to a FastAPI project and saw the same model go from failing all 3 runs to succeeding all 3 runs, with ~60% better context efficiency. A TypeScript + React project went from a 20% success rate (bare repo) to 100% (full harness) without any change to the model.
The article provides a minimal harness template: drop AGENTS.md (operating manual), init.sh (environment health check), feature_list.json (machine-readable scope), and claude-progress.md (session memory) into the project root.
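As a rough sketch of what dropping the four template files into a project root could look like, the Python helper below creates any that are missing. The file names come from the article; the placeholder contents and the `scaffold` function are hypothetical and stand in for whatever the actual template contains.

```python
from pathlib import Path

# Placeholder contents only: these are illustrative assumptions,
# not the text of the article's actual template.
TEMPLATE = {
    "AGENTS.md": (
        "# Operating manual\n\n"
        "1. Run ./init.sh first.\n"
        "2. Read claude-progress.md.\n"
        "3. Work on one feature from feature_list.json at a time.\n"
        "4. Do not mark a feature done until verification passes.\n"
    ),
    "init.sh": (
        "#!/bin/sh\n"
        "# Environment health check: add project-specific checks here.\n"
        "set -e\n"
        'echo "environment ok"\n'
    ),
    "feature_list.json": "[]\n",
    "claude-progress.md": "# Progress\n\n",
}


def scaffold(root):
    """Create any missing harness files under root; never overwrite existing ones."""
    created = []
    root_path = Path(root)
    for name, content in TEMPLATE.items():
        target = root_path / name
        if target.exists():
            continue  # keep whatever the project already has
        target.write_text(content)
        if name.endswith(".sh"):
            target.chmod(0o755)  # make the health check executable
        created.append(name)
    return created
```

Calling `scaffold(".")` from the project root returns the names of the files it created. Skipping existing files keeps the helper safe to re-run on a repository that already has a hand-written AGENTS.md.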
Key facts
- SWE-bench Verified top agents achieve a ~50–60% pass rate on curated tasks; lower on real repositories
- OpenAI's million-line experiment: 3 engineers + Codex → ~1,500 PRs over 5 months (~3.5 PRs/person/day) after heavy harness investment
- A FastAPI project with AGENTS.md went from failing all 3 runs to succeeding all 3 runs with ~60% better context efficiency
- A TypeScript + React project improved from a 20% success rate (bare repo) to 100% (full harness) without changing the model