Apr 20, 2026 · 1 min read · Tutorials & How-To

Harness engineering: building reliable AI coding agents

Harness engineering is building a complete working environment around an AI coding agent—instructions, state tracking, verification, scope constraints, and session lifecycle—to make agents reliable on real codebases, not just benchmarks.

Dev.to #llm·Truong Phung

Composite score: 5.6 out of 10

Weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Developers using AI coding agents can improve reliability and success rates on real codebases by building a structured harness (instructions, state tracking, verification, scope constraints, and session lifecycle) rather than relying on model strength alone.

Summary: our read of the original

Harness engineering is building a complete working environment around an AI coding agent so it produces reliable results. It is not about writing better prompts; it is about designing the system the model operates inside. The article, distilled from the Learn Harness Engineering course by WalkingLabs, synthesizes harness engineering theory and practice from OpenAI, Anthropic, and industry practitioners.

Every effective harness has exactly five subsystems:

01 Instructions: tell the agent what to do, in what order, and what to read first, via files like AGENTS.md and CLAUDE.md.
02 State: track what is done, in progress, and next, persisted to disk via claude-progress.md, feature_list.json, and git log.
03 Verification: only passing tests count as evidence; the agent cannot declare victory without proof from tests, lint, type-check, and smoke runs.
04 Scope: constrain the agent to one feature at a time, with no overreach.
05 Session Lifecycle: initialize at start, clean up at end, and leave a clean restart path via init.sh and handoff notes.
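The State, Scope, and Verification subsystems can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the `{"features": [{"id": ..., "status": ...}]}` layout for feature_list.json, the `"done"` status value, and the function names are all assumptions made for the example.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional


def next_feature(feature_file: Path) -> Optional[dict]:
    """Scope: return the first feature not marked done, so work stays on one item."""
    # Assumed (hypothetical) schema: {"features": [{"id": "...", "status": "..."}]}
    features = json.loads(feature_file.read_text())["features"]
    for feature in features:
        if feature.get("status") != "done":
            return feature
    return None


def verify(checks: list) -> bool:
    """Verification: every check command must exit with code 0 to count as evidence."""
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return False
    return True


def finish_feature(feature_file: Path, progress_file: Path,
                   feature_id: str, checks: list) -> bool:
    """State: record a feature as done only after all checks pass."""
    if not verify(checks):
        return False  # no proof, no "done"
    data = json.loads(feature_file.read_text())
    for feature in data["features"]:
        if feature["id"] == feature_id:
            feature["status"] = "done"
    feature_file.write_text(json.dumps(data, indent=2))
    # Append a handoff line so the next session can see what happened.
    with progress_file.open("a") as fh:
        fh.write(f"- done: {feature_id}\n")
    return True
```

In a real project, `checks` would be whatever the repository already uses, for example `[["pytest", "-q"], ["ruff", "check", "."], ["mypy", "."]]`. The point of the design is that the progress file is only ever updated on the success path of `verify`.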

Real-world evidence demonstrates the impact. SWE-bench Verified top agents achieve only 50–60% pass rates on curated tasks with clear descriptions, and even lower on real repositories. OpenAI's million-line experiment involved 3 engineers + Codex producing ~1,500 PRs over 5 months (~3.5 PRs/person/day), but only after investing heavily in harness infrastructure. A team added AGENTS.md to a FastAPI project and saw the same model go from failing all 3 runs to succeeding all 3 runs, with ~60% better context efficiency. A TypeScript + React project went from a 20% success rate (bare repo) to 100% (full harness), with the model never changed.

The article provides a minimal harness template: drop AGENTS.md (operating manual), init.sh (environment health check), feature_list.json (machine-readable scope), and claude-progress.md (session memory) into the project root.
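As a rough sketch of what dropping the minimal template into a project root could look like, the following Python helper creates the four files if they are missing. The file names come from the article; the placeholder contents, the `TEMPLATE` mapping, and the `scaffold` function are hypothetical and would need to be replaced with project-specific material.

```python
from pathlib import Path

# File names are from the article; the contents are placeholder assumptions.
TEMPLATE = {
    "AGENTS.md": "# Operating manual\n\nWhat to read first, in what order, and how to verify work.\n",
    "init.sh": "#!/bin/sh\n# Environment health check (placeholder: add project commands)\nset -e\n",
    "feature_list.json": '{"features": []}\n',
    "claude-progress.md": "# Session progress\n",
}


def scaffold(root: Path) -> list:
    """Create any missing harness files under root and return the names created."""
    created = []
    root.mkdir(parents=True, exist_ok=True)
    for name, content in TEMPLATE.items():
        path = root / name
        if path.exists():
            continue  # never overwrite an existing harness file
        path.write_text(content)
        if name.endswith(".sh"):
            path.chmod(0o755)  # make the health check executable
        created.append(name)
    return created
```

Calling `scaffold(Path("."))` a second time creates nothing, which keeps the helper safe to run at the start of every session.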

Key facts

01 SWE-bench Verified top agents achieve ~50–60% pass rate on curated tasks; lower on real repositories
02 OpenAI's million-line experiment: 3 engineers + Codex → ~1,500 PRs over 5 months (~3.5 PRs/person/day) after heavy harness investment
03 A FastAPI project with AGENTS.md went from failing all 3 runs to succeeding all 3 runs with ~60% better context efficiency
04 A TypeScript + React project improved from 20% success rate (bare repo) to 100% (full harness) without changing the model
05 Every effective harness has five subsystems: Instructions, State, Verification, Scope, and Session Lifecycle
06 Harness engineering is NOT about better prompts; it is about designing the system the model operates inside
07 Minimal harness template includes AGENTS.md, init.sh, feature_list.json, and claude-progress.md

Topics

#harness-engineering #prompt-engineering #agent-framework #coding-assistant #developer-tools

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 20, 2026 · 11:01 UTC.
