Harness engineering: the system around AI agents matters most
ShipWithAI argues that "harness engineering" — the memory, tools, permissions, hooks, and observability wrapped around an AI model — is the real driver of production reliability, not the model itself.
Developers building AI coding agents should audit their harness beyond `CLAUDE.md` — implementing `PreToolUse` hooks, MCP tools, permission lists, and observability can yield double-digit reliability gains without touching the underlying model.
The post by ShipWithAI frames the evolution of AI development practices across three eras: prompt engineering (2022–2024), context engineering (2025), and harness engineering (2026). While prompt engineering shapes what an agent tries and context engineering shapes what it knows, harness engineering shapes what an agent *can and cannot do*. The article uses a pointed example to illustrate the gap: a `CLAUDE.md` instruction like "Never delete production database tables" is advice that competes with 200K tokens of context and can be ignored, whereas a `PreToolUse` hook that detects `DROP TABLE` in a production environment and exits with code `2` is enforcement that cannot be bypassed.
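The enforcement side of that contrast can be sketched as a small hook script. This is a minimal illustration, not the article's own code: it assumes the pending tool call arrives as JSON with a `tool_input.command` field, and `APP_ENV` is a hypothetical environment variable standing in for whatever marks production in a given setup.

```python
import json
import re
from typing import Mapping

ALLOW = 0
BLOCK = 2  # the exit code the article says blocks the tool call

# Matches DROP TABLE regardless of case or the amount of whitespace.
DROP_TABLE = re.compile(r"\bdrop\s+table\b", re.IGNORECASE)


def run_hook(raw_event: str, env: Mapping[str, str]) -> int:
    """Return the exit code a PreToolUse hook process should use."""
    event = json.loads(raw_event)
    # Assumed payload shape: {"tool_name": ..., "tool_input": {"command": ...}}
    tool_input = event.get("tool_input") or {}
    command = str(tool_input.get("command", ""))
    # APP_ENV is a hypothetical marker for the production environment.
    in_production = env.get("APP_ENV") == "production"
    if in_production and DROP_TABLE.search(command):
        return BLOCK
    return ALLOW


# Demonstration with a hand-built event instead of real stdin.
event = json.dumps(
    {"tool_name": "Bash", "tool_input": {"command": "psql -c 'DROP TABLE users'"}}
)
print(run_hook(event, {"APP_ENV": "production"}))  # 2
print(run_hook(event, {"APP_ENV": "staging"}))     # 0
```

To run this as an actual hook, the script would end with something like `sys.exit(run_hook(sys.stdin.read(), os.environ))` and print a reason to stderr before blocking. The decision is kept in a pure function so it can be tested without spawning an agent.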
LangChain's results on Terminal Bench 2.0, a benchmark of 89 real-world terminal tasks, are the article's central evidence. By adding context injection via `LocalContextMiddleware`, self-verification loops after each action, and optimized compute allocation (the `high` reasoning setting scored 63.6%, while the `xhigh` maximum setting scored only 53.9% due to timeouts), LangChain moved from Top 30 to Top 5 without any model changes.
The post also references Mitchell Hashimoto, creator of Terraform and Ghostty, whose `AGENTS.md` file maps each line to a specific past agent failure that it now prevents. It also cites OpenAI's Codex team, which reportedly shipped roughly one million lines of production code over five months with zero human-written lines, using `AGENTS.md` files, reproducible dev environments, and CI invariants as their harness.
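A self-verification loop of the kind mentioned for LangChain can be pictured roughly as follows. This is a generic sketch and not LangChain's implementation: `act_with_verification`, `StepResult`, and the toy `act`/`verify` callables are all hypothetical names. The `max_attempts` cap reflects the article's point that more compute is not automatically better once timeouts enter the picture.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    output: str
    attempts: int
    verified: bool
    feedback: List[str] = field(default_factory=list)


def act_with_verification(
    act: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> StepResult:
    """Run an action, check it, and retry with the checker's complaint."""
    feedback_log: List[str] = []
    last_feedback: Optional[str] = None
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = act(last_feedback)
        problem = verify(output)  # None means the check passed
        if problem is None:
            return StepResult(output, attempt, True, feedback_log)
        feedback_log.append(problem)
        last_feedback = problem
    # Budget exhausted: report the last output as unverified.
    return StepResult(output, max_attempts, False, feedback_log)


# Toy stand-ins for an agent step and its verifier (hypothetical).
def toy_act(feedback: Optional[str]) -> str:
    return "echo hello" if feedback else "ecoh hello"


def toy_verify(output: str) -> Optional[str]:
    return None if output.startswith("echo ") else "command not found: ecoh"


result = act_with_verification(toy_act, toy_verify)
print(result.attempts, result.verified)  # 2 True
```

The verifier returns a description of the problem rather than a boolean, so the next attempt has something concrete to react to.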
The article defines a five-layer harness model for Claude Code specifically: Layer 1 (Memory via `CLAUDE.md`, `MEMORY.md`, `.claude/commands/`), Layer 2 (Tools via MCP servers), Layer 3 (Permissions via `settings.json` allow/deny lists), Layer 4 (Hooks via `PreToolUse`/`PostToolUse` enforcement), and Layer 5 (Observability via session logs and cost tracking). The post concludes that most developers only have Layer 1, leaving three or more layers of reliability unimplemented.
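One way to check which of these layers a repository actually has is a small audit script. The sketch below rests on assumptions about file locations: `CLAUDE.md` and `settings.json` come from the article, while looking for the settings file under `.claude/`, treating `.mcp.json` as the MCP configuration, and the `audit_harness` name itself are illustrative choices. Layer 5 is left out because the article names no single file that would indicate observability.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def _load_settings(root: Path) -> dict:
    """Read .claude/settings.json, returning {} if missing or malformed."""
    path = root / ".claude" / "settings.json"
    if not path.is_file():
        return {}
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def audit_harness(root: Path) -> Dict[str, bool]:
    """Report which harness layers appear to be configured under root."""
    settings = _load_settings(root)
    permissions = settings.get("permissions")
    hooks = settings.get("hooks")
    if not isinstance(permissions, dict):
        permissions = {}
    if not isinstance(hooks, dict):
        hooks = {}
    return {
        "1-memory": (root / "CLAUDE.md").is_file(),
        "2-tools": (root / ".mcp.json").is_file(),  # assumed location
        "3-permissions": bool(permissions.get("allow") or permissions.get("deny")),
        "4-hooks": bool(hooks.get("PreToolUse") or hooks.get("PostToolUse")),
    }


# Demonstration: a throwaway repo that only has the memory layer.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "CLAUDE.md").write_text("Never delete production tables.\n")
    for layer, present in audit_harness(repo).items():
        print(layer, "ok" if present else "missing")
```

For the throwaway repo in the demonstration, only `1-memory` is reported as present, which is the situation the article says most developers are in.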
Key facts
- Harness engineering is defined as the discipline of building constraints, tools, feedback loops, and observability around an AI agent: everything except the model itself.
- LangChain gained 13.7 benchmark points on Terminal Bench 2.0 (52.8% → 66.5%), moving from Top 30 to Top 5, using the same model (`gpt-5.2-codex`) with only harness changes.
- The three harness changes LangChain made were: context injection via `LocalContextMiddleware`, self-verification loops after each action, and optimized compute allocation.