Apr 20, 2026 · 1 min read · Opinion & Analysis

Harness engineering: the system around AI agents matters most

ShipWithAI argues that "harness engineering" — the memory, tools, permissions, hooks, and observability wrapped around an AI model — is the real driver of production reliability, not the model itself.

Dev.to #claude · ShipWithAI

Composite score: 5.0 out of 10

Weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Developers building AI coding agents should audit their harness beyond `CLAUDE.md` — implementing `PreToolUse` hooks, MCP tools, permission lists, and observability can yield double-digit reliability gains without touching the underlying model.

Summary: our read of the original

The post by ShipWithAI frames the evolution of AI development practices across three eras: prompt engineering (2022–2024), context engineering (2025), and harness engineering (2026). While prompt engineering shapes what an agent tries and context engineering shapes what it knows, harness engineering shapes what an agent *can and cannot do*. The article uses a pointed example to illustrate the gap: a `CLAUDE.md` instruction like "Never delete production database tables" is advice that competes with 200K tokens of context and can be ignored, whereas a `PreToolUse` hook that detects `DROP TABLE` in a production environment and exits with code `2` is enforcement that cannot be bypassed.
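
The enforcement side of that example can be sketched as a small hook script. This is a minimal sketch, not the article's code. It assumes the hook receives the pending tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, and that production is signalled by an `APP_ENV` environment variable, which is a hypothetical name chosen for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a PreToolUse hook that blocks DROP TABLE in production."""
import json
import os
import re
import sys

# Matches "DROP TABLE" case-insensitively, with any whitespace between the words.
DROP_TABLE = re.compile(r"\bdrop\s+table\b", re.IGNORECASE)


def should_block(payload: dict, environment: str) -> bool:
    """Return True when a Bash command contains DROP TABLE in production."""
    if payload.get("tool_name") != "Bash":
        return False
    command = (payload.get("tool_input") or {}).get("command", "")
    return environment == "production" and bool(DROP_TABLE.search(command))


def main() -> int:
    payload = json.load(sys.stdin)  # the pending tool call, as JSON on stdin
    # APP_ENV is a hypothetical variable name; use whatever marks production for you.
    environment = os.environ.get("APP_ENV", "")
    if should_block(payload, environment):
        print("Blocked: DROP TABLE is not allowed in production.", file=sys.stderr)
        return 2  # exit code 2 tells the agent the tool call was blocked
    return 0


# When installed as a hook, end the file with: sys.exit(main())
```

A plain regular expression like this only catches the literal statement. An obfuscated or indirectly generated command would need a stricter check, such as database-level permissions.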

LangChain's results on Terminal Bench 2.0 — a benchmark of 89 real-world terminal tasks — are the article's central evidence. By adding context injection via `LocalContextMiddleware`, self-verification loops after each action, and optimized compute allocation (the `high` reasoning setting scored 63.6%, while the `xhigh` maximum setting scored only 53.9% due to timeouts), LangChain moved from Top 30 to Top 5 without any model changes. The post also references Mitchell Hashimoto, creator of Terraform and Ghostty, whose `AGENTS.md` file maps each line to a specific past agent failure now prevented, and OpenAI's Codex team, which reportedly shipped roughly one million lines of production code over five months with zero human-written lines, using `AGENTS.md` files, reproducible dev environments, and CI invariants as their harness.
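
A self-verification loop of the kind described above might look like the following. This is a generic sketch under stated assumptions, not LangChain's implementation: `run_with_verification`, `StepResult`, `act`, and `verify` are hypothetical names, and the retry limit is an arbitrary default.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    output: str
    verified: bool
    attempts: int


def run_with_verification(
    act: Callable[[str], str],
    verify: Callable[[str], bool],
    task: str,
    max_attempts: int = 3,
) -> StepResult:
    """Run act(task), check the output with verify, and retry on failure."""
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = act(task)
        if verify(output):
            return StepResult(output, True, attempt)
        # Feed the failed output back so the next attempt can correct it.
        task = f"{task}\nPrevious attempt failed verification:\n{output}"
    return StepResult(output, False, max_attempts)
```

Capping the number of attempts matters for the same reason the `xhigh` setting underperformed: an agent that keeps working past its time budget scores nothing for the task.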

The article defines a five-layer harness model for Claude Code specifically: Layer 1 (Memory via `CLAUDE.md`, `MEMORY.md`, `.claude/commands/`), Layer 2 (Tools via MCP servers), Layer 3 (Permissions via `settings.json` allow/deny lists), Layer 4 (Hooks via `PreToolUse`/`PostToolUse` enforcement), and Layer 5 (Observability via session logs and cost tracking). The post concludes that most developers only have Layer 1, leaving three or more layers of reliability unimplemented.
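
A quick way to see which layers a project already has is to look for their files. The sketch below is an assumption-laden audit, not something from the article: it assumes `CLAUDE.md` and `.mcp.json` sit at the project root and that permissions and hooks live under `permissions` and `hooks` keys in `.claude/settings.json`. Observability is left out because the summary names no file convention for it.

```python
import json
from pathlib import Path


def audit_harness(project_root: str) -> dict[str, bool]:
    """Report which harness layers appear to be configured in a project."""
    root = Path(project_root)
    settings_path = root / ".claude" / "settings.json"  # assumed location
    settings: dict = {}
    if settings_path.is_file():
        try:
            loaded = json.loads(settings_path.read_text(encoding="utf-8"))
            if isinstance(loaded, dict):
                settings = loaded
        except json.JSONDecodeError:
            pass  # treat an unreadable settings file as empty
    permissions = settings.get("permissions") or {}
    hooks = settings.get("hooks") or {}
    return {
        "memory": (root / "CLAUDE.md").is_file(),
        "tools": (root / ".mcp.json").is_file(),  # assumed MCP config file
        "permissions": bool(permissions.get("allow") or permissions.get("deny")),
        "hooks": bool(hooks.get("PreToolUse") or hooks.get("PostToolUse")),
    }
```

Calling `audit_harness(".")` in a repository that only has a `CLAUDE.md` would report `memory` as present and the other three layers as missing, which is the situation the post says most developers are in.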

Key facts

01 Harness engineering is defined as the discipline of building constraints, tools, feedback loops, and observability around an AI agent: everything except the model itself.
02 LangChain gained 13.7 benchmark points on Terminal Bench 2.0 (52.8% → 66.5%), moving from Top 30 to Top 5, using the same model (`gpt-5.2-codex`) with only harness changes.
03 The three harness changes LangChain made were: context injection via `LocalContextMiddleware`, self-verification loops after each action, and optimized compute allocation.
04 Counterintuitively, the maximum reasoning budget (`xhigh`) scored only 53.9% because of timeouts, barely above the 52.8% pre-harness baseline, while the `high` setting scored 63.6%.
05 Mitchell Hashimoto's Ghostty `AGENTS.md` file maps each line to a specific past agent failure that is now prevented.
06 OpenAI's Codex team reportedly shipped roughly one million lines of production code over five months with zero human-written lines, using harness components including `AGENTS.md` files and CI invariants.
07 A five-layer harness model is described: Memory, Tools, Permissions, Hooks, and Observability. Most developers implement only Layer 1.

Topics

#harness-engineering #agent-framework #prompt-engineering #observability #production-systems

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 20, 2026 · 13:29 UTC.
