Delegation contracts improve AI agent reviewability, not correctness
A controlled pilot study by Vincent Schmalbach found that explicit software delegation contracts for AI coding agents significantly improved the reviewability of returned work packages but did not improve objective task correctness.
The pilot study's central finding is that explicit delegation contracts improve the reviewability of AI coding agent work, not its correctness. This reframes the contract as a mechanism for human oversight rather than a driver of agent task performance.
Vincent Schmalbach's paper reports a controlled pilot study examining the effects of explicit software delegation contracts on AI coding agent work. The study constructed a dependency-free TypeScript API task environment with seeded defects and documentation gaps, then authored ten tasks across five families. Sixty-four agent executions were run across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored using hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric, yielding 192 reviews in total.
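The design described above can be sketched as a small data model. This is a minimal illustration, not the study's actual schema: every type, field, and identifier below is hypothetical, and the placeholder data only reproduces the stated totals (64 runs, three reviewers per run).

```typescript
// All names are hypothetical; the summary does not publish the study's schema.
type Condition = "issue-prompt" | "contract" | "contract-with-evidence";
type ModelTier = "stronger" | "weaker";

interface ObjectiveScore {
  passedHiddenAcceptance: boolean;
  passedMutationChecks: boolean;
  scopeViolations: number;
}

interface Review {
  reviewerId: string;
  evidenceSufficiency: number; // rubric score on the 5-point scale
}

interface AgentRun {
  runId: number;
  modelTier: ModelTier;
  condition: Condition;
  objective: ObjectiveScore;
  reviews: Review[];
}

const CONDITIONS: Condition[] = ["issue-prompt", "contract", "contract-with-evidence"];
const REVIEWERS: string[] = ["reviewer-a", "reviewer-b", "reviewer-c"];

// Placeholder data: 64 runs, each reviewed by three reviewers.
// Tier and condition assignment here is arbitrary, NOT the study's real allocation,
// and the score values are dummies.
const placeholderRuns: AgentRun[] = Array.from({ length: 64 }, (_, i): AgentRun => ({
  runId: i,
  modelTier: i % 2 === 0 ? "stronger" : "weaker",
  condition: CONDITIONS[i % CONDITIONS.length] as Condition,
  objective: { passedHiddenAcceptance: true, passedMutationChecks: true, scopeViolations: 0 },
  reviews: REVIEWERS.map((reviewerId) => ({ reviewerId, evidenceSufficiency: 3 })),
}));

// Total number of reviews across all runs.
function countReviews(runs: AgentRun[]): number {
  return runs.reduce((total, run) => total + run.reviews.length, 0);
}

// True when every run passed hidden acceptance checks with no scope violations.
function allObjectivelyClean(runs: AgentRun[]): boolean {
  return runs.every(
    (run) => run.objective.passedHiddenAcceptance && run.objective.scopeViolations === 0
  );
}

console.log(countReviews(placeholderRuns)); // 64 runs x 3 reviewers = 192
```

Keeping the objective score separate from the reviews mirrors the study's split between correctness, which is measured by tests, and reviewability, which is judged by reviewers.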
The results drew a sharp distinction between correctness and reviewability. On objective task outcomes, explicit contracts made no difference — all 64 runs passed hidden acceptance checks with zero scope violations. On reviewability, however, contracts had a substantial effect: evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66), and reviewer ambiguity decreased (p = 0.035). Structural artifacts such as changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when the contract explicitly demanded them.
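For readers unfamiliar with the statistics quoted above, the sketch below shows the standard definition of Cliff's delta and a simple two-sided sign test. The summary does not say which significance test the paper used, so the sign test is an assumption chosen for illustration. The function names and sample scores are hypothetical.

```typescript
// Cliff's delta: (pairs where x > y, minus pairs where x < y) / (all pairs).
// Ranges from -1 to 1; 0 means neither group tends to score higher.
function cliffsDelta(xs: number[], ys: number[]): number {
  let greater = 0;
  let less = 0;
  for (const x of xs) {
    for (const y of ys) {
      if (x > y) greater++;
      else if (x < y) less++;
    }
  }
  return (greater - less) / (xs.length * ys.length);
}

// Two-sided sign test on paired comparisons, with ties dropped.
// Under the null hypothesis, each non-tied pair improves with probability 0.5.
function signTestTwoSided(improved: number, worsened: number): number {
  const n = improved + worsened;
  const k = Math.min(improved, worsened);
  let tail = 0;
  let coeff = 1; // binomial coefficient C(n, 0)
  for (let i = 0; i <= k; i++) {
    tail += coeff * Math.pow(0.5, n);
    coeff = (coeff * (n - i)) / (i + 1); // C(n, i + 1) from C(n, i)
  }
  return Math.min(1, 2 * tail);
}

// Illustrative scores only, not data from the study.
const withContract = [3, 4, 5];
const withoutContract = [1, 2, 3];
console.log(cliffsDelta(withContract, withoutContract)); // 8/9, about 0.89

// 22 improved, 0 worsened; the remaining 8 of 30 pairs are treated as ties.
console.log(signTestTwoSided(22, 0)); // 2^-21, about 4.8e-7
```

Under these assumptions, 22 improvements and no regressions give a p-value of roughly 4.8e-7, which is consistent with the reported p < 0.0001, though the paper's own test may differ.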
These gains carried a measurable cost: contracts increased agent token use by 13% and wall-clock time by 38%, with larger overheads for the weaker model tier. The paper concludes that on these small tasks, delegation contracts bought reviewability rather than correctness, a finding that frames the delegation contract as a tool for human oversight rather than a mechanism for improving agent task performance.
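To make the oversight framing concrete, here is a minimal sketch of how a reviewer-side check might flag missing evidence in a returned work package. The four artifacts are the ones this summary names. The `EvidenceBundle` shape, the field names, and the example file path are assumptions for illustration, not the paper's contract format.

```typescript
// Hypothetical evidence bundle built from the artifacts named in the summary.
interface EvidenceBundle {
  changedFiles?: string[];
  knownLimitations?: string;
  residualRisks?: string;
  reviewerChecklist?: string[];
}

const REQUIRED_ARTIFACTS: (keyof EvidenceBundle)[] = [
  "changedFiles",
  "knownLimitations",
  "residualRisks",
  "reviewerChecklist",
];

// Returns the required artifacts that are absent or empty.
function missingArtifacts(bundle: EvidenceBundle): (keyof EvidenceBundle)[] {
  return REQUIRED_ARTIFACTS.filter((key) => {
    const value = bundle[key];
    if (value === undefined) return true;
    // Strings must contain non-whitespace text; lists must have at least one item.
    return typeof value === "string" ? value.trim().length === 0 : value.length === 0;
  });
}

// Example work package (hypothetical content).
const returnedBundle: EvidenceBundle = {
  changedFiles: ["src/routes/users.ts"],
  knownLimitations: "",
  reviewerChecklist: ["Re-run unit tests", "Confirm no files outside scope changed"],
};

console.log(missingArtifacts(returnedBundle)); // ["knownLimitations", "residualRisks"]
```

A check like this only confirms that the evidence is present, not that it is accurate, which matches the study's distinction between reviewability and correctness.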
Key facts
- Study ran 64 agent executions across two model tiers under three conditions: issue-style prompt, explicit delegation contract, and contract with required evidence bundle.
- All 64 runs passed hidden acceptance checks with zero scope violations, regardless of condition.
- Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66).
- Reviewer ambiguity decreased with explicit contracts (p = 0.035).
- Structural artifacts like changed-file lists, known-limitations sections, and reviewer checklists appeared mostly or only when demanded by the contract.
- Delegation contracts cost +13% agent tokens and +38% wall-clock time, with larger overhead for the weaker model tier.
- 192 total reviews were conducted by three independent condition-blinded model-based reviewers using a fixed rubric.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.