Jun 14, 2026 · 1 min read · Research Papers

Delegation contracts improve AI agent reviewability, not correctness

A controlled pilot study by Vincent Schmalbach found that explicit software delegation contracts for AI coding agents significantly improved the reviewability of returned work packages but did not improve objective task correctness.

arXiv · Vincent Schmalbach


Composite

6.2 out of 10

Novelty · 25%

Impact · 43%

Credibility · 12%

Depth · 20%

Weights applied.

Why it matters

The study establishes that explicit delegation contracts improve the reviewability of AI coding agent work — not its correctness — reframing the contract as a mechanism for human oversight rather than a driver of agent task performance.


Summary: our read of the original

Vincent Schmalbach's paper reports a controlled pilot study examining the effects of explicit software delegation contracts on AI coding agent work. The study constructed a dependency-free TypeScript API task environment with seeded defects and documentation gaps, then authored ten tasks across five families. Sixty-four agent executions were run across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored using hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric, yielding 192 reviews in total.

The results drew a sharp distinction between correctness and reviewability. On objective task outcomes, explicit contracts made no difference — all 64 runs passed hidden acceptance checks with zero scope violations. On reviewability, however, contracts had a substantial effect: evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66), and reviewer ambiguity decreased (p = 0.035). Structural artifacts such as changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when the contract explicitly demanded them.
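The reported figures can be sanity-checked with a little arithmetic. The summary does not say which test produced p < 0.0001, so the sketch below substitutes an exact two-sided sign test as an assumption. With 22 improvements, 0 regressions, and the remaining 8 pairs tied, it gives a p-value consistent with the reported bound. The second function shows how Cliff's delta is defined. The scores passed to it are hypothetical and are not meant to reproduce the paper's 0.66.

```python
from math import comb


def sign_test_two_sided(improved: int, worsened: int) -> float:
    """Exact two-sided sign test on paired comparisons (ties are dropped)."""
    n = improved + worsened
    k = min(improved, worsened)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cliffs_delta(xs, ys) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Counts taken from the summary: 22 of 30 pairs improved, none worsened.
p = sign_test_two_sided(improved=22, worsened=0)
print(f"sign test p = {p:.2e}")  # sign test p = 4.77e-07

# Hypothetical 5-point evidence-sufficiency scores, for illustration only.
contract_scores = [4, 4, 5, 3, 4]
prompt_scores = [3, 3, 4, 3, 4]
print(cliffs_delta(contract_scores, prompt_scores))  # 0.48
```

A delta of 0 means the two groups overlap completely and 1 means every score in the first group exceeds every score in the second, so the reported 0.66 indicates that contract runs scored higher in most cross-group comparisons.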

These gains carried a measurable cost: contracts increased agent token usage by 13% and wall-clock time by 38%, with larger effects observed for the weaker model tier. The paper concludes that on these small tasks, delegation contracts bought reviewability rather than correctness. That finding frames the delegation contract as a tool for human oversight rather than a mechanism for improving agent task performance.
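Since the structural artifacts appeared mostly when the contract demanded them, a reviewer could check for their presence mechanically before reading a returned work package. The following is a minimal sketch of such a check, not the paper's rubric. The heading names, the Markdown-style `#` heading format, and the `missing_sections` helper are all assumptions made for illustration.

```python
import re

# Hypothetical section titles, modeled on the artifacts named in the summary.
REQUIRED_SECTIONS = (
    "Changed files",
    "Known limitations",
    "Residual risk",
    "Reviewer checklist",
)


def missing_sections(report: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section titles that have no matching heading."""
    # Collect every Markdown-style heading line, lowercased for comparison.
    headings = {
        match.group(1).lower()
        for match in re.finditer(
            r"^#+[ \t]*(.+?)[ \t]*$", report, flags=re.MULTILINE
        )
    }
    return [name for name in required if name.lower() not in headings]


# A work package that includes only two of the four assumed sections.
report = "## Changed files\n- src/api.ts\n\n## Known limitations\n- none\n"
print(missing_sections(report))  # ['Residual risk', 'Reviewer checklist']
```

A check like this only confirms that a section exists. Whether its content is sufficient is still the reviewer's call, which is what the evidence-sufficiency rubric in the study measures.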

Key facts

01 Study ran 64 agent executions across two model tiers under three conditions: issue-style prompt, explicit delegation contract, and contract with required evidence bundle.
02 All 64 runs passed hidden acceptance checks with zero scope violations, regardless of condition.
03 Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66).
04 Reviewer ambiguity decreased with explicit contracts (p = 0.035).
05 Structural artifacts like changed-file lists, known-limitations sections, and reviewer checklists appeared mostly or only when demanded by the contract.
06 Delegation contracts cost +13% agent tokens and +38% wall-clock time, with larger overhead for the weaker model tier.
07 192 total reviews were conducted by three independent condition-blinded model-based reviewers using a fixed rubric.

Topics

#agent-framework #code-generation #benchmarks #safety #developer-tools

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.
