Jun 16, 2026 · 1 min read · Research Papers

Coding benchmarks are misaligned with agentic software engineering

A paper by Gorinova, Baker, and Heineike argues that current coding benchmarks were designed in a pre-agent era and fail to evaluate the composite systems that modern coding agents actually are.

arXiv · Maria I. Gorinova, Macey Baker, Amy Heineike

Composite score: 6.1 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Because any single harness component can move benchmark scores by margins comparable to those between adjacent model generations, end-to-end scores can misattribute performance gains and mislead practitioners trying to improve agentic systems.

Summary: our read of the original

Gorinova, Baker, and Heineike contend that the benchmarks currently used to compare coding agents were designed before the agentic era and are therefore poorly suited to evaluating modern systems. The central problem is that a coding agent in practice is not a model in isolation — it is a composite of models, harnesses, contexts, environments, and feedback signals. Yet existing benchmarks collapse all of these components into a single end-to-end score, typically computed against one reference solution, providing no component-level signal for iteration. The paper notes that any one of these components can shift a benchmark score by margins comparable to those between adjacent model generations, meaning scores can mislead practitioners about where improvements actually originate.

The authors identify three specific symptoms of this misalignment. First, benchmark scores conflate the model with the rest of the harness, making it impossible to attribute performance to any individual component. Second, grading against a single reference solution penalizes equally valid alternative solutions. Third, the absence of signal at the level of individual harness components makes the end-to-end score difficult to use as a guide for iterative improvement.
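The second symptom can be illustrated with a small sketch. It is not taken from the paper: the toy task and the function names `grade_by_reference` and `grade_by_behavior` are hypothetical, and this summary does not say how any particular benchmark grades. The sketch only contrasts grading against one reference text with grading by observable behavior.

```python
# Toy illustration only; names and task are assumptions, not from the paper.

REFERENCE = "def add(a, b):\n    return a + b\n"
ALTERNATIVE = "def add(a, b):\n    return sum((a, b))\n"


def grade_by_reference(candidate: str, reference: str = REFERENCE) -> bool:
    """Pass only if the candidate text matches the single reference solution."""
    return candidate.strip() == reference.strip()


def grade_by_behavior(candidate: str) -> bool:
    """Pass if the candidate's add() gives correct results on a few inputs."""
    namespace: dict = {}
    exec(candidate, namespace)  # acceptable for a toy; never exec untrusted code
    add = namespace["add"]
    return all(add(a, b) == a + b for a, b in [(0, 0), (1, 2), (-3, 5)])


if __name__ == "__main__":
    for name, source in [("reference", REFERENCE), ("alternative", ALTERNATIVE)]:
        print(name, grade_by_reference(source), grade_by_behavior(source))
```

Under these assumptions, the alternative solution fails the reference match but passes the behavioral check, which is the kind of penalty on an equally valid solution that the summary describes.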

Key facts

01 Coding agents are composite systems (models, harnesses, contexts, environments, and feedback signals), not standalone models.
02 Current benchmarks collapse all components into a single end-to-end score with no component-level signal.
03 Any single harness component can shift a benchmark score by margins comparable to those between adjacent model generations.
04 Scores are typically computed against one reference solution, penalizing equally valid alternatives.
05 The paper identifies three symptoms: score conflation, single-reference grading, and lack of per-component signal.
06 The authors argue the absence of component-level signal makes end-to-end scores difficult to iterate on.

Topics

#benchmarks #agent-framework #agentic-coding #evaluation #research

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.
