Coding benchmarks are misaligned with agentic software engineering
A paper by Gorinova, Baker, and Heineike argues that current coding benchmarks were designed in a pre-agent era and fail to evaluate the composite systems that modern coding agents actually are.
Because any single harness component can move benchmark scores by margins comparable to those between adjacent model generations, end-to-end scores can misattribute performance gains and mislead practitioners trying to improve agentic systems.
Gorinova, Baker, and Heineike contend that the benchmarks currently used to compare coding agents were designed before the agentic era and are therefore poorly suited to evaluating modern systems. The central problem is that a coding agent in practice is not a model in isolation — it is a composite of models, harnesses, contexts, environments, and feedback signals. Yet existing benchmarks collapse all of these components into a single end-to-end score, typically computed against one reference solution, providing no component-level signal for iteration. The paper notes that any one of these components can shift a benchmark score by margins comparable to those between adjacent model generations, meaning scores can mislead practitioners about where improvements actually originate.
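The attribution problem described above can be made concrete with a small sketch. It is not taken from the paper: the model names, harness names, and per-task outcomes below are invented purely for illustration, and the functions `pass_rate`, `model_effect`, and `harness_effect` are hypothetical helpers. The idea is to run every model under every harness and then measure how much the score moves when only one component is swapped.

```python
from statistics import mean

# Hypothetical per-task outcomes (1 = task solved, 0 = not solved) for each
# (model, harness) pair. These numbers are made up for illustration only;
# they are not results from the paper or from any real benchmark.
RESULTS = {
    ("model_a", "harness_x"): [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    ("model_a", "harness_y"): [1, 1, 0, 1, 0, 1, 1, 0, 0, 0],
    ("model_b", "harness_x"): [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    ("model_b", "harness_y"): [1, 1, 1, 1, 0, 1, 1, 0, 1, 0],
}


def pass_rate(model: str, harness: str) -> float:
    """End-to-end score for one configuration: fraction of tasks solved."""
    outcomes = RESULTS[(model, harness)]
    return sum(outcomes) / len(outcomes)


def model_effect(old: str, new: str) -> float:
    """Average score change from swapping the model, harness held fixed."""
    harnesses = sorted({h for _, h in RESULTS})
    return mean([pass_rate(new, h) - pass_rate(old, h) for h in harnesses])


def harness_effect(old: str, new: str) -> float:
    """Average score change from swapping the harness, model held fixed."""
    models = sorted({m for m, _ in RESULTS})
    return mean([pass_rate(m, new) - pass_rate(m, old) for m in models])


if __name__ == "__main__":
    for (model, harness) in sorted(RESULTS):
        print(f"{model} + {harness}: {pass_rate(model, harness):.2f}")
    print(f"model effect (a -> b):   {model_effect('model_a', 'model_b'):+.2f}")
    print(f"harness effect (x -> y): {harness_effect('harness_x', 'harness_y'):+.2f}")
```

In this toy data the harness swap moves the score as much as the model swap does, and the older model on the better harness ties the newer model on the weaker one. A leaderboard that reported only one row per model would hide both facts, which is the kind of misattribution the summary describes.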
The authors identify three specific symptoms of this misalignment. First, benchmark scores conflate the model with the rest of the harness, making it impossible to attribute performance to any individual component. Second, grading against a single reference solution penalizes equally valid alternative solutions. Third, the absence of signal at the level of individual harness components makes the end-to-end score difficult to use as a guide for iterative improvement.
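The second symptom, single-reference grading, can be illustrated with a deliberately simplified sketch. The summary does not say how any particular benchmark compares a submission with its reference, so the exact-match grader below is an assumed stand-in for "grading against one reference solution", and `grade_by_reference`, `grade_by_behavior`, the `dedupe` task, and the test cases are all hypothetical.

```python
# Hypothetical task: remove duplicates from a list while preserving order.
REFERENCE = """
def dedupe(items):
    return list(dict.fromkeys(items))
"""

# A different but equally valid solution to the same task.
CANDIDATE = """
def dedupe(items):
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out
"""

# Behavioural specification as (arguments, expected result) pairs.
CASES = [
    (([1, 1, 2, 3, 2],), [1, 2, 3]),
    (([],), []),
    ((["b", "a", "b"],), ["b", "a"]),
]


def grade_by_reference(candidate_src: str, reference_src: str) -> bool:
    """Assumed stand-in for single-reference grading: exact textual match."""
    return candidate_src.strip() == reference_src.strip()


def grade_by_behavior(candidate_src: str, func_name: str, cases) -> bool:
    """Grade on observable behaviour instead of similarity to one solution.

    exec() runs arbitrary code, so a real grader would need a sandbox.
    """
    namespace = {}
    exec(candidate_src, namespace)
    func = namespace[func_name]
    return all(func(*args) == expected for args, expected in cases)


if __name__ == "__main__":
    print("matches reference:", grade_by_reference(CANDIDATE, REFERENCE))
    print("passes behaviour checks:", grade_by_behavior(CANDIDATE, "dedupe", CASES))
```

The candidate fails the reference comparison yet satisfies every behavioural case, which is the situation the authors describe as penalizing an equally valid alternative.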
Key facts
- Coding agents are composite systems (models, harnesses, contexts, environments, and feedback signals), not standalone models.
- Current benchmarks collapse all components into a single end-to-end score with no component-level signal.
- Any single harness component can shift a benchmark score by margins comparable to those between adjacent model generations.
- Scores are typically computed against one reference solution, penalizing equally valid alternatives.
- The paper identifies three symptoms: score conflation, single-reference grading, and lack of per-component signal.
- The authors argue the absence of component-level signal makes end-to-end scores difficult to iterate on.
Summary and scoring are generated automatically from the original article. Last processed Jun 17, 2026 · 10:39 UTC.