Apr 19, 2026 · 2 min read · Opinion & Analysis

Agent scaffolding, not model choice, drives coding benchmark performance

Michael Tuszynski argues that agent scaffolding — tool definitions, context management, error recovery, and feedback loops — explains far more performance variance than model selection, with the same model showing a 9.5-point spread across frameworks on SWE-bench Pro.

Dev.to #llm · Michael Tuszynski

Composite score: 6.0 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Teams evaluating AI coding tools should benchmark agent frameworks head-to-head on the same model rather than comparing models across frameworks, since scaffolding improvements can move performance by twenty or more points while model upgrades at the frontier yield roughly one.

Summary: our read of the original

Michael Tuszynski opens with a striking benchmark observation: six frontier coding models now score within 0.8 points of each other on SWE-bench Verified, while the same model wrapped in different agent frameworks swings nearly ten points on SWE-bench Pro. He frames this using Birgitta Böckeler's formulation from martinfowler.com — "Agent = Model + Scaffolding" — where scaffolding encompasses tool definitions, context compaction, error recovery logic, feedback sensors, system prompts, and inter-session memory.

Particula Tech's analysis, cited by Tuszynski, ran four frameworks using identical Claude Opus 4.5 weights against SWE-bench Pro, producing scores from 45.9% (SEAL) to 55.4% (Claude Code) — a 9.5-point spread from scaffolding alone. More pointedly, Meta and Harvard's Confucius Code Agent running Sonnet 4.5 scored 52.7%, beating Opus 4.5 on Anthropic's own stock framework at 52.0%. Single-variable experiments reinforce the pattern: Grok Code Fast went from 6.7% to 68.3% by changing only the edit tool format; LangChain's agent moved from 52.8% to 66.5% on Terminal Bench 2.0 via improved task decomposition; and adding WarpGrep as a specialized search subagent added 2.1 to 3.7 points across every model tested while cutting cost 15.6% and runtime 28%.
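The arithmetic behind these comparisons is easy to check directly. The Python sketch below groups the scores quoted above by model and computes the spread across frameworks, then the gain from each single-variable change. The function names and data layout are illustrative assumptions, not taken from Tuszynski's article or Particula Tech's analysis. Only the two Opus 4.5 scores named in this summary are included, since the other two frameworks' scores are not given here.

```python
from collections import defaultdict

# (model, framework, SWE-bench Pro score in percent), as quoted in this summary.
RESULTS = [
    ("Claude Opus 4.5", "SEAL", 45.9),
    ("Claude Opus 4.5", "Claude Code", 55.4),
]

# Single-variable experiments: label -> (score before, score after).
BEFORE_AFTER = {
    "Grok Code Fast, edit tool format": (6.7, 68.3),
    "LangChain agent, Terminal Bench 2.0": (52.8, 66.5),
}


def spread_by_model(results):
    """Return max-minus-min score per model, holding the model fixed."""
    scores = defaultdict(list)
    for model, _framework, score in results:
        scores[model].append(score)
    # Round to one decimal to avoid floating-point noise in the difference.
    return {model: round(max(s) - min(s), 1) for model, s in scores.items()}


def gains(before_after):
    """Return the point gain for each single-variable change."""
    return {label: round(after - before, 1)
            for label, (before, after) in before_after.items()}


if __name__ == "__main__":
    print(spread_by_model(RESULTS))
    print(gains(BEFORE_AFTER))
```

Grouping by model before taking the spread is the point of the exercise: it isolates the variance attributable to the framework rather than to the weights.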

Tuszynski also highlights published scaffolding work from both major labs. Anthropic's "Effective harnesses for long-running agents" describes a two-agent pattern — an initializer that writes structured JSON, an `init.sh` script, and a progress file, and a coding agent that reads that file, commits one feature at a time, and updates progress — designed to prevent context-loss failures mid-implementation. OpenAI's writeup describes three engineers shipping a million lines of code across 1,500 PRs in five months with no manually-written code, by designing the environment: wiring Chrome DevTools into the agent runtime, exposing logs via LogQL and metrics via PromQL, and making the app bootable per git worktree. The article concludes that model upgrades at the frontier buy roughly one benchmark point, while scaffolding improvements can buy twenty or more, and that a well-scaffolded Sonnet beats a poorly-scaffolded Opus at about a fifth the cost.
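The two-agent pattern described above can be illustrated with a minimal Python sketch. This is an interpretation under stated assumptions, not Anthropic's implementation: the progress-file schema, the function names, and the `implement` callback (a stand-in for the coding agent doing the work and committing it) are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional


def initialize(progress_path: Path, features: List[str]) -> None:
    """Initializer step: record every planned feature as pending (assumed schema)."""
    state = {"features": [{"name": n, "status": "pending"} for n in features]}
    progress_path.write_text(json.dumps(state, indent=2))


def next_pending(progress_path: Path) -> Optional[str]:
    """Read the progress file and return the first feature not yet done."""
    state = json.loads(progress_path.read_text())
    for feature in state["features"]:
        if feature["status"] == "pending":
            return feature["name"]
    return None


def mark_done(progress_path: Path, name: str) -> None:
    """Persist completion so a later session can resume without prior context."""
    state = json.loads(progress_path.read_text())
    for feature in state["features"]:
        if feature["name"] == name:
            feature["status"] = "done"
    progress_path.write_text(json.dumps(state, indent=2))


def run_session(progress_path: Path,
                implement: Callable[[str], None]) -> Optional[str]:
    """One coding-agent session: handle exactly one feature, then update progress."""
    name = next_pending(progress_path)
    if name is None:
        return None  # nothing left to do
    implement(name)  # hypothetical: the agent writes and commits the feature
    mark_done(progress_path, name)
    return name
```

Because all state lives in the file rather than in the agent's context window, each session can start cold, which is the failure mode the pattern is meant to address.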

Key facts

01 Six frontier coding models score within 0.8 points of each other on SWE-bench Verified, while the same Claude Opus 4.5 model spans 45.9% to 55.4% across four agent frameworks on SWE-bench Pro, a 9.5-point spread.
02 Meta and Harvard's Confucius Code Agent running Sonnet 4.5 scored 52.7% on SWE-bench Pro, beating Opus 4.5 on Anthropic's own framework at 52.0%: a cheaper model with better scaffolding outperformed the flagship.
03 Grok Code Fast jumped from 6.7% to 68.3% on coding benchmarks by changing only the edit tool format, with no model or prompt changes.
04 LangChain's coding agent improved from 52.8% to 66.5% on Terminal Bench 2.0 through better task decomposition and tool use, with no model swap.
05 Adding WarpGrep as a specialized search subagent added 2.1 to 3.7 points across every model tested while cutting cost 15.6% and runtime 28%.
06 OpenAI's agent-first engineering writeup describes three engineers shipping a million lines of code across 1,500 PRs in five months with no manually-written code, by designing the agent environment rather than the code itself.
07 The article identifies key scaffolding components with measurable impact: tool orchestration, context management, error recovery, deterministic feedback sensors, and planning-execution separation.
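Two of the components listed above, error recovery and deterministic feedback sensors, can be read together as a bounded retry loop. The Python sketch below is one interpretation, not the article's method. `run_with_feedback`, `syntax_sensor`, and the `propose` callback (a stand-in for a model call) are hypothetical names, and a syntax check via Python's built-in `compile()` stands in for whatever deterministic check a real harness would run, such as a test suite or linter.

```python
from typing import Callable, Optional, Tuple


def syntax_sensor(source: str) -> Optional[str]:
    """Deterministic check: return an error message, or None if the source parses."""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg} (line {exc.lineno})"


def run_with_feedback(
    propose: Callable[[Optional[str]], str],
    sensor: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Ask for a candidate, check it, and feed any error back until it passes."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)  # hypothetical model call
        feedback = sensor(candidate)
        if feedback is None:
            return candidate, attempt
    raise RuntimeError(f"no passing candidate after {max_attempts} attempts: {feedback}")
```

The sensor is deterministic, so the same candidate always produces the same feedback, and the attempt cap keeps a failing run from looping indefinitely.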

Topics

#agent-framework #benchmarks #prompt-engineering #tool-use #reasoning

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 19, 2026 · 23:48 UTC.
