Agent scaffolding, not model choice, drives coding benchmark performance
Michael Tuszynski argues that agent scaffolding — tool definitions, context management, error recovery, and feedback loops — explains far more performance variance than model selection, with the same model showing a 9.5-point spread across frameworks on SWE-bench Pro.
Teams evaluating AI coding tools should benchmark agent frameworks head-to-head on the same model rather than comparing models across frameworks, since scaffolding improvements can move performance by twenty or more points while model upgrades at the frontier yield roughly one.
- Six frontier coding models score within 0.8 points of each other on SWE-bench Verified, while the same Claude Opus 4.5 model spans 45.9% to 55.4% across four agent frameworks on SWE-bench Pro, a 9.5-point spread.
- Meta and Harvard's Confucius Code Agent running Sonnet 4.5 scored 52.7% on SWE-bench Pro, beating Opus 4.5 on Anthropic's own framework at 52.0%: a cheaper model with better scaffolding outperformed the flagship.
- Grok Code Fast jumped from 6.7% to 68.3% on coding benchmarks by changing only the edit tool format, with no model or prompt changes.
Michael Tuszynski opens with a striking benchmark observation: six frontier coding models now score within 0.8 points of each other on SWE-bench Verified, while the same model wrapped in different agent frameworks swings nearly ten points on SWE-bench Pro. He frames this using Birgitta Böckeler's formulation from martinfowler.com — "Agent = Model + Scaffolding" — where scaffolding encompasses tool definitions, context compaction, error recovery logic, feedback sensors, system prompts, and inter-session memory.
Particula Tech's analysis, cited by Tuszynski, ran four frameworks using identical Claude Opus 4.5 weights against SWE-bench Pro, producing scores from 45.9% (SEAL) to 55.4% (Claude Code): a 9.5-point spread from scaffolding alone. More pointedly, Meta and Harvard's Confucius Code Agent running Sonnet 4.5 scored 52.7%, beating Opus 4.5 on Anthropic's own stock framework at 52.0%. Single-variable experiments reinforce the pattern: Grok Code Fast went from 6.7% to 68.3% by changing only the edit tool format; LangChain's agent moved from 52.8% to 66.5% on Terminal Bench 2.0 via improved task decomposition; and adding WarpGrep as a specialized search subagent gained 2.1 to 3.7 points across every model tested while cutting cost by 15.6% and runtime by 28%.
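The spreads quoted above are plain differences in percentage points. As a quick check, the sketch below recomputes them using only the figures cited in this summary; the variable names and the `point_delta` helper are illustrative, not taken from the article.

```python
# Scores quoted in the summary, in percent.
OPUS_LOWEST_FRAMEWORK = 45.9    # Opus 4.5 on SEAL, SWE-bench Pro
OPUS_HIGHEST_FRAMEWORK = 55.4   # Opus 4.5 on Claude Code, SWE-bench Pro
CONFUCIUS_SONNET = 52.7         # Sonnet 4.5 on Confucius Code Agent
OPUS_STOCK = 52.0               # Opus 4.5 on Anthropic's stock framework


def point_delta(before: float, after: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


framework_spread = point_delta(OPUS_LOWEST_FRAMEWORK, OPUS_HIGHEST_FRAMEWORK)
sonnet_over_opus = point_delta(OPUS_STOCK, CONFUCIUS_SONNET)
grok_edit_format_gain = point_delta(6.7, 68.3)   # edit tool format change only
langchain_gain = point_delta(52.8, 66.5)         # Terminal Bench 2.0

if __name__ == "__main__":
    print(f"Same model, different frameworks: {framework_spread} points")
    print(f"Scaffolded Sonnet over stock Opus: {sonnet_over_opus} points")
    print(f"Grok Code Fast edit-format gain: {grok_edit_format_gain} points")
    print(f"LangChain decomposition gain: {langchain_gain} points")
```

Set against the 0.8-point gap between frontier models on SWE-bench Verified, every one of these scaffolding deltas except the Sonnet-versus-Opus margin is an order of magnitude larger.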
Tuszynski also highlights published scaffolding work from both major labs. Anthropic's "Effective harnesses for long-running agents" describes a two-agent pattern: an initializer that writes structured JSON, an `init.sh` script, and a progress file, plus a coding agent that reads that file, commits one feature at a time, and updates progress. The pattern is designed to prevent context-loss failures mid-implementation. OpenAI's writeup describes three engineers shipping a million lines of code across 1,500 PRs in five months with no manually written code, by designing the environment: wiring Chrome DevTools into the agent runtime, exposing logs via LogQL and metrics via PromQL, and making the app bootable per git worktree. The article concludes that model upgrades at the frontier buy roughly one benchmark point, while scaffolding improvements can buy twenty or more, and that a well-scaffolded Sonnet beats a poorly scaffolded Opus at about a fifth of the cost.
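The two-agent pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the file names `features.json` and `progress.json`, the JSON layout, and the `initialize` and `run_coding_session` functions are all hypothetical, and the `implement` callback stands in for the model editing code and committing.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

FEATURES_FILE = "features.json"   # hypothetical name for the structured JSON
PROGRESS_FILE = "progress.json"   # hypothetical name for the progress file


def initialize(workdir: Path, features: list[str]) -> None:
    """Initializer agent: write the feature list, a setup script, and a progress file."""
    (workdir / FEATURES_FILE).write_text(json.dumps({"features": features}, indent=2))
    # Placeholder script body; a real init.sh would set up the environment.
    (workdir / "init.sh").write_text("#!/bin/sh\n# environment setup would go here\n")
    (workdir / PROGRESS_FILE).write_text(json.dumps({"completed": []}, indent=2))


def run_coding_session(workdir: Path, implement: Callable[[str], None]) -> Optional[str]:
    """Coding agent: read progress, finish exactly one pending feature, record it."""
    features = json.loads((workdir / FEATURES_FILE).read_text())["features"]
    progress_path = workdir / PROGRESS_FILE
    progress = json.loads(progress_path.read_text())
    pending = [f for f in features if f not in progress["completed"]]
    if not pending:
        return None  # nothing left to do
    feature = pending[0]
    implement(feature)  # stand-in for "edit code and commit this one feature"
    progress["completed"].append(feature)
    progress_path.write_text(json.dumps(progress, indent=2))
    return feature


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        initialize(workdir, ["login", "search"])
        # Each call models a fresh session with no memory beyond the files on disk.
        while (finished := run_coding_session(workdir, lambda name: None)) is not None:
            print("finished", finished)
```

Because each session rebuilds its state from the files on disk, a context reset between sessions loses nothing that matters, which is the failure mode the pattern is meant to prevent.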