Teams evaluating AI coding tools should benchmark agent frameworks head-to-head on the same model rather than comparing models across frameworks. Scaffolding improvements can move performance by twenty or more points, while model upgrades at the frontier yield roughly one point.
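
One way to make this comparison concrete is to record a score for every (model, framework) pair and then measure the spread along each axis separately. The following is a minimal sketch under stated assumptions, not a prescribed evaluation harness: the `effect_ranges` function, the `Scores` type, and the dictionary layout are hypothetical names introduced here for illustration, and the scores are assumed to be on a single benchmark with a common scale.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (model name, framework name) -> benchmark score in points.
Scores = Dict[Tuple[str, str], float]


def _max_spread(groups: Dict[str, List[float]]) -> float:
    # Largest max-minus-min gap within any group that holds at least two scores.
    spreads = [max(v) - min(v) for v in groups.values() if len(v) >= 2]
    return max(spreads) if spreads else 0.0


def effect_ranges(scores: Scores) -> Dict[str, float]:
    """Separate the framework effect from the model effect.

    framework_effect: biggest gap between frameworks run on the SAME model.
    model_effect: biggest gap between models run in the SAME framework.
    """
    by_model: Dict[str, List[float]] = defaultdict(list)
    by_framework: Dict[str, List[float]] = defaultdict(list)
    for (model, framework), score in scores.items():
        by_model[model].append(score)          # holds the model fixed
        by_framework[framework].append(score)  # holds the framework fixed
    return {
        "framework_effect": _max_spread(by_model),
        "model_effect": _max_spread(by_framework),
    }
```

Passing in a team's own measured scores returns the two spreads side by side. If `framework_effect` is much larger than `model_effect`, the scaffolding rather than the model choice is driving most of the observed difference.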