Systematic reward hackability at this scale means that frontier models trained or evaluated on SWE-bench Verified and R2E-Gym may be earning inflated Pass@1 scores on a measurable fraction of tasks, which undermines the reliability of these benchmarks as signals of true coding ability.