Study finds 28.5% of sampled SWE-bench Verified tasks hackable by incorrect patches
A paper by Shreshth Rajan measures how often code RL training environments accept incorrect solutions as correct, finding that over a quarter of sampled SWE-bench Verified tasks have test suites weak enough to be fooled by a Docker-verified wrong patch.
Systematic reward hackability at this scale means frontier models trained or evaluated on SWE-bench Verified and R2E-Gym may be earning inflated Pass@1 scores on a measurable fraction of tasks, undermining the reliability of these benchmarks as signals of true coding ability.
Shreshth Rajan's paper quantifies reward hackability in code RL training environments by measuring how frequently benchmark test suites accept Docker-verified incorrect patches as correct. On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough to pass an incorrect patch; on 20 R2E-Gym tasks across 6 repositories, the same single-shot exploit generation pipeline yields a 25.0% hackability rate.
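The hackability rate described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `TaskAudit`, `is_hackable`, and `hackability_rate` are hypothetical names, and the per-patch results are assumed to have been collected already (in the paper, by verifying candidate patches in Docker). The task list is toy data chosen only to reproduce the arithmetic behind a 25.0% rate on 20 tasks.

```python
from dataclasses import dataclass, field


@dataclass
class TaskAudit:
    """Hypothetical record of one benchmark task's audit results."""
    task_id: str
    # One entry per known-incorrect patch: True if the task's tests passed it.
    incorrect_patch_passed: list[bool] = field(default_factory=list)


def is_hackable(audit: TaskAudit) -> bool:
    """A task counts as hackable if any incorrect patch passes its tests."""
    return any(audit.incorrect_patch_passed)


def hackability_rate(audits: list[TaskAudit]) -> float:
    """Fraction of audited tasks whose test suite accepts an incorrect patch."""
    if not audits:
        raise ValueError("no audited tasks")
    return sum(is_hackable(a) for a in audits) / len(audits)


# Toy data only: 5 flagged tasks out of 20 gives a 25.0% rate.
audits = [TaskAudit(f"task-{i}", [i < 5]) for i in range(20)]
print(f"{hackability_rate(audits):.1%}")  # prints 25.0%
```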
To assess the downstream impact on model evaluation, the paper conducts a random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified. Within the same human-rated difficulty stratum, model Pass@1 is +14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6; I^2 = 0%; 123 of 134 models show a positive effect), indicating that inflated scores on hackable tasks are a systematic, not incidental, phenomenon.
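For readers unfamiliar with random-effects pooling, the sketch below shows one common estimator, DerSimonian-Laird. The summary does not say which estimator the paper uses, so this choice is an assumption, as are the function name `random_effects_pool` and the toy inputs. Each input effect stands for one model's Pass@1 gap (hackable minus robust, in percentage points), assumed to have been computed within a difficulty stratum beforehand.

```python
import math


def random_effects_pool(effects, variances):
    """Pool per-model effects with the DerSimonian-Laird estimator.

    effects:   per-model Pass@1 gap in percentage points (assumed input).
    variances: sampling variance of each effect (assumed input).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")

    # Fixed-effect (inverse-variance) estimate, used to compute Cochran's Q.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Between-model variance tau^2, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each model's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))

    # I^2: share of total variation attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


# Toy inputs, not the paper's data.
result = random_effects_pool([12.0, 14.0, 16.0], [4.0, 4.0, 4.0])
print(result)
```

An I^2 of 0%, as reported in the paper, corresponds to the case where Q does not exceed its degrees of freedom, so the between-model variance is estimated as zero and the random-effects result coincides with the fixed-effect one.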
The paper then describes a hardening procedure to address broken tasks. It pairs an inline LLM judge with a Docker gold-sanity gate, which runs each generated test case against the gold solution before the judge is consulted. On the 11 broken tasks in the audit, this gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate that the LLM judge alone misses. With diversity-biased retry, the loop converges 9 of 11 tasks to a gated upgrade.
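The ordering of gate and judge can be sketched as below. This is an illustration under stated assumptions rather than the paper's implementation: `gold_sanity_gate`, `passes_on_gold`, and `judge_accepts` are hypothetical names, the two callables stand in for a Docker run against the gold patch and an LLM judge call, and the retry step is omitted because the summary does not describe it in detail.

```python
from typing import Callable, Iterable


def gold_sanity_gate(
    candidate_tests: Iterable[str],
    passes_on_gold: Callable[[str], bool],
    judge_accepts: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Filter generated tests: gold-patch check first, judge second.

    Returns (kept, gold_failures). A test that fails on the gold patch is
    rejected outright and never reaches the judge.
    """
    kept, gold_failures = [], []
    for test in candidate_tests:
        if not passes_on_gold(test):
            gold_failures.append(test)  # defective test: breaks on correct code
            continue
        if judge_accepts(test):
            kept.append(test)
    return kept, gold_failures


# Toy stand-ins for the Docker run and the LLM judge.
judged = []


def toy_judge(test: str) -> bool:
    judged.append(test)  # record which tests the judge actually saw
    return True


kept, gold_failures = gold_sanity_gate(
    ["t1", "t2", "t3"],
    passes_on_gold=lambda t: t != "t2",
    judge_accepts=toy_judge,
)
print(kept, gold_failures)

# The reported defect rate is simple arithmetic on the paper's counts.
print(f"{65 / 105:.1%}")  # prints 61.9%
```

Running the gold check first means a permissive judge cannot wave through a test that fails on correct code, which is the failure mode the 61.9% figure quantifies.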
Key facts
- 28.5% of a 49-task SWE-bench Verified sample have test suites weak enough that an incorrect patch passes them.
- 25.0% of 20 R2E-Gym tasks across 6 repositories are hackable under the same single-shot exploit generation pipeline.
- Meta-analysis of 134 frontier model submissions finds Pass@1 is +14.14 percentage points higher on hackable tasks than robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6).
- The effect is consistent: 123 of 134 models show a positive hackability advantage; I^2 = 0%.
- A proposed hardening procedure combines an inline LLM judge with a Docker gold-sanity gate.
- The gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate the LLM judge alone misses.
- With diversity-biased retry, the hardening loop converges 9 of 11 broken tasks to a gated upgrade.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.