Archive · 1 story · Jun 2026 – Jun 2026 · Updated 22:51 UTC

Archive

Every processed story in reverse chronological order, newest coverage first. Filter by tag, source, or score to drill in.

W24 · 1 story · Jun 8–14

6.7
Jun 14, 2026 · Shreshth Rajan · Research Papers · 1 min read
Study finds 28.5% of SWE-bench tasks hackable by incorrect patches
Systematic reward hackability at this scale means frontier models trained or evaluated on SWE-bench Verified and R2E-Gym may be earning inflated Pass@1 scores on a measurable fraction of tasks, undermining the reliability of these benchmarks as signals of true coding ability.
