Every processed story in chronological order, with the newest coverage first. Filter by tag, source, or score to drill in.
A benchmark built from private production code addresses the contamination risk present in public benchmarks like SWE-Bench, where training data overlap can inflate model scores.
The data shows that a value-tier model — DeepSeek V4 — cleared the production quality bar for the first time at its price point, reshaping token volume distribution in a single month, while frontier spend continued to grow, illustrating a market splitting into distinct cost tiers rather than converging on one.