FrontierCode benchmark launches with 3,000+ rubrics for maintainable code
@swyx announces the launch of FrontierCode, a new AI coding benchmark built on 1,000+ hours of maintainer-validated work, featuring 3,000+ code quality rubrics — and notes that METR_Evals found more than half of SWEBench results are unmergeable.
Score breakdown
FrontierCode's launch directly addresses the credibility gap in existing AI coding benchmarks — most notably the finding that over half of SWEBench results are unmergeable — by introducing maintainer-validated rubrics that measure real-world code quality rather than test-passing alone.
@swyx announces the launch of FrontierCode, framing it as the third era of AI coding benchmarks: 2021 brought HumanEval for autocomplete, 2023 brought SWEBench and TerminalBench for passing tests, and 2026 brings FrontierCode for maintainable code. The benchmark was built with IOI Gold medalists and top code maintainers, represents over 1,000 hours of maintainer-validated software engineering work, and includes 3,000+ rubrics designed to measure code quality and to guard against the reward hacking and cheating that plague other benchmarks. A key motivation for the launch is a METR_Evals finding that more than half of SWEBench results are unmergeable. The hardest tier, FC Diamond, is described as extremely difficult: Opus 4.8 scores only 13.8% on it.
A historical run across older models revealed a notable inflection point: the easiest third of FrontierCode tasks (in FC Extended) were solved abruptly in late 2025, with one model nearly doubling its pass rate from 41% to 74% in four months. @swyx connects this to a widely noted "vibe shift" in December 2025: the difference between needing roughly 6 re-rolls and roughly 2 to reach 95% success, which he argues made higher-order agentic coding loops (such as those described by @GeoffreyHuntley, @bcherny, and @steipete) more practically viable. Looking ahead, @swyx predicts each FrontierCode tier will saturate roughly annually as AI accelerates, and says he has already asked the team to prepare FrontierCode 2027.
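The "6 re-rolls versus 2" comparison can be reproduced with a short calculation. The sketch below is illustrative only and rests on an assumption the announcement does not state: that each attempt succeeds independently with probability equal to the reported pass rate (41% before, 74% after). Under that assumption, the number of attempts n needed to reach a 95% chance of at least one success solves 1 - (1 - p)^n = 0.95. The function name `rerolls_for_target` is hypothetical.

```python
import math


def rerolls_for_target(pass_rate: float, target: float = 0.95) -> float:
    """Return the (fractional) number of attempts n with 1 - (1 - pass_rate)**n == target.

    Assumption: attempts are independent and each succeeds with probability pass_rate.
    """
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.log(1.0 - target) / math.log(1.0 - pass_rate)


# Pass rates quoted in the summary: 41% before the jump, 74% after.
for p in (0.41, 0.74):
    n = rerolls_for_target(p)
    print(f"pass rate {p:.0%}: about {n:.2f} attempts ({math.ceil(n)} whole attempts)")
```

This prints about 5.68 attempts at a 41% pass rate and about 2.22 at 74%. Rounded to whole attempts that is 6 and 3, so the "2" in the source is best read as approximate: two independent attempts at 74% reach roughly 93%, slightly short of 95%.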
Key facts
- METR_Evals found that more than half of SWEBench results are unmergeable.
- FrontierCode represents over 1,000 hours of maintainer-validated software engineering work.
- The benchmark includes 3,000+ rubrics that measure code quality and guard against reward hacking and cheating.
- Opus 4.8 scores only 13.8% on FC Diamond, the hardest tier.
- A historical run showed the easiest third of FrontierCode tasks were rapidly solved in late 2025, with one model's pass rate nearly doubling from 41% to 74% in four months.
- FrontierCode is framed as the third era of AI coding benchmarks: HumanEval (2021), SWEBench/TerminalBench (2023), FrontierCode (2026).
- FrontierCode 2027 is already being prepared.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.