Cognition's FrontierCode benchmark tests code mergeability, not just test passage
Cognition's FrontierCode benchmark evaluates whether AI-generated code would actually be merged by a real maintainer, using blocker criteria and weighted rubrics across 150 tasks drawn from 36 open-source repositories.
FrontierCode represents a stricter standard for evaluating AI coding agents by requiring production-quality, review-ready code rather than just functional correctness — and the low scores even from leading models show the benchmark is far from saturated.
Cognition's FrontierCode benchmark reframes how AI coding performance is measured by asking not just whether generated code passes tests, but whether a real maintainer would merge the resulting pull request. The benchmark penalizes patches that are overly broad, touch unrelated files, add weak tests, ignore local style conventions, or solve an immediate issue in ways that complicate future changes — all common reasons for rejection in real code review. Tasks are built with maintainers from 36 flagship open-source repositories and cover a broader mix of programming languages than older benchmarks like DeepSWE and SWE-Bench Pro.
FrontierCode uses two primary metrics: pass rate, which is binary and requires clearing every blocker criterion (hard stops a maintainer would enforce), and score, a weighted aggregate of rubric items that becomes zero if any blocker criterion fails. The benchmark has three nested subsets — Extended (150 tasks), Main (100 hardest), and Diamond (50 hardest) — and each model is run five times at every available reasoning effort level, with the headline chart reporting the best-performing effort setting per model.
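The relationship between the two metrics can be illustrated with a minimal sketch. This is not Cognition's implementation: the `Criterion` class, its field names, the example criteria, and the choice to normalize the score by the total weight of all criteria (blockers included) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item for a task (hypothetical structure)."""
    name: str
    weight: float          # relative rubric weight, assumed non-negative
    satisfied: bool        # whether the patch met this criterion
    blocker: bool = False  # True for hard stops a maintainer would enforce


def task_passed(criteria):
    """Binary pass: every blocker criterion must be satisfied."""
    return all(c.satisfied for c in criteria if c.blocker)


def task_score(criteria):
    """Weighted share of satisfied criteria, forced to zero if any blocker fails."""
    if not task_passed(criteria):
        return 0.0
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.satisfied)
    return earned / total


# Illustrative rubric; names and weights are made up.
clean_patch = [
    Criterion("tests cover the fix", 3.0, True, blocker=True),
    Criterion("no unrelated files touched", 2.0, True),
    Criterion("follows local style", 1.0, False),
]
# Same rubric, but the blocker fails while the other items succeed.
blocked_patch = [
    Criterion("tests cover the fix", 3.0, False, blocker=True),
    Criterion("no unrelated files touched", 2.0, True),
    Criterion("follows local style", 1.0, True),
]

print(task_passed(clean_patch), round(task_score(clean_patch), 3))
print(task_passed(blocked_patch), task_score(blocked_patch))
```

In this sketch a patch can pass while scoring below 100% (it clears the blocker but misses a style item), whereas a patch that fails a blocker scores zero regardless of how many other rubric items it satisfies. Under these assumptions, that is why the reported score never exceeds the pass rate.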
On the Diamond subset, Claude Opus 4.8 leads with a score of 13.4% and a pass rate of 14.5%. GPT-5.5 follows with a score of 6.3% (pass rate 7.2%), then Claude Opus 4.7 at 5.2%, Gemini 3.1 Pro at 4.7%, GPT-5.4 Mini at 4.6%, and Kimi K2.6 at 3.8%. On the Main subset, Claude Opus 4.8 scores 34.3% with a 37.3% pass rate, while GPT-5.5 scores 25.5% with a 28.2% pass rate. The results chart also lets readers compare score and pass rate against output tokens, cost, time, tool calls, agent steps, and output token equivalent, reflecting that benchmark performance depends not only on accuracy but also on the resources required to achieve it. Cognition also reports that, in its own analysis, FrontierCode has a lower false positive rate than SWE-Bench Pro.
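The run protocol (five runs per reasoning effort level, with the headline reporting the best-performing effort per model) could be reduced to a single number along these lines. The source does not say how the five runs are combined, so averaging them is an assumption here, as are the function name, the effort labels, and the sample values, which are invented and are not benchmark results.

```python
from statistics import mean


def headline_result(runs_by_effort):
    """Pick the effort level with the highest mean score across its runs.

    runs_by_effort maps an effort label to a list of per-run scores.
    Averaging the runs is an assumption; the source only says each
    model is run five times per effort level.
    """
    averaged = {
        effort: mean(scores)
        for effort, scores in runs_by_effort.items()
        if scores
    }
    if not averaged:
        raise ValueError("no runs to aggregate")
    best_effort = max(averaged, key=averaged.get)
    return best_effort, averaged[best_effort]


# Invented scores for one hypothetical model, five runs per effort level.
example_runs = {
    "low": [0.10, 0.12, 0.11, 0.09, 0.13],
    "high": [0.20, 0.18, 0.22, 0.19, 0.21],
}

best_effort, best_score = headline_result(example_runs)
print(best_effort, round(best_score, 3))
```

Reporting only the best effort setting makes models comparable at their strongest configuration, which is one reason the chart's cost, time, and token axes matter: the best-scoring setting may also be the most expensive one.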
Key facts
- 01 FrontierCode measures code mergeability (whether a real maintainer would merge a pull request), not just whether code passes tests.
- 02 The benchmark has three nested subsets: Extended (150 tasks), Main (100 hardest tasks), and Diamond (50 hardest tasks).
- 03 Pass rate is binary and requires clearing every blocker criterion; score is a weighted rubric aggregate that becomes zero if any blocker fails.
- 04 Each model is run five times at every available reasoning effort level; the headline chart shows the best-performing effort setting per model.
- 05 On Diamond, Claude Opus 4.8 leads with a score of 13.4% and a pass rate of 14.5%; GPT-5.5 scores 6.3% and Claude Opus 4.7 scores 5.2%.
- 06 On Main, Claude Opus 4.8 scores 34.3% (37.3% pass rate); GPT-5.5 scores 25.5% (28.2% pass rate).
- 07 Tasks are built with maintainers from 36 flagship open-source repositories; Cognition reports lower false positive rates than SWE-Bench Pro.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.