Jun 9, 2026 · 1 min read · Research Papers

Cognition's FrontierCode benchmark tests code mergeability, not just test passage

Cognition's FrontierCode benchmark evaluates whether AI-generated code would actually be merged by a real maintainer, using blocker criteria and weighted rubrics across 150 tasks drawn from 36 open-source repositories.

YouTube: AICodeKing

Composite: 7.0 out of 10

Weights applied: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%.

Why it matters

FrontierCode represents a stricter standard for evaluating AI coding agents by requiring production-quality, review-ready code rather than just functional correctness — and the low scores even from leading models show the benchmark is far from saturated.

Summary: our read of the original

Cognition's FrontierCode benchmark reframes how AI coding performance is measured by asking not just whether generated code passes tests, but whether a real maintainer would merge the resulting pull request. The benchmark penalizes patches that are overly broad, touch unrelated files, add weak tests, ignore local style conventions, or solve an immediate issue in ways that complicate future changes — all common reasons for rejection in real code review. Tasks are built with maintainers from 36 flagship open-source repositories and cover a broader mix of programming languages than older benchmarks like DeepSWE and SWE-Bench Pro.

FrontierCode uses two primary metrics: pass rate, which is binary and requires clearing every blocker criterion (hard stops a maintainer would enforce), and score, a weighted aggregate of rubric items that becomes zero if any blocker criterion fails. The benchmark has three nested subsets — Extended (150 tasks), Main (100 hardest), and Diamond (50 hardest) — and each model is run five times at every available reasoning effort level, with the headline chart reporting the best-performing effort setting per model.
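
The two metrics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cognition's implementation. The `RubricItem` and `TaskResult` types, the normalisation of each task score to a 0 to 1 fraction, the averaging across tasks, and the choice of score as the quantity that decides the "best" effort level are all hypothetical choices made for the example. The source only states that pass rate requires clearing every blocker, that score is a weighted rubric aggregate that becomes zero when a blocker fails, and that the headline chart reports the best-performing effort setting per model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RubricItem:
    weight: float  # relative importance of the item (hypothetical scale)
    met: bool      # whether the patch satisfied the item


@dataclass
class TaskResult:
    blockers_cleared: list[bool]  # one entry per blocker criterion
    rubric: list[RubricItem]


def task_passed(result: TaskResult) -> bool:
    """Binary pass: every blocker criterion must be cleared."""
    return all(result.blockers_cleared)


def task_score(result: TaskResult) -> float:
    """Weighted rubric fraction in [0, 1]; zero if any blocker fails."""
    if not task_passed(result):
        return 0.0
    total = sum(item.weight for item in result.rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in result.rubric if item.met)
    return earned / total


def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (pass_rate, score); averaging over tasks is an assumption."""
    pass_rate = mean(1.0 if task_passed(r) else 0.0 for r in results)
    score = mean(task_score(r) for r in results)
    return pass_rate, score


def best_effort(by_effort: dict[str, list[TaskResult]]) -> str:
    """Pick the effort level whose pooled results give the highest score.

    Ranking by score (rather than pass rate) is an assumption.
    """
    return max(by_effort, key=lambda effort: summarize(by_effort[effort])[1])


# Usage: one patch clears both blockers and earns 3/4 of the rubric weight;
# the other satisfies every rubric item but fails a blocker, so it scores zero.
results = [
    TaskResult([True, True], [RubricItem(3, True), RubricItem(1, False)]),
    TaskResult([True, False], [RubricItem(3, True), RubricItem(1, True)]),
]
pass_rate, score = summarize(results)
print(pass_rate, score)  # 0.5 0.375
```

The second task shows why score can never exceed pass rate under this reading: a patch that fails a blocker contributes nothing, however well it does on the rest of the rubric.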

On the Diamond subset, Claude Opus 4.8 leads with a score of 13.4% and a pass rate of 14.5%. GPT-5.5 scores 6.3% (pass rate 7.2%), Claude Opus 4.7 scores 5.2%, Gemini 3.1 Pro scores 4.7%, GPT-5.4 Mini scores 4.6%, and Kimi K 2.6 scores 3.8%. On the Main subset, Claude Opus 4.8 scores 34.3% with a 37.3% pass rate, while GPT-5.5 scores 25.5% with a 28.2% pass rate. The results chart also lets readers plot score and pass rate against output tokens, cost, time, tool calls, agent steps, and output token equivalent, reflecting that benchmark performance depends on the resources spent as well as on accuracy. Cognition's own analysis also reports lower false-positive rates for FrontierCode than for SWE-Bench Pro.
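
One consequence of the zero-on-blocker rule can be made explicit with a little arithmetic. If score is a mean over tasks in which failed tasks contribute zero (an assumption; the source does not spell out the aggregation), then score divided by pass rate approximates the average rubric fraction earned by submissions that cleared every blocker. The sketch below applies that ratio to the four score and pass-rate pairs reported above. The `implied_rubric_fraction` helper is hypothetical and the result is a derived estimate, not a figure published by Cognition.

```python
# (score %, pass rate %) pairs as reported in the summary above.
reported = {
    ("Diamond", "Claude Opus 4.8"): (13.4, 14.5),
    ("Diamond", "GPT-5.5"): (6.3, 7.2),
    ("Main", "Claude Opus 4.8"): (34.3, 37.3),
    ("Main", "GPT-5.5"): (25.5, 28.2),
}


def implied_rubric_fraction(score_pct: float, pass_rate_pct: float) -> float:
    """Average rubric fraction among passing submissions, assuming
    score = pass_rate * mean(rubric fraction | all blockers cleared)."""
    if pass_rate_pct <= 0:
        raise ValueError("pass rate must be positive")
    return score_pct / pass_rate_pct


for (subset, model), (score_pct, pass_rate_pct) in reported.items():
    fraction = implied_rubric_fraction(score_pct, pass_rate_pct)
    print(f"{subset:8s} {model:16s} {fraction:.3f}")
```

Under that assumption, all four ratios fall between roughly 0.87 and 0.93, which would suggest that most of the lost score comes from blocker failures rather than from rubric items missed by otherwise passing patches.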

Key facts

01 FrontierCode measures code mergeability (whether a real maintainer would merge a pull request), not just whether code passes tests.
02 The benchmark has three nested subsets: Extended (150 tasks), Main (100 hardest tasks), and Diamond (50 hardest tasks).
03 Pass rate is binary and requires clearing every blocker criterion; score is a weighted rubric aggregate that becomes zero if any blocker fails.
04 Each model is run five times at every available reasoning effort level; the headline chart shows the best-performing effort setting per model.
05 On Diamond, Claude Opus 4.8 leads with a score of 13.4% and a pass rate of 14.5%; GPT-5.5 scores 6.3% and Claude Opus 4.7 scores 5.2%.
06 On Main, Claude Opus 4.8 scores 34.3% (37.3% pass rate); GPT-5.5 scores 25.5% (28.2% pass rate).
07 Tasks are built with maintainers from 36 flagship open-source repositories; Cognition reports lower false positive rates than SWE-Bench Pro.

Topics

#benchmarks #code-generation #agentic-coding #model-release

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.
