Cognition Labs launches FrontierCode to benchmark code quality, not just correctness
Cognition Labs introduced FrontierCode, a new coding benchmark built by 20+ open-source maintainers that measures whether AI-generated code is actually mergeable into production codebases — not just functionally correct.
Score breakdown
FrontierCode exposes a large gap between what current AI models can produce and what open-source maintainers would actually accept, with even the top-ranked model scoring only 13.4% on the hardest subset — a concrete signal that existing benchmarks have been overstating model readiness for production codebases.
Cognition Labs introduced FrontierCode, a coding benchmark that shifts the evaluation standard from functional correctness to production-grade code quality. The benchmark is built around a central question: would a real open-source maintainer actually merge this pull request? To answer it, Cognition enlisted 20+ world-class open-source developers who each spent more than 40 hours constructing realistic, challenging tasks drawn from repositories they personally maintain. Every task was also manually reviewed by a Cognition researcher. The grading methodology combines unit tests, rubrics, and novel verifiers to assess end-to-end quality across correctness, test quality, scope discipline, style, and adherence to codebase standards.
FrontierCode is structured as three nested subsets of increasing difficulty: Extended (150 tasks), Main (100 tasks), and Diamond (50 tasks). Models are scored on both pass rate and a weighted rubric score, and a solution that fails any blocking criterion receives a score of 0. Results show that even frontier models struggle: Claude Opus 4.8 leads on the Diamond subset with just 13.4%, followed by GPT-5.5 at 6.3% and Gemini 3.1 Pro at 4.7%. GPT-5.5 uses up to 4x fewer tokens than Opus 4.8, which gives it a better cost-to-performance tradeoff despite the lower score. On Main and Extended, Opus 4.8 scores 34.3% and 51.8%, respectively. The best open-source model, Kimi K2.6, reaches only 3.8% on Diamond, 16% on Main, and 37% on Extended. The benchmark also claims an 81% lower false positive rate than SWE-Bench Pro, addressing a known problem in which models that score highly on existing benchmarks produce patches that human maintainers would reject.
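
The scoring rule described above (a weighted rubric, with any failed blocking criterion forcing a score of 0) can be illustrated with a minimal Python sketch. This is not Cognition's grader: the `Criterion` type, the `rubric_score` function, the criterion names, and the weights are all hypothetical, and the sketch assumes the rubric score is the weighted fraction of criteria passed, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line for a submitted solution (hypothetical structure)."""
    name: str
    weight: float   # relative importance; values here are made up
    blocking: bool  # if True, failing this criterion zeroes the solution
    passed: bool


def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted fraction of criteria passed, or 0.0 if a blocking one fails."""
    # A single failed blocking criterion overrides everything else.
    if any(c.blocking and not c.passed for c in criteria):
        return 0.0
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        return 0.0
    passed_weight = sum(c.weight for c in criteria if c.passed)
    return passed_weight / total_weight


# A solution that passes the blocking check but misses a style criterion.
mergeable = [
    Criterion("unit tests pass", weight=3.0, blocking=True, passed=True),
    Criterion("follows codebase style", weight=1.0, blocking=False, passed=False),
]

# A solution that looks polished but fails the blocking check.
rejected = [
    Criterion("unit tests pass", weight=3.0, blocking=True, passed=False),
    Criterion("follows codebase style", weight=1.0, blocking=False, passed=True),
]

print(rubric_score(mergeable))  # 0.75
print(rubric_score(rejected))   # 0.0
```

Under this assumed rule, partial credit is available only to solutions that clear every blocking criterion, which is one way a benchmark could keep stylistically clean but broken patches from earning points.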
Key facts
- FrontierCode is described as the first benchmark to measure code mergeability, not just functional correctness.
- 20+ open-source maintainers built tasks from repos they maintain, spending more than 40 hours per task.
- Every task was manually reviewed by a Cognition researcher.
- FrontierCode claims an 81% lower false positive rate compared to SWE-Bench Pro.
- The benchmark has three nested subsets: Extended (150 tasks), Main (100 tasks), and Diamond (50 tasks).
- Claude Opus 4.8 leads on Diamond with a score of 13.4%; GPT-5.5 scores 6.3%; Gemini 3.1 Pro scores 4.7%.
- Kimi K2.6 is the best-performing open-source model, scoring 3.8% on Diamond, 16% on Main, and 37% on Extended.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.