@swyx highlights new Pareto frontiers across Codex benchmarks
@swyx reports a new Codex release claiming Pareto-frontier performance across context, pricing, speed, and a sweep of benchmarks including 82.7% on Terminal-Bench 2.0 and 98.0% on Tau2-bench Telecom.
Developers evaluating agentic coding tools should note the combination of a 1M-token API context window, a reported 20% inference speed gain, and strong scores across coding, bioinformatics, and knowledge-work benchmarks, all at a published price point. Together these make the release a concrete new baseline for model selection.
@swyx summarized what appears to be a significant new Codex release, framing it as achieving new Pareto frontiers across context length, pricing, inference speed, and benchmark performance. On the infrastructure side, the model is described as the first generation co-designed with GB200 and GB300 NVL72 hardware, and it reportedly improved its own inference speed by 20%. Context windows are substantial: 400K tokens within Codex and 1M tokens via the API, at $5/M input and $30/M output tokens.
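To make the quoted prices concrete, here is a minimal cost sketch in Python. It assumes flat per-token billing at the $5/M input and $30/M output rates reported above; the summary does not say whether caching discounts, batch rates, or long-context tiers apply, so treat the result as a rough upper-level estimate rather than an actual bill. The `estimate_cost` helper and the example token counts are hypothetical and are not part of any published API.

```python
from decimal import Decimal

# Prices quoted in the summary, in US dollars per one million tokens.
INPUT_PRICE_PER_M = Decimal("5")
OUTPUT_PRICE_PER_M = Decimal("30")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate one request's cost in USD (hypothetical helper).

    Assumes flat per-token pricing with no discounts or tiers.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * INPUT_PRICE_PER_M
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_M
    )
    return total / TOKENS_PER_M


# Illustrative request: fill the 1M-token API context, get 10K tokens back.
full_context = estimate_cost(1_000_000, 10_000)
print(f"${full_context:.2f}")  # $5.30
```

Under these assumptions the input side dominates a full-context request: 1M input tokens cost $5.00, while 10K output tokens add $0.30. Output becomes the larger share once a request returns more than one sixth as many tokens as it sends, since output is priced at six times the input rate.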
The benchmark sweep covers a wide range of domains. On coding and engineering tasks, the model scores 82.7% on Terminal-Bench 2.0, 73.1% on Expert-SWE (described as a new internal eval), and 58.6% on SWE-Bench Pro. Knowledge work and domain-specific evals are also highlighted: 84.9% on GDPval (with the note "knowledge work ~solved?"), 98.0% on Tau2-bench Telecom for support tasks, and 80.5% on BixBench covering bioinformatics and data analysis.
Key facts
- 400K-token context window in Codex; 1M-token context window via the API
- API pricing: $5/M input tokens and $30/M output tokens
- Codex improved its own inference speed by 20%
- First generation co-designed with GB200 and GB300 NVL72 hardware
- 82.7% on Terminal-Bench 2.0; 73.1% on Expert-SWE (new internal eval); 58.6% on SWE-Bench Pro
- 84.9% on GDPval; 98.0% on Tau2-bench Telecom; 80.5% on BixBench
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 24, 2026 · 17:11 UTC.