Apr 23, 2026 · 1 min read · New Models & Releases

@swyx highlights new Pareto frontiers across Codex benchmarks

@swyx reports a new Codex release claiming Pareto-frontier performance across context, pricing, speed, and a sweep of benchmarks including 82.7% on Terminal-Bench 2.0 and 98.0% on Tau2-bench Telecom.

Twitter: @swyx

Composite

7.4 out of 10

Weights: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%

Why it matters

Developers evaluating agentic coding tools should note the combination of a 1M-token API context window, a 20% inference speed gain, and strong scores across coding, bioinformatics, and knowledge-work benchmarks, all at a published price point. Together these make the release a concrete new baseline for model selection.

Summary: our read of the original

@swyx summarized what appears to be a significant new Codex release, framing it as achieving new Pareto frontiers across context length, pricing, inference speed, and benchmark performance. On the infrastructure side, the model is described as the first generation co-designed with GB200 and GB300 NVL72 hardware, and it reportedly improved its own inference speed by 20%. Context windows are substantial: 400K tokens within Codex and 1M tokens via the API, at $5/M input and $30/M output tokens.

The benchmark sweep covers a wide range of domains. On coding and engineering tasks, the model scores 82.7% on Terminal-Bench 2.0, 73.1% on Expert-SWE (described as a new internal eval), and 58.6% on SWE-Bench Pro. Knowledge work and domain-specific evals are also highlighted: 84.9% on GDPval (with the note "knowledge work ~solved?"), 98.0% on Tau2-bench Telecom for support tasks, and 80.5% on BixBench covering bioinformatics and data analysis.

Key facts

01 400K token context window in Codex; 1M token context window via the API
02 API pricing: $5/M input tokens and $30/M output tokens
03 Codex improved its own inference speed by 20%
04 First generation co-designed with GB200 and GB300 NVL72 hardware
05 82.7% on Terminal-Bench 2.0; 73.1% on Expert-SWE (new internal eval); 58.6% on SWE-Bench Pro
06 84.9% on GDPval; 98.0% on Tau2-bench Telecom; 80.5% on BixBench
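
For readers budgeting against the pricing listed above, the per-request arithmetic can be sketched as follows. This is a minimal illustration only: the two rates are the figures quoted in the summary, while the function name `estimate_cost_usd` and the example token counts are hypothetical. It assumes flat per-token billing with no caching discounts or tiering, which the source does not confirm.

```python
# Rates as quoted in the summary (USD per million tokens).
INPUT_USD_PER_M = 5.0
OUTPUT_USD_PER_M = 30.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the quoted rates.

    Hypothetical helper: assumes flat per-token billing, which the
    source does not state explicitly.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total_micro_usd = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total_micro_usd / 1_000_000


if __name__ == "__main__":
    # Illustrative request sizes, not taken from the source.
    print(f"${estimate_cost_usd(200_000, 4_000):.2f}")    # 200K in, 4K out
    print(f"${estimate_cost_usd(1_000_000, 0):.2f}")      # input filling the 1M API window
```

At these rates, output tokens cost six times as much as input tokens, so long generations can dominate the bill even when the prompt fills much of the context window.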

Topics

#model-release #benchmarks #code-generation #context-window

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 24, 2026 · 17:11 UTC.
