Apr 15, 2026 · 1 min read · New Models & Releases

Mercury 2 tops OpenClaw benchmark with speed and low cost

Mercury 2 achieved a 78% task success rate on Pinch Bench, OpenClaw's open-source benchmark, outperforming GPT-5 Mini, Gemini 2.5 Flash, and GPT-4 while delivering 1.7-second end-to-end latency at a fraction of competing model prices.

Source: YouTube · David Ondrej

Composite score: 5.4 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Teams running OpenClaw as a continuous agent can evaluate Mercury 2 as a drop-in model: the reported results suggest much lower latency and cost with no loss in task success rate.


Summary: our read of the original

According to a video by David Ondrej, Mercury 2 has claimed the top spot on Pinch Bench, the open-source benchmark designed specifically for OpenClaw agent tasks. With a 78% task success rate, it outscores GPT-5 Mini at 75%, DeepSeek Chat at 72%, Gemini 2.5 Flash at 71%, and GPT-4 at 71%. Mercury 2 also posts the fastest end-to-end latency at that accuracy tier, 1.7 seconds, compared with 23 seconds for Claude Haiku 4.5 with reasoning enabled.


The video attributes the speed advantage to Mercury's diffusion-based architecture, which generates tokens in parallel rather than one at a time. Because an agent workflow chains many model calls, per-call latency compounds over a long run, so a faster model shortens the whole workflow. Pinch Bench tests real-world agentic actions (scheduling meetings, drafting emails, writing code, and managing files), which makes the results relevant to production agent deployments.
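
To make the compounding effect concrete, here is a minimal sketch of how per-call latency adds up when an agent makes its model calls one after another. The two latency figures come from the summary above; the number of calls per task (20) is a hypothetical value chosen for illustration, and the model ignores tool execution time, network variance, and any parallel calls.

```python
# Per-call end-to-end latencies quoted in the summary, in seconds.
LATENCY_S = {
    "Mercury 2": 1.7,
    "Claude Haiku 4.5 (reasoning)": 23.0,
}

# Hypothetical: number of sequential model calls in one agent task.
# This figure is an assumption for illustration, not from the source.
HYPOTHETICAL_CALLS_PER_TASK = 20


def sequential_latency(per_call_s: float, num_calls: int) -> float:
    """Total model wait time when each call must finish before the next starts."""
    if num_calls < 0:
        raise ValueError("num_calls must be non-negative")
    return per_call_s * num_calls


for model, per_call in LATENCY_S.items():
    total = sequential_latency(per_call, HYPOTHETICAL_CALLS_PER_TASK)
    print(f"{model}: about {total:.0f} s of model latency per task")
```

Under these assumptions, a 20-call task spends roughly 34 seconds waiting on Mercury 2 versus roughly 460 seconds on the slower configuration. The gap grows linearly with the number of calls.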

On pricing, Mercury 2 costs $0.25 per million input tokens and $0.75 per million output tokens, compared with $1 and $5 per million, respectively, for Claude Haiku. The video frames Mercury 2 as a practical option for running OpenClaw continuously as a personal AI agent, where both latency and cost accumulate with every task and hour of operation.
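
The pricing comparison can be worked through with a short calculation. The sketch below uses the per-million-token prices quoted above; the daily token volumes are hypothetical and only illustrate how the difference scales for an always-on agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in US dollars per million tokens, as quoted in the summary."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Dollar cost of a workload with the given token counts."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


MERCURY_2 = Pricing(input_per_million=0.25, output_per_million=0.75)
CLAUDE_HAIKU = Pricing(input_per_million=1.00, output_per_million=5.00)

# Hypothetical daily workload for a continuously running agent.
# These volumes are assumptions for illustration, not from the source.
DAILY_INPUT_TOKENS = 10_000_000
DAILY_OUTPUT_TOKENS = 2_000_000

for name, pricing in (("Mercury 2", MERCURY_2), ("Claude Haiku", CLAUDE_HAIKU)):
    daily = pricing.cost(DAILY_INPUT_TOKENS, DAILY_OUTPUT_TOKENS)
    print(f"{name}: ${daily:.2f} per day")
```

For this assumed workload the totals are $4.00 per day for Mercury 2 and $20.00 per day for Claude Haiku. The ratio depends on the input-to-output mix, because the output price gap is wider than the input price gap.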

Key facts

01 Mercury 2 scored a 78% task success rate on Pinch Bench, the open-source benchmark built on top of OpenClaw.
02 It outperformed GPT-5 Mini (75%), DeepSeek Chat (72%), Gemini 2.5 Flash (71%), and GPT-4 (71%).
03 Mercury 2's end-to-end latency is 1.7 seconds, compared with 23 seconds for Claude Haiku 4.5 with reasoning.
04 Mercury uses a diffusion architecture that generates tokens in parallel rather than one by one.
05 Pricing is $0.25 per million input tokens and $0.75 per million output tokens, versus $1 and $5 for Claude Haiku.
06 Pinch Bench tests real agentic tasks: scheduling meetings, drafting emails, writing code, and managing files.
07 OpenClaw is described as the fastest-growing open-source project in GitHub history.

Topics

#model-release #benchmarks #agent-framework #open-source #coding-assistant

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 19, 2026 · 23:06 UTC.
