Mercury 2 tops OpenClaw benchmark with speed and low cost
Mercury 2 achieved a 78% task success rate on Pinch Bench, OpenClaw's open-source benchmark, outperforming GPT-5 Mini, Gemini 2.5 Flash, and GPT-4 while delivering 1.7-second end-to-end latency at a fraction of competing model prices.
Score breakdown
Teams running OpenClaw as a continuous agent can evaluate Mercury 2 as a drop-in model to dramatically cut latency and cost without sacrificing task accuracy.
According to a video by David Ondrej, Mercury 2 has claimed the top spot on Pinch Bench, the open-source benchmark designed specifically for OpenClaw agent tasks. With a 78% task success rate, it outscores GPT-5 Mini at 75%, DeepSeek Chat at 72%, Gemini 2.5 Flash at 71%, and GPT-4 at 71%. Critically, Mercury 2 also holds the fastest end-to-end latency at that accuracy tier: just 1.7 seconds, compared to 23 seconds for Claude 4.5 Haiku with reasoning enabled.
The speed advantage stems from Mercury's diffusion-based architecture, which generates tokens in parallel rather than one at a time. Because each model call returns faster, latency compounds far less across long-running agent workflows, where many calls run back to back. Pinch Bench tests real-world agentic actions (scheduling meetings, drafting emails, writing code, and managing files), which makes the results directly relevant to production agent deployments.
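To make the compounding effect concrete, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that every step of a workflow waits the full end-to-end latency quoted above and that steps run strictly one after another; the function name and the 20-step workflow length are hypothetical, not figures from the video.

```python
# Latency figures quoted in the article, in seconds. Treating them as a fixed
# per-step cost is an assumption made only for this illustration.
MERCURY_2_LATENCY_S = 1.7
HAIKU_REASONING_LATENCY_S = 23.0


def total_wait_seconds(latency_s: float, sequential_calls: int) -> float:
    """Total time spent waiting on the model when calls run back to back."""
    if sequential_calls < 0:
        raise ValueError("sequential_calls must be non-negative")
    return latency_s * sequential_calls


if __name__ == "__main__":
    steps = 20  # hypothetical workflow length
    for name, latency in [
        ("Mercury 2", MERCURY_2_LATENCY_S),
        ("Claude 4.5 Haiku (reasoning)", HAIKU_REASONING_LATENCY_S),
    ]:
        print(f"{name}: {total_wait_seconds(latency, steps):.0f} s over {steps} steps")
```

Under these assumptions a 20-step workflow spends about 34 seconds waiting on Mercury 2 versus about 460 seconds on Claude 4.5 Haiku with reasoning. Real workflows will differ, since per-call latency varies with prompt and output length.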
On pricing, Mercury 2 costs $0.25 per million input tokens and $0.75 per million output tokens, versus $1 and $5 respectively for Claude Haiku. The video frames Mercury 2 as a practical choice for running OpenClaw continuously as a personal AI agent, where both latency and cost compound over every task and hour of operation.
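The price gap can be worked through the same way. The sketch below uses the per-million-token prices quoted above; the helper name and the example workload (10 million input tokens and 2 million output tokens) are hypothetical and chosen only to show the calculation.

```python
# Prices in USD per million tokens as quoted in the article: (input, output).
PRICES_PER_MILLION = {
    "Mercury 2": (0.25, 0.75),
    "Claude Haiku": (1.00, 5.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the quoted per-million-token prices."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10M input tokens and 2M output tokens.
    for model in PRICES_PER_MILLION:
        cost = workload_cost(model, 10_000_000, 2_000_000)
        print(f"{model}: ${cost:.2f}")
```

For this assumed workload the totals come to $4.00 for Mercury 2 and $20.00 for Claude Haiku. The ratio shifts with the input-to-output mix, because the output price gap ($0.75 versus $5) is wider than the input price gap ($0.25 versus $1).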
Key facts
- Mercury 2 scored a 78% task success rate on Pinch Bench, the open-source benchmark built on top of OpenClaw.
- It outperformed GPT-5 Mini (75%), DeepSeek Chat (72%), Gemini 2.5 Flash (71%), and GPT-4 (71%).
- Mercury 2's end-to-end latency is 1.7 seconds, compared to 23 seconds for Claude 4.5 Haiku with reasoning.
- Mercury uses a diffusion architecture, generating tokens in parallel rather than one by one.
- Pricing is $0.25 per million input tokens and $0.75 per million output tokens, versus Claude Haiku at $1 and $5.
- Pinch Bench tests real agentic tasks: scheduling meetings, drafting emails, writing code, and managing files.
- OpenClaw is described as the fastest-growing open-source project in GitHub history.