Apr 22, 2026 · 1 min read · Research Papers

OCR benchmark of 18 LLMs finds cheaper models beat expensive ones

ArbitrAI's OCR Mini-Bench ran 7,560 calls across 18 LLMs on 42 real business documents and found that budget and mid-tier models like Gemini 3 Flash and Claude Sonnet 4.6 match or outperform expensive SOTA models on structured document extraction tasks.

Hacker News · TimoKerr

Composite: 6.8 out of 10

Novelty · 25%
Impact · 43%
Credibility · 12%
Depth · 20%

Weights applied.

Why it matters

Teams building production OCR pipelines can use this benchmark to avoid overpaying for SOTA models: Gemini 3 Flash ties Claude Sonnet 4.6 for the top overall success rate (73.8%) at 0.67¢ per success versus 3.61¢, and the `pass^n` consistency metric helps identify models that are reliable enough for automated workflows.

Summary: our read of the original

ArbitrAI's OCR Mini-Bench is a business-focused benchmark designed to evaluate LLM-based document extraction on real operational documents. Across 42 documents, 18 models, and 7,560 total runs, it measures critical-field success, all-field accuracy, `pass^3` and `pass^5` consistency scores, latency, and cost per successful outcome — metrics chosen to reflect production realities rather than one-shot performance.
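
To make the difference between the first two metrics concrete, here is a minimal sketch. The summary does not give the benchmark's scoring rules, so the definitions are assumptions: critical-field success is taken to mean every field flagged as critical matches the ground truth exactly, and all-field accuracy is taken to be the share of all expected fields that match. The `score_run` helper and the field names are hypothetical.

```python
def score_run(expected: dict, extracted: dict, critical: set) -> tuple:
    """Return (critical_success, all_field_accuracy) for one extraction run.

    Assumed definitions, not confirmed by the benchmark:
    - critical_success: every critical field matches exactly
    - all_field_accuracy: share of expected fields that match exactly
    """
    matches = {key: extracted.get(key) == value for key, value in expected.items()}
    critical_success = all(matches.get(key, False) for key in critical)
    all_field_accuracy = sum(matches.values()) / len(matches)
    return critical_success, all_field_accuracy


# Hypothetical invoice: the critical fields are right, the PO number is wrong.
expected = {"invoice_number": "INV-1001", "total": "118.40", "po_number": "PO-77"}
extracted = {"invoice_number": "INV-1001", "total": "118.40", "po_number": "PO-17"}

result = score_run(expected, extracted, critical={"invoice_number", "total"})
```

Under these assumptions the run counts as a critical-field success even though only two of its three fields are correct, which is why the two numbers can diverge.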

In the overall leaderboard, Gemini 3 Flash (Google, BALANCED tier) and Claude Sonnet 4.6 (Anthropic, BALANCED tier) share the top spot at 73.8% success with perfect `pass^3` and `pass^5` consistency, but Gemini 3 Flash costs just 0.67¢ per success at 16.0s latency versus Claude Sonnet 4.6's 3.61¢ at 18.9s. Claude Opus 4.6 ranks third at 73.0% success for 6.25¢/success. By contrast, GPT-5 ranks 13th at 44.6% success and a steep 24.20¢ per success, and GPT-5.4 ranks 10th at 49.2% success for 7.82¢/success — both far behind on cost-efficiency.
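
The cost gap is easier to see as a multiple of the cheapest top-ranked model. The sketch below uses only the overall figures quoted in this summary; the variable names are illustrative.

```python
# Overall leaderboard figures quoted in the summary:
# model -> (success rate in %, cost in cents per success).
leaderboard = {
    "Gemini 3 Flash": (73.8, 0.67),
    "Claude Sonnet 4.6": (73.8, 3.61),
    "Claude Opus 4.6": (73.0, 6.25),
    "GPT-5.4": (49.2, 7.82),
    "GPT-5": (44.6, 24.20),
}

baseline = leaderboard["Gemini 3 Flash"][1]

# Cost per success expressed as a multiple of Gemini 3 Flash's cost.
cost_multiple = {name: cost / baseline for name, (_, cost) in leaderboard.items()}

for name, multiple in sorted(cost_multiple.items(), key=lambda item: item[1]):
    success, cost = leaderboard[name]
    print(f"{name:<18} {success:5.1f}% success  {cost:5.2f} cents/success  {multiple:5.1f}x")
```

On these figures, Claude Sonnet 4.6 costs about 5.4 times as much per success as Gemini 3 Flash for the same success rate, and GPT-5 about 36 times as much for a lower one.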

Budget models also show competitive results in specific categories. On receipts, Gemini 2.5 Flash-Lite achieves 53.9% success at just 0.06¢/success. On invoices, Gemini 3 Flash tops the chart at 91.7% success. The `pass^n` metric reveals meaningful reliability gaps: GPT-5.4's overall success drops from 49.2% to 35.9% at `pass^5`, indicating high run-to-run variance, while several Gemini models maintain identical scores across `pass^3` and `pass^5`, signaling strong consistency.
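
One way to estimate a `pass^k` style score is the combinatorial estimator C(c, k) / C(n, k) for c successes out of n runs per document, averaged over documents. Agent benchmarks such as τ-bench use this estimator. Whether ArbitrAI computes its scores this way is an assumption, and the run outcomes below are made up for illustration.

```python
from math import comb


def pass_hat_k(outcomes: list, k: int) -> float:
    """Estimate the chance that k runs drawn from one document's runs all succeed."""
    n, c = len(outcomes), sum(outcomes)
    if k > n:
        raise ValueError("k cannot exceed the number of runs")
    return comb(c, k) / comb(n, k)


def benchmark_pass_k(runs_by_document: dict, k: int) -> float:
    """Average the per-document estimate across all documents."""
    scores = [pass_hat_k(outcomes, k) for outcomes in runs_by_document.values()]
    return sum(scores) / len(scores)


# Made-up outcomes for three documents, 10 runs each (True = successful extraction).
runs = {
    "doc_a": [True] * 10,               # always succeeds
    "doc_b": [True] * 7 + [False] * 3,  # flaky
    "doc_c": [False] * 10,              # always fails
}

pass_1 = benchmark_pass_k(runs, 1)
pass_5 = benchmark_pass_k(runs, 5)
```

In this toy data the flaky document contributes 0.7 to `pass^1` but only about 0.08 to `pass^5`, so the aggregate falls sharply as k grows. A model whose documents either always succeed or always fail would score the same at every k.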

Key facts

01. 42 business documents, 18 models, and 7,560 total runs were used in the benchmark
02. Gemini 3 Flash and Claude Sonnet 4.6 tied for first overall at 73.8% success rate
03. Gemini 3 Flash costs 0.67¢ per success; Claude Sonnet 4.6 costs 3.61¢ per success for the same score
04. GPT-5 ranked 13th overall at 44.6% success and 24.20¢ per success, the highest cost/success in the leaderboard
05. The `pass^n` metric measures the probability of n consecutive successes, exposing consistency issues hidden by single-run scores
06. GPT-5.4's success rate drops from 49.2% to 35.9% at `pass^5`, indicating high variance
07. Gemini 2.5 Flash-Lite achieves 0.06¢ per success on receipts, the lowest cost in that category

Topics

#benchmarks #ocr #llm-evaluation #cost-analysis #model-comparison

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 22, 2026 · 11:07 UTC.
