Apr 22, 2026 · 1 min read · Research Papers

Cheaper LLMs beat pricier rivals in business OCR benchmark

ArbitrAI's OCR Mini-Bench tested 18 LLMs across 7,560 runs on 42 real business documents, finding that budget and mid-range models like Gemini 3 Flash and Claude Sonnet 4.6 outperform far more expensive options on cost-adjusted OCR quality.

Hacker News · TimoKerr

Read at source

Composite

6.0

out of 10

Novelty · 25%

Impact · 43%

Credibility · 12%

Depth · 20%

Weights applied. How scores work ↗

Why it matters

Teams building production document-processing pipelines should evaluate cost per success and consistency metrics such as `pass^5`, not peak accuracy alone. In this benchmark, budget and mid-range models match or beat far more expensive models on real business OCR tasks at a fraction of the cost per success.


Summary: our read of the original

ArbitrAI published OCR Mini-Bench, a business-focused benchmark that evaluates 18 LLMs on structured data extraction from 42 real operational documents in three categories: receipts, invoices, and logistics. For the overall leaderboard, each model ran 10 times per document (3 runs for receipts), totaling 7,560 runs. Rather than reporting a single accuracy score, the benchmark tracks success rate, `pass^3` and `pass^5` (the probability of succeeding on 3 or 5 consecutive runs), cost per successful outcome, latency, critical-field accuracy, and variance. These metrics are designed to reflect production reliability rather than best-case performance.
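The summary does not say how the benchmark computes `pass^n`, so the following is only a sketch of one common estimator, not ArbitrAI's confirmed method. It assumes that each document records how many of its runs succeeded, that `pass^k` for a document is the chance that `k` runs drawn without replacement all succeed, and that the benchmark score is the average over documents. The function names and the sample data are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, runs: int, k: int) -> float:
    """Chance that k runs sampled without replacement from `runs` all succeed."""
    if not 0 <= successes <= runs:
        raise ValueError("successes must be between 0 and runs")
    if not 1 <= k <= runs:
        raise ValueError("k must be between 1 and runs")
    # math.comb returns 0 when successes < k, so the estimate is 0 in that case.
    return comb(successes, k) / comb(runs, k)


def benchmark_pass_k(successes_per_doc: list[int], runs: int, k: int) -> float:
    """Average the per-document estimate over all documents."""
    total = sum(pass_hat_k(c, runs, k) for c in successes_per_doc)
    return total / len(successes_per_doc)


# Made-up data (not from the benchmark): 4 documents, 10 runs each.
stable = [10, 10, 10, 0]  # every document is either always right or always wrong
flaky = [10, 8, 7, 5]     # same 30 successes out of 40, spread unevenly

for k in (1, 3, 5):
    print(k,
          round(benchmark_pass_k(stable, 10, k), 3),
          round(benchmark_pass_k(flaky, 10, k), 3))
```

Under these assumptions the two made-up models have the same single-run success rate, but only the one whose outcome never changes on a given document keeps that score as `k` grows.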


In the overall rankings, Gemini 3 Flash (Google, BALANCED tier) and Claude Sonnet 4.6 (Anthropic, BALANCED tier) share the top spot at 73.8% success, with no drop-off on `pass^3` or `pass^5`. Gemini 3 Flash, however, costs 0.67¢ per success against Claude Sonnet 4.6's 3.61¢. Claude Opus 4.6 ranks third at 73.0% success for 6.25¢ per success. At the budget end, Gemini 2.5 Flash-Lite achieves 58.6% success for 0.10¢ per success. OpenAI's models fare poorly on value: GPT-5 ranks 13th overall at 44.6% success and 24.20¢ per success, while GPT-5.4 nano ranks 17th at 23.6% success for 7.05¢ per success. GPT-5 nano finishes last at 8.7% overall success.
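To put the cost gap in concrete terms, the sketch below projects the reported cost-per-success figures onto a fixed workload. The success rates and prices are the ones quoted in this summary. The target of 10,000 successful extractions and the helper name `dollars_for` are hypothetical, and the sketch assumes that the reported cost per success already includes the spend on failed runs.

```python
# Figures as reported in the summary: (success rate in %, cents per success).
reported = {
    "Gemini 2.5 Flash-Lite": (58.6, 0.10),
    "Gemini 3 Flash": (73.8, 0.67),
    "Claude Sonnet 4.6": (73.8, 3.61),
    "Claude Opus 4.6": (73.0, 6.25),
    "GPT-5.4 nano": (23.6, 7.05),
    "GPT-5": (44.6, 24.20),
}

TARGET_SUCCESSES = 10_000  # hypothetical workload, not from the benchmark


def dollars_for(cents_per_success: float, successes: int) -> float:
    """Projected spend in dollars to obtain `successes` successful extractions."""
    return cents_per_success * successes / 100


# Use Gemini 3 Flash, one of the two top-ranked models, as the price baseline.
baseline_cents = reported["Gemini 3 Flash"][1]

for name, (rate, cents) in sorted(reported.items(), key=lambda item: item[1][1]):
    spend = dollars_for(cents, TARGET_SUCCESSES)
    ratio = cents / baseline_cents
    print(f"{name:22s} {rate:5.1f}%  ${spend:9.2f}  {ratio:5.1f}x")
```

On these figures, Claude Sonnet 4.6 costs roughly 5.4 times as much per success as Gemini 3 Flash for the same success rate, and GPT-5 roughly 36 times as much for a lower one.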

The benchmark's `pass^n` metric is particularly revealing. GPT-5.4 drops sharply from 49.2% single-run success to 35.9% on `pass^5`, indicating high run-to-run variance, while Gemini 3 Flash posts identical scores on all three consistency metrics, signaling highly stable outputs. The results are also broken out by document category (receipts, invoices, logistics), and the category-level rankings show some reshuffling: Claude Opus 4.6 leads the receipts sub-benchmark at 84.6% success, while Gemini 3 Flash leads invoices at 91.7%.
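One simple way to compare these drops across models is the share of single-run success that survives at `pass^5`. The sketch below uses only figures quoted in this summary. The "retention" ratio and the function name are hypothetical conveniences, not metrics the benchmark reports, and the Gemini 3 Flash `pass^5` value is inferred from the statement that its three scores are identical.

```python
# (single-run success %, pass^5 %) as quoted in the summary.
reported = {
    "Gemini 3 Flash": (73.8, 73.8),  # pass^5 inferred: scores stated to be identical
    "GPT-5.4": (49.2, 35.9),
    "GPT-5 nano": (8.7, 4.1),
}


def retention(success_pct: float, pass5_pct: float) -> float:
    """Fraction of single-run success that remains at pass^5."""
    if success_pct <= 0:
        raise ValueError("success rate must be positive")
    return pass5_pct / success_pct


for name, (success, pass5) in reported.items():
    print(f"{name:15s} success={success:5.1f}%  pass^5={pass5:5.1f}%  "
          f"retained={retention(success, pass5):.3f}")
```

A ratio of 1.0 means the model's result on a given document did not change between runs. Lower values mean more of its successes came on documents where it also failed at least once.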

Key facts

01 18 LLMs were tested across 42 business documents in 7,560 total runs, covering receipts, invoices, and logistics categories.
02 Gemini 3 Flash and Claude Sonnet 4.6 tied for first overall at 73.8% success, but Gemini 3 Flash costs 0.67¢ per success vs. Claude Sonnet 4.6's 3.61¢.
03 GPT-5 ranked 13th overall with a 44.6% success rate and the highest cost per success at 24.20¢.
04 GPT-5 nano finished last with just 8.7% overall success and a `pass^5` rate of 4.1%.
05 Gemini 2.5 Flash-Lite achieved 58.6% success at only 0.10¢ per success, the lowest cost in the overall leaderboard.
06 The `pass^n` metric measures the probability of n consecutive successes; GPT-5.4 dropped from 49.2% single-run success to 35.9% on `pass^5`, revealing high variance.
07 Claude Opus 4.6 led the receipts sub-benchmark at 84.6% success; Gemini 3 Flash led invoices at 91.7%.

Topics

#benchmarks #ocr #model-evaluation #cost-analysis

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 22, 2026 · 19:13 UTC. How this works →
