Cheaper LLMs beat pricier rivals in business OCR benchmark
ArbitrAI's OCR Mini-Bench tested 18 LLMs across 7,560 runs on 42 real business documents, finding that budget and mid-range models like Gemini 3 Flash and Claude Sonnet 4.6 outperform far more expensive options on cost-adjusted OCR quality.
Teams building production document-processing pipelines should evaluate cost-per-success and consistency metrics like `pass^5` rather than peak accuracy alone, as this benchmark shows budget and mid-range models can dramatically outperform expensive SOTA models on real business OCR tasks.
ArbitrAI published OCR Mini-Bench, a business-focused benchmark that evaluates 18 LLMs on structured data extraction from 42 real operational documents in three categories: receipts, invoices, and logistics. For the overall leaderboard, each model ran 10 times per document (3 runs for receipts), totaling 7,560 runs. Rather than reporting a single accuracy score, the benchmark tracks success rate, `pass^3` and `pass^5` (the probability that 3 or 5 consecutive runs all succeed), cost per successful outcome, latency, critical-field accuracy, and variance. These metrics are designed to reflect production reliability rather than best-case performance.
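To make these metrics concrete, here is a minimal Python sketch of how success rate, `pass^k`, and cost per success could be computed from raw run outcomes. It is not ArbitrAI's scoring code: the function names, the combinatorial `pass^k` estimator (the share of k-run subsets in which every run succeeded, averaged over documents), the definition of cost per success as total spend divided by successful runs, and the sample data are all assumptions made for illustration.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k runs sampled without replacement from `trials` all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


def summarize(runs: dict[str, list[bool]], total_cost_cents: float) -> dict[str, float]:
    """Aggregate per-document run outcomes into leaderboard-style metrics."""
    outcomes = list(runs.values())
    n_runs = sum(len(r) for r in outcomes)
    n_success = sum(sum(r) for r in outcomes)

    def mean_pass(k: int) -> float:
        # Average the per-document estimate over all documents.
        return sum(pass_hat_k(sum(r), len(r), k) for r in outcomes) / len(outcomes)

    return {
        "success_rate": n_success / n_runs,
        "pass^3": mean_pass(3),
        "pass^5": mean_pass(5),
        # Assumed definition: total spend divided by successful runs.
        "cost_per_success_cents": total_cost_cents / n_success if n_success else float("inf"),
    }


# Hypothetical outcomes for one model: 3 documents, 10 runs each.
runs = {
    "invoice_a": [True] * 10,                # always succeeds
    "invoice_b": [True] * 5 + [False] * 5,   # succeeds half the time
    "receipt_c": [False] * 10,               # never succeeds
}
stats = summarize(runs, total_cost_cents=30.0)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With this made-up data the success rate is 50%, but `pass^5` falls to roughly 33% because the half-reliable document almost never passes five times in a row. That gap is what a consistency metric is meant to expose.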
In the overall rankings, Gemini 3 Flash (Google, BALANCED tier) and Claude Sonnet 4.6 (Anthropic, BALANCED tier) share the top spot at 73.8% success with perfect `pass^3` and `pass^5` consistency, but Gemini 3 Flash costs just 0.67¢ per success compared to Claude Sonnet 4.6's 3.61¢. Claude Opus 4.6 ranks third at 73.0% success for 6.25¢ per success. At the budget end, Gemini 2.5 Flash-Lite achieves 58.6% success for just 0.10¢ per success. OpenAI's models fare poorly on value: GPT-5 ranks 13th overall at 44.6% success and 24.20¢ per success, while GPT-5.4 nano ranks 17th at 23.6% success for 7.05¢ per success. GPT-5 nano finishes last at 8.7% overall success.
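The cost gaps are easier to judge at volume. The sketch below takes the success rates and per-success costs quoted above and projects the spend for a fixed number of successful extractions. The 100,000-document target and the helper names are hypothetical, and the projection assumes the reported cost per success holds at scale, which the benchmark does not claim.

```python
# (success rate %, cost per success in cents), as reported in the article.
reported = {
    "Gemini 3 Flash": (73.8, 0.67),
    "Claude Sonnet 4.6": (73.8, 3.61),
    "Claude Opus 4.6": (73.0, 6.25),
    "Gemini 2.5 Flash-Lite": (58.6, 0.10),
    "GPT-5": (44.6, 24.20),
    "GPT-5.4 nano": (23.6, 7.05),
}

TARGET_SUCCESSES = 100_000  # hypothetical volume, not from the benchmark


def projected_spend_usd(cents_per_success: float, successes: int) -> float:
    """Dollars needed to obtain `successes` successful extractions."""
    return cents_per_success * successes / 100


def cost_ratio(model: str, baseline: str) -> float:
    """How many times more `model` costs per success than `baseline`."""
    return reported[model][1] / reported[baseline][1]


# Cheapest cost per success first.
for model, (success_pct, cents) in sorted(reported.items(), key=lambda kv: kv[1][1]):
    spend = projected_spend_usd(cents, TARGET_SUCCESSES)
    ratio = cost_ratio(model, "Gemini 3 Flash")
    print(f"{model:<22} {success_pct:5.1f}%  {cents:6.2f} cents  ${spend:>10,.2f}  {ratio:5.1f}x")
```

On these figures, Claude Sonnet 4.6 costs about 5.4 times as much per success as Gemini 3 Flash at the same success rate, and GPT-5 costs about 36 times as much at a lower one.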
The benchmark's `pass^n` metric is particularly revealing. Models like GPT-5.4 drop sharply from 49.2% single-run success to just 35.9% on `pass^5`, indicating high run-to-run variance, while models like Gemini 3 Flash maintain identical scores across all three consistency metrics, signaling highly stable outputs. Results are also broken out by document category (receipts, invoices, logistics), and the category-level rankings show some reshuffling: Claude Opus 4.6 leads the receipts sub-benchmark at 84.6% success, while Gemini 3 Flash leads invoices at 91.7%.
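One rough way to read the consistency figures is to ask how much of a model's single-run success rate survives to `pass^5`. The sketch below computes that retention ratio from the reported numbers, alongside a reference value: the `pass^5` a model would score if every document were equally hard and every run independent. Both helpers are illustrative assumptions, not metrics the benchmark publishes.

```python
def retention(success_pct: float, pass_k_pct: float) -> float:
    """Fraction of the single-run success rate that survives to pass^k."""
    return pass_k_pct / success_pct


def independent_baseline(success_pct: float, k: int) -> float:
    """pass^k (in %) if all documents were equally hard and runs independent."""
    return 100 * (success_pct / 100) ** k


# (single-run success %, pass^5 %) as reported in the article.
reported = {
    "GPT-5.4": (49.2, 35.9),
    # The article says Gemini 3 Flash scores identically on all three metrics.
    "Gemini 3 Flash": (73.8, 73.8),
    "GPT-5 nano": (8.7, 4.1),
}

for model, (success_pct, pass5_pct) in reported.items():
    kept = retention(success_pct, pass5_pct)
    baseline = independent_baseline(success_pct, 5)
    print(f"{model:<15} keeps {kept:.0%} of its success rate at pass^5 "
          f"(equal-difficulty baseline: {baseline:.1f}%)")
```

GPT-5.4 keeps roughly 73% of its single-run success rate at `pass^5`, and GPT-5 nano keeps under half. The reported `pass^5` values sit well above the equal-difficulty baseline, which is what one would expect if failures cluster on particular documents rather than striking at random; that is one possible reading, not a finding stated by the benchmark.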
Key facts
- 18 LLMs were tested across 42 business documents in 7,560 total runs, covering receipts, invoices, and logistics categories.
- Gemini 3 Flash and Claude Sonnet 4.6 tied for first overall at 73.8% success, but Gemini 3 Flash costs 0.67¢ per success vs. Claude Sonnet 4.6's 3.61¢.
- GPT-5 ranked 13th overall with a 44.6% success rate and the highest cost per success at 24.20¢.
- GPT-5 nano finished last with just 8.7% overall success and a `pass^5` rate of 4.1%.
- Gemini 2.5 Flash-Lite achieved 58.6% success at only 0.10¢ per success, the lowest cost in the overall leaderboard.