OCR benchmark of 18 LLMs finds cheaper models beat expensive ones
ArbitrAI's OCR Mini-Bench ran 7,560 calls across 18 LLMs on 42 real business documents and found that budget and mid-tier models like Gemini 3 Flash and Claude Sonnet 4.6 match or outperform expensive SOTA models on structured document extraction tasks.
Teams building production OCR pipelines can use this benchmark to avoid overpaying for SOTA models — Gemini 3 Flash matches top-tier accuracy at a fraction of the cost, and the `pass^n` consistency metric helps identify models that are reliable enough for automated workflows.
ArbitrAI's OCR Mini-Bench is a business-focused benchmark designed to evaluate LLM-based document extraction on real operational documents. Across 42 documents, 18 models, and 7,560 total runs, it measures critical-field success, all-field accuracy, `pass^3` and `pass^5` consistency scores, latency, and cost per successful outcome — metrics chosen to reflect production realities rather than one-shot performance.
In the overall leaderboard, Gemini 3 Flash (Google, BALANCED tier) and Claude Sonnet 4.6 (Anthropic, BALANCED tier) share the top spot at 73.8% success with perfect `pass^3` and `pass^5` consistency, but Gemini 3 Flash costs just 0.67¢ per success at 16.0s latency versus Claude Sonnet 4.6's 3.61¢ at 18.9s. Claude Opus 4.6 ranks third at 73.0% success for 6.25¢/success. By contrast, GPT-5 ranks 13th at 44.6% success and a steep 24.20¢ per success, and GPT-5.4 ranks 10th at 49.2% success for 7.82¢/success — both far behind on cost-efficiency.
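The cost gap is easier to judge as a multiple of the cheapest option. The sketch below takes the cost-per-success figures quoted above and expresses each as a ratio to the lowest one. The helper `cost_per_success` shows one plausible definition of the metric (total spend, failed runs included, divided by the number of successful runs). That definition and both function names are assumptions for illustration, not ArbitrAI's published method.

```python
from __future__ import annotations

# Cost per success in US cents, as quoted in the leaderboard summary.
COST_PER_SUCCESS_CENTS = {
    "Gemini 3 Flash": 0.67,
    "Claude Sonnet 4.6": 3.61,
    "Claude Opus 4.6": 6.25,
    "GPT-5.4": 7.82,
    "GPT-5": 24.20,
}


def cost_per_success(total_cost_cents: float, successes: int) -> float:
    """Assumed definition: total spend across all runs / successful runs."""
    if successes <= 0:
        raise ValueError("cost per success is undefined with no successes")
    return total_cost_cents / successes


def relative_cost(costs: dict[str, float]) -> dict[str, float]:
    """Express each cost as a multiple of the cheapest entry (1 decimal)."""
    baseline = min(costs.values())
    return {name: round(cost / baseline, 1) for name, cost in costs.items()}


# Print the models from cheapest to most expensive per success.
for name, multiple in sorted(relative_cost(COST_PER_SUCCESS_CENTS).items(),
                             key=lambda item: item[1]):
    print(f"{name}: {multiple}x")
```

On these figures, Claude Sonnet 4.6 costs about 5.4 times as much per success as Gemini 3 Flash for the same 73.8% score, and GPT-5 costs about 36 times as much for a lower one.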
Budget models also show competitive results in specific categories. On receipts, Gemini 2.5 Flash-Lite achieves 53.9% success at just 0.06¢/success. On invoices, Gemini 3 Flash tops the chart at 91.7% success. The `pass^n` metric reveals meaningful reliability gaps: GPT-5.4's overall success drops from 49.2% to 35.9% at `pass^5`, indicating high run-to-run variance, while several Gemini models maintain identical scores across `pass^3` and `pass^5`, signaling strong consistency.
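To make `pass^n` concrete, here is a minimal sketch of one common way to estimate it from repeated runs. For a document with `c` successes out of `k` runs, the chance that `n` runs drawn without replacement all succeed is `C(c, n) / C(k, n)`, and the score is the average over documents. The benchmark's exact estimator is not described here, so treat this as an assumption. The figure of 10 runs per model per document is likewise only an inference from 7,560 / (42 × 18), assuming runs are spread evenly. The function name `pass_pow_n` and the sample data are hypothetical.

```python
from __future__ import annotations

from math import comb


def pass_pow_n(runs_by_doc: dict[str, list[bool]], n: int) -> float:
    """Estimate pass^n: mean over documents of C(c, n) / C(k, n).

    runs_by_doc maps a document id to its per-run outcomes (True = success).
    This combinatorial estimator is an assumption, not the benchmark's
    documented formula.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not runs_by_doc:
        raise ValueError("no documents supplied")
    total = 0.0
    for doc_id, outcomes in runs_by_doc.items():
        k = len(outcomes)
        if k < n:
            raise ValueError(f"{doc_id} has only {k} runs; need at least {n}")
        c = sum(outcomes)  # number of successful runs for this document
        total += comb(c, n) / comb(k, n)  # comb(c, n) is 0 when c < n
    return total / len(runs_by_doc)


# Hypothetical data: one document always succeeds, one succeeds half the time.
example = {
    "doc_a": [True] * 10,
    "doc_b": [True] * 5 + [False] * 5,
}
for n in (1, 3, 5):
    print(f"pass^{n} = {pass_pow_n(example, n):.3f}")
# prints 0.750, 0.542, 0.502
```

In this made-up example the flaky document contributes almost nothing at `n = 5`, so the score falls from 0.750 toward the share of documents that succeed every time. A model whose documents either always pass or always fail would show the same value at `pass^3` and `pass^5`.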
Key facts
- 42 business documents, 18 models, and 7,560 total runs were used in the benchmark
- Gemini 3 Flash and Claude Sonnet 4.6 tied for first overall at 73.8% success rate
- Gemini 3 Flash costs 0.67¢ per success; Claude Sonnet 4.6 costs 3.61¢ per success for the same score
- GPT-5 ranked 13th overall at 44.6% success and 24.20¢ per success, the highest cost per success in the leaderboard
- The `pass^n` metric measures the probability of n consecutive successes, exposing consistency issues hidden by single-run scores
- GPT-5.4's success rate drops from 49.2% to 35.9% at `pass^5`, indicating high variance
- Gemini 2.5 Flash-Lite achieves 0.06¢ per success on receipts, the lowest cost in that category