OMIBench tests LVLMs on multi-image Olympiad reasoning
Researchers introduce OMIBench, a benchmark that evaluates large vision-language models (LVLMs) on Olympiad-level problems requiring reasoning across multiple images. The best-performing model tested, Gemini-3-Pro, reaches only about 50% accuracy.
Teams building or evaluating LVLMs for complex scientific reasoning should adopt OMIBench to identify weaknesses in multi-image context synthesis — a capability that single-image benchmarks systematically overlook.
Qiguang Chen, Chengyu Luan, and Jiajun Wu present OMIBench, a new benchmark designed to address a blind spot in current Olympiad-level multimodal evaluation: the reliance on single-image analysis. Existing benchmarks for large vision-language models (LVLMs) largely ignore scenarios where solving a problem requires synthesizing contextual information spread across multiple images. OMIBench fills this gap by assembling problems from four scientific disciplines — biology, chemistry, mathematics, and physics — sourced from Olympiad competitions.
The benchmark is accompanied by manually annotated rationales and supports two evaluation modes: exact answer matching and semantic answer matching, allowing for more nuanced assessment of model responses. Experiments across a range of LVLMs reveal meaningful performance gaps, with the best-performing model, Gemini-3-Pro, reaching only approximately 50% accuracy. The authors position OMIBench as a focused resource for the research community to study and improve multi-image reasoning capabilities in LVLMs.
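To make the difference between the two evaluation modes concrete, here is a minimal sketch of how exact and semantic answer matching can diverge on the same predictions. It is not OMIBench's actual scoring code: the function names, the normalization rules, and the toy `numeric_judge` are all assumptions for illustration. In practice, semantic matching is often delegated to a stronger judge (for example, another model), which is why the judge is passed in as a pluggable callable here.

```python
import math
import re
import unicodedata
from fractions import Fraction
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Hypothetical normalization: NFKC, lowercase, collapse whitespace, trim edge dots."""
    text = unicodedata.normalize("NFKC", answer).strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip(" .")


def exact_match(pred: str, gold: str) -> bool:
    """Exact mode: answers must be identical after normalization."""
    return normalize(pred) == normalize(gold)


def numeric_judge(pred: str, gold: str) -> bool:
    """Toy stand-in for a semantic judge: equal if both parse to the same number."""
    try:
        return math.isclose(
            float(Fraction(pred.strip())),
            float(Fraction(gold.strip())),
            rel_tol=1e-6,
        )
    except (ValueError, ZeroDivisionError):
        return False  # not parseable as numbers, so this judge cannot confirm equivalence


def semantic_match(pred: str, gold: str, judge: Callable[[str, str], bool]) -> bool:
    """Semantic mode: accept exact matches, otherwise defer to the supplied judge."""
    return exact_match(pred, gold) or judge(pred, gold)


def accuracy(
    preds: Sequence[str],
    golds: Sequence[str],
    match: Callable[[str, str], bool],
) -> float:
    """Fraction of predictions accepted by the given matching function."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    return sum(match(p, g) for p, g in zip(preds, golds)) / len(preds)


if __name__ == "__main__":
    preds = ["B", "0.5", "42"]   # made-up model outputs
    golds = ["b", "1/2", "41"]   # made-up reference answers
    print(accuracy(preds, golds, exact_match))
    print(accuracy(preds, golds, lambda p, g: semantic_match(p, g, numeric_judge)))
```

In this made-up example, the answer `0.5` is rejected under exact matching but accepted under semantic matching because it is numerically equal to `1/2`, so the same predictions score differently depending on the protocol.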
Key facts
- OMIBench is a new benchmark for evaluating Olympiad-level reasoning that requires evidence distributed across multiple images.
- It covers problems from biology, chemistry, mathematics, and physics Olympiads.
- The benchmark includes manually annotated rationales for each problem.
- Evaluation supports both exact and semantic answer matching protocols.
- The strongest tested model, Gemini-3-Pro, achieves only about 50% accuracy on OMIBench.
- The authors criticize current Olympiad-level multimodal benchmarks for focusing on single-image analysis and ignoring cross-image context.