Apr 22, 2026 · 1 min read · Research Papers

OMIBench tests LVLMs on multi-image Olympiad reasoning

Researchers introduce OMIBench, a benchmark evaluating large vision-language models on Olympiad-level problems that require reasoning across multiple images — where even the strongest models reach only about 50% accuracy.

arXiv · Qiguang Chen, Chengyu Luan, Jiajun Wu

Composite

5.3 out of 10

Novelty · 25%

Impact · 43%

Credibility · 12%

Depth · 20%

Weights applied.

Why it matters

Teams building or evaluating LVLMs for complex scientific reasoning should adopt OMIBench to identify weaknesses in multi-image context synthesis — a capability that single-image benchmarks systematically overlook.

Summary: our read of the original

Qiguang Chen, Chengyu Luan, and Jiajun Wu present OMIBench, a new benchmark designed to address a blind spot in current Olympiad-level multimodal evaluation: the reliance on single-image analysis. Existing benchmarks for large vision-language models (LVLMs) largely ignore scenarios where solving a problem requires synthesizing contextual information spread across multiple images. OMIBench fills this gap by assembling problems from four scientific disciplines — biology, chemistry, mathematics, and physics — sourced from Olympiad competitions.

The benchmark is accompanied by manually annotated rationales and supports two evaluation modes: exact answer matching and semantic answer matching, allowing for more nuanced assessment of model responses. Experiments across a range of LVLMs reveal meaningful performance gaps, with the best-performing model, Gemini-3-Pro, reaching only approximately 50% accuracy. The authors position OMIBench as a focused resource for the research community to study and improve multi-image reasoning capabilities in LVLMs.
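To make the difference between the two evaluation modes concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not describe how OMIBench normalizes answers or decides semantic equivalence, so `normalize`, `exact_match`, `semantic_match`, `accuracy`, and `toy_judge` are all hypothetical names, and the judge is a pluggable callable standing in for whatever equivalence check the benchmark actually uses.

```python
import re
from fractions import Fraction
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: a judge returns True when two answers mean the same thing.
Judge = Callable[[str, str], bool]


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation (assumed rules)."""
    text = re.sub(r"\s+", " ", answer.strip().lower())
    return text.rstrip(".;, ")


def exact_match(prediction: str, reference: str) -> bool:
    """Exact mode: the normalized strings must be identical."""
    return normalize(prediction) == normalize(reference)


def semantic_match(prediction: str, reference: str, judge: Judge) -> bool:
    """Semantic mode: accept exact matches, otherwise defer to the judge."""
    return exact_match(prediction, reference) or judge(prediction, reference)


def accuracy(pairs: Iterable[Tuple[str, str]], judge: Optional[Judge] = None) -> float:
    """Fraction of (prediction, reference) pairs counted correct.

    With no judge, only exact matching is used.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = 0
    for prediction, reference in pairs:
        if judge is None:
            hits += exact_match(prediction, reference)
        else:
            hits += semantic_match(prediction, reference, judge)
    return hits / len(pairs)


def toy_judge(prediction: str, reference: str) -> bool:
    """Toy stand-in for a semantic judge: equal if both parse to the same number."""
    try:
        return Fraction(prediction.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return False


if __name__ == "__main__":
    # Illustrative data only; these are not OMIBench problems.
    results = [("1/2", "0.5"), ("B.", "b"), ("3", "4")]
    print("exact:", accuracy(results))
    print("semantic:", accuracy(results, judge=toy_judge))
```

In this sketch, "1/2" against a reference of "0.5" fails exact matching but passes semantic matching, which is the kind of case a second protocol is meant to catch. A real semantic judge would more plausibly be a model-based or rule-based equivalence checker than a numeric parser.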

Key facts

01 OMIBench is a new benchmark for evaluating Olympiad-level reasoning that requires evidence distributed across multiple images.
02 It covers problems from biology, chemistry, mathematics, and physics Olympiads.
03 The benchmark includes manually annotated rationales for each problem.
04 Evaluation supports both exact and semantic answer matching protocols.
05 The strongest tested model, Gemini-3-Pro, achieves only about 50% accuracy on OMIBench.
06 Current Olympiad-level multimodal benchmarks are criticized for focusing on single-image analysis and ignoring cross-image context.

Topics

#benchmarks #vision-language-models #multi-modal-reasoning #olympiad-level #evaluation

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 23, 2026 · 11:04 UTC.
