SpatialWorld benchmark tests multimodal agents on real-world spatial tasks
SpatialWorld is a new benchmark evaluating interactive spatial reasoning in multimodal agents across 760 real-world tasks, with even the strongest model, GPT-5, achieving only a 17.4% task success rate.
Score breakdown
Even GPT-5 tops out at a 17.4% task success rate (TSR), which shows how far current MLLMs are from reliable spatial reasoning. The benchmark gives practitioners a rigorous testbed for measuring progress on active exploration and long-horizon planning.
Hongcheng Gao, Hailong Qu, and Jingyi Tang introduce SpatialWorld, a benchmark designed to address a key gap in evaluating multimodal large language models (MLLMs): their ability to perform interactive spatial reasoning in complex, real-world settings. The authors argue that existing benchmarks fall short because they rely on passive evaluation formats like static visual question answering (VQA) or are tied to simulator-specific pipelines, neither of which captures general interactive spatial understanding. SpatialWorld addresses this by integrating eight heterogeneous simulation backends under a single, simulator-agnostic protocol, enabling broad and consistent evaluation.
The benchmark comprises 760 human-annotated tasks across diverse domains including household routines, travel, and social collaboration. Agents must operate under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions through a unified, text-based action interface native to MLLMs. Each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier to ensure reliable evaluation.
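The task structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SpatialWorld's actual API: the names `Backend`, `Task`, `ToyBackend`, `run_episode`, and `replay_policy` are hypothetical, the action strings are invented, and a short text observation stands in for the egocentric image a real agent would receive.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

# Hypothetical: simulator state modeled as a flat key/value map.
State = Dict[str, str]


class Backend(Protocol):
    """Assumed simulator-agnostic interface; any backend exposing these fits."""

    def reset(self, initial_state: State) -> str: ...
    def step(self, action: str) -> str: ...
    def state(self) -> State: ...


@dataclass
class Task:
    instruction: str
    initial_state: State                 # human-validated start state
    reference_trajectory: List[str]      # one known-good action sequence
    verifier: Callable[[State], bool]    # terminal-state check


class ToyBackend:
    """Tiny stand-in simulator: one agent, one lamp."""

    def reset(self, initial_state: State) -> str:
        self._state = dict(initial_state)
        return self._observe()

    def step(self, action: str) -> str:
        verb, _, arg = action.partition(" ")
        if verb == "goto":
            self._state["agent_at"] = arg
        elif verb == "toggle" and self._state["agent_at"] == "lamp":
            self._state["lamp"] = "on" if self._state["lamp"] == "off" else "off"
        return self._observe()

    def state(self) -> State:
        return dict(self._state)

    def _observe(self) -> str:
        # Partial observability: the agent only learns where it stands.
        return f"You are at {self._state['agent_at']}."


def run_episode(backend: Backend, task: Task,
                policy: Callable[[str, str], str], max_steps: int = 20) -> bool:
    """Run one episode; success is decided only by the terminal state."""
    observation = backend.reset(task.initial_state)
    for _ in range(max_steps):
        action = policy(task.instruction, observation)
        if action == "stop":
            break
        observation = backend.step(action)
    return task.verifier(backend.state())


def replay_policy(trajectory: List[str]) -> Callable[[str, str], str]:
    """Policy that replays a fixed action list, then stops."""
    steps = iter(trajectory + ["stop"])
    return lambda instruction, observation: next(steps)


task = Task(
    instruction="Turn on the lamp.",
    initial_state={"agent_at": "door", "lamp": "off"},
    reference_trajectory=["goto lamp", "toggle lamp"],
    verifier=lambda s: s["lamp"] == "on",
)
success = run_episode(ToyBackend(), task, replay_policy(task.reference_trajectory))
```

Because the verifier inspects only the final state, an agent that reaches the goal by a different route than the reference trajectory still counts as successful in this sketch.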
An evaluation of 15 advanced agents reveals that robust spatial task solving remains a significant challenge. The strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. The analysis also uncovers a mismatch between task success and execution efficiency, as well as substantial domain-specific performance variations, pointing to bottlenecks in active exploration and long-horizon planning as key areas for future research.
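As a rough illustration of how a task success rate and its per-domain breakdown might be computed, the sketch below assumes TSR is simply the percentage of tasks whose verifier passes. The source does not spell out how the average is taken, so that is an assumption. The records are made-up toy data, not results from the paper, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def task_success_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks that ended in a verified success."""
    return 100.0 * sum(outcomes) / len(outcomes)


def per_domain_tsr(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (domain, success) records by domain and compute TSR for each."""
    by_domain = defaultdict(list)
    for domain, success in records:
        by_domain[domain].append(success)
    return {domain: task_success_rate(v) for domain, v in by_domain.items()}


# Toy records for illustration only (NOT data from the benchmark).
records = [
    ("household", True), ("household", False),
    ("household", False), ("household", False),
    ("travel", False), ("travel", False),
    ("social", True), ("social", False),
]

overall = task_success_rate([success for _, success in records])
breakdown = per_domain_tsr(records)
```

A per-domain view like `breakdown` is what makes domain-specific variation visible: two agents with the same overall rate can differ sharply once results are split by domain.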
Key facts
- SpatialWorld is a new benchmark for evaluating interactive spatial reasoning in multimodal large language models (MLLMs).
- It integrates eight heterogeneous simulation backends under a shared, simulator-agnostic protocol.
- The benchmark contains 760 human-annotated tasks across domains such as household routines, travel, and social collaboration.
- Agents operate under vision-only partial observability using a unified, text-based action interface.
- Each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier.
- GPT-5, the strongest of the 15 agents evaluated, achieves a task success rate (TSR) of only 17.4%.
- The leading open-source model, Qwen-3.5, reaches a TSR of 14.1%.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 09:19 UTC.