Jun 8, 2026 · 1 min read · Research Papers

SpatialWorld benchmark tests multimodal agents on real-world spatial tasks

SpatialWorld is a new benchmark evaluating interactive spatial reasoning in multimodal agents across 760 real-world tasks, with even the strongest model, GPT-5, achieving only a 17.4% task success rate.

arXiv · Hongcheng Gao, Hailong Qu, Jingyi Tang

Composite score: 6.7 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Even GPT-5, the strongest model tested, tops out at a 17.4% task success rate (TSR), which shows how far current MLLMs are from reliable spatial reasoning. The benchmark gives practitioners a rigorous testbed for measuring progress on active exploration and long-horizon planning.

Summary: our read of the original

Hongcheng Gao, Hailong Qu, and Jingyi Tang introduce SpatialWorld, a benchmark designed to address a key gap in evaluating multimodal large language models (MLLMs): their ability to perform interactive spatial reasoning in complex, real-world settings. The authors argue that existing benchmarks fall short because they rely on passive evaluation formats like static visual question answering (VQA) or are tied to simulator-specific pipelines, neither of which captures general interactive spatial understanding. SpatialWorld addresses this by integrating eight heterogeneous simulation backends under a single, simulator-agnostic protocol, enabling broad and consistent evaluation.
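
A simulator-agnostic protocol of this kind can be pictured as a thin adapter interface that every backend implements, so that one evaluation loop drives all of them. The sketch below is only an illustration under stated assumptions: `SpatialBackend`, `Observation`, `ToyCorridor`, and the action strings are hypothetical names invented here, not the paper's API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Observation:
    """Egocentric view handed to the agent (hypothetical structure)."""
    image: bytes        # encoded egocentric frame; the agent receives nothing else
    done: bool = False  # True once the episode has reached its goal


class SpatialBackend(Protocol):
    """Hypothetical interface that each simulator adapter would implement."""

    def reset(self, task_id: str) -> Observation: ...

    def step(self, action: str) -> Observation: ...


class ToyCorridor:
    """Stand-in backend: a 1-D corridor whose goal sits at position 3."""

    def __init__(self) -> None:
        self.pos = 0

    def reset(self, task_id: str) -> Observation:
        self.pos = 0
        return self._observe()

    def step(self, action: str) -> Observation:
        # Actions arrive as plain text; unknown actions are ignored.
        if action == "move forward":
            self.pos += 1
        elif action == "move back":
            self.pos = max(0, self.pos - 1)
        return self._observe()

    def _observe(self) -> Observation:
        return Observation(image=f"frame@{self.pos}".encode(), done=self.pos == 3)


def run_episode(backend: SpatialBackend, actions: list[str]) -> bool:
    """Drive any backend through the same text-action loop."""
    obs = backend.reset("demo-task")
    for action in actions:
        if obs.done:
            break
        obs = backend.step(action)
    return obs.done
```

Because `run_episode` depends only on the interface, swapping `ToyCorridor` for an adapter around a real simulator would leave the evaluation loop unchanged, which is the property a shared protocol is meant to provide.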

The benchmark comprises 760 human-annotated tasks across diverse domains including household routines, travel, and social collaboration. Agents must operate under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions through a unified, text-based action interface native to MLLMs. Each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier to ensure reliable evaluation.
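
The three per-task components described above (initial state, reference trajectory, terminal-state verifier) could be represented roughly as follows. This is a minimal sketch, not the benchmark's actual schema: the `Task` fields, the dictionary-based state, and the `move <object> to <place>` action grammar are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical state representation: object name -> current location.
State = dict[str, str]


@dataclass
class Task:
    task_id: str
    initial_state: State               # human-validated starting configuration
    reference_trajectory: list[str]    # one known-good sequence of text actions
    verifier: Callable[[State], bool]  # checks only the terminal state


def apply_action(state: State, action: str) -> State:
    """Toy transition: 'move <object> to <place>' relocates an object."""
    parts = action.split()
    if len(parts) == 4 and parts[0] == "move" and parts[2] == "to":
        return {**state, parts[1]: parts[3]}
    return state  # unrecognised actions leave the state unchanged


def replay(task: Task, actions: list[str]) -> bool:
    """Run actions from the initial state and verify where they end up."""
    state = dict(task.initial_state)
    for action in actions:
        state = apply_action(state, action)
    return task.verifier(state)


demo = Task(
    task_id="demo-001",
    initial_state={"cup": "table", "book": "shelf"},
    reference_trajectory=["move cup to sink"],
    verifier=lambda s: s.get("cup") == "sink",
)
```

Verifying the terminal state rather than matching the reference trajectory step by step means an agent that takes a different route, such as `["move cup to floor", "move cup to sink"]`, still passes, while the reference trajectory remains useful as a sanity check that the task is solvable.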

An evaluation of 15 advanced agents reveals that robust spatial task solving remains a significant challenge. The strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. The analysis also uncovers a mismatch between task success and execution efficiency, as well as substantial domain-specific performance variations, pointing to bottlenecks in active exploration and long-horizon planning as key areas for future research.
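
To make the distinction between success and efficiency concrete, the sketch below computes an overall TSR, a per-domain TSR, and a simple step-efficiency ratio from episode records. The summary does not say how the paper measures efficiency, so the ratio used here (reference steps divided by steps taken, over successful episodes only) is an assumption, and the episode data is invented purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    domain: str
    success: bool
    steps_taken: int      # steps the agent actually used
    reference_steps: int  # length of the task's reference trajectory


def task_success_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes whose terminal state passed verification."""
    if not episodes:
        return 0.0
    return sum(e.success for e in episodes) / len(episodes)


def tsr_by_domain(episodes: list[Episode]) -> dict[str, float]:
    """TSR computed separately for each task domain."""
    groups: dict[str, list[Episode]] = defaultdict(list)
    for e in episodes:
        groups[e.domain].append(e)
    return {domain: task_success_rate(group) for domain, group in groups.items()}


def mean_step_efficiency(episodes: list[Episode]) -> float:
    """Assumed metric: reference length / steps taken, successful episodes only."""
    ratios = [
        min(1.0, e.reference_steps / e.steps_taken)
        for e in episodes
        if e.success and e.steps_taken > 0
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Made-up episodes, not results from the paper.
episodes = [
    Episode("household", True, 20, 10),
    Episode("household", False, 30, 12),
    Episode("travel", False, 25, 8),
    Episode("travel", False, 40, 15),
]
```

Reporting the two numbers separately matters because they can diverge: in this toy data the one successful episode took twice as many steps as its reference, so a model can solve a task while still exploring inefficiently.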

Key facts

01 SpatialWorld is a new benchmark for evaluating interactive spatial reasoning in multimodal large language models (MLLMs).
02 It integrates eight heterogeneous simulation backends under a shared, simulator-agnostic protocol.
03 The benchmark contains 760 human-annotated tasks across domains such as household routines, travel, and social collaboration.
04 Agents operate under vision-only partial observability using a unified, text-based action interface.
05 Each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier.
06 GPT-5, the strongest model evaluated, achieves a task success rate (TSR) of only 17.4%.
07 Of the 15 agents evaluated, the leading open-source model, Qwen-3.5, reaches a TSR of 14.1%.

Topics

#benchmarks #multimodal-agents #spatial-reasoning #evaluation #agent-frameworks

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 09:19 UTC.
