Jun 9, 2026 · 1 min read · Research Papers

PhysTool-Bench exposes major gaps in MLLM physical tool use

A new benchmark called PhysTool-Bench reveals that even the best multimodal large language models struggle severely with physical tool recognition and planning, with the top model (Gemini-3.1-Pro) identifying only 58.7% of tools and completing just 21.0% of queries end-to-end.

HuggingFace Papers


Composite score: 4.8 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

PhysTool-Bench quantifies a critical and previously underexplored gap between MLLMs' strong digital API performance and their weak physical tool comprehension, pinpointing specific bottlenecks — perception and functional commonsense — that limit the development of practical embodied AI.


Summary: our read of the original

PhysTool-Bench is introduced as the first benchmark specifically targeting MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use — a capability described as central to embodied AI systems that instruct robots to interact with the physical world. The benchmark comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains such as manufacturing, electrical work, agriculture, and healthcare. Models are evaluated along two primary dimensions: recognizing all physical tools present in a scene, and planning the tool selection and use sequence based on instruction and visual context.


Testing across 13 leading MLLMs revealed stark performance gaps. The strongest model, Gemini-3.1-Pro, identified only 58.7% of tools in a scene and completed merely 21.0% of queries end-to-end. The analysis identifies a two-level deficit: MLLMs first struggle to perceive tools in realistic scenes, and then suffer a much larger performance drop at the planning stage, indicating a lack of functional commonsense for mapping perceived tools onto task semantics. The paper frames this as a critical bottleneck for the development of practical embodied AI.
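To make the two headline numbers concrete, here is a minimal Python sketch of how a tool-identification rate and an end-to-end completion rate could be computed. The scoring rules are assumptions for illustration, not the paper's published protocol: identification is taken as micro-averaged recall over annotated tools, and a query counts as completed only if every annotated tool is found and the predicted plan exactly matches the reference plan. The `QueryResult` type, its field names, and the toy data are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QueryResult:
    """One benchmark query (hypothetical schema, not from the paper)."""
    gold_tools: set[str]        # tools annotated as present in the scene
    predicted_tools: set[str]   # tools the model reported seeing
    gold_plan: list[str]        # reference tool-use sequence
    predicted_plan: list[str]   # tool-use sequence proposed by the model


def tool_recall(results: list[QueryResult]) -> float:
    """Share of annotated tools the model identified, pooled over all queries."""
    total = sum(len(r.gold_tools) for r in results)
    found = sum(len(r.gold_tools & r.predicted_tools) for r in results)
    return found / total if total else 0.0


def end_to_end_rate(results: list[QueryResult]) -> float:
    """Share of queries where all tools were found AND the plan matches exactly.

    Exact-match planning is an assumed criterion; the paper may score plans differently.
    """
    if not results:
        return 0.0
    solved = sum(
        1
        for r in results
        if r.gold_tools <= r.predicted_tools and r.predicted_plan == r.gold_plan
    )
    return solved / len(results)


# Toy data: the first query is fully solved, the second misses a tool and a plan step.
toy_results = [
    QueryResult(
        gold_tools={"wrench", "pliers"},
        predicted_tools={"wrench", "pliers", "hammer"},
        gold_plan=["pliers", "wrench"],
        predicted_plan=["pliers", "wrench"],
    ),
    QueryResult(
        gold_tools={"multimeter", "screwdriver"},
        predicted_tools={"screwdriver"},
        gold_plan=["multimeter", "screwdriver"],
        predicted_plan=["screwdriver"],
    ),
]

print(f"tool recall: {tool_recall(toy_results):.2f}")
print(f"end-to-end:  {end_to_end_rate(toy_results):.2f}")
```

Under these assumptions the end-to-end rate can never exceed what perception allows, since a single missed tool fails the query. That is one way a moderate identification rate could coexist with a much lower completion rate, in line with the two-level deficit described above.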

Key facts

01. PhysTool-Bench is described as the first benchmark for evaluating MLLMs on physical tool use.
02. The benchmark contains 2,510 queries covering 2,678 real-world physical tools.
03. Domains covered include manufacturing, electrical work, agriculture, and healthcare.
04. 13 leading MLLMs were evaluated on the benchmark.
05. The top-performing model, Gemini-3.1-Pro, identified only 58.7% of tools in a scene.
06. Gemini-3.1-Pro completed just 21.0% of queries end-to-end.
07. A two-level deficit was identified: tool perception failures and a larger drop at the planning stage due to lack of functional commonsense.

Topics

#benchmarks #tool-use #mllm-evaluation #embodied-ai

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 10, 2026 · 15:34 UTC.
