PhysTool-Bench exposes major gaps in MLLM physical tool use
A new benchmark called PhysTool-Bench reveals that even the best multimodal large language models struggle severely with physical tool recognition and planning, with the top model (Gemini-3.1-Pro) identifying only 58.7% of tools and completing just 21.0% of queries end-to-end.
PhysTool-Bench quantifies a critical and previously underexplored gap between MLLMs' strong digital API performance and their weak physical tool comprehension, pinpointing specific bottlenecks — perception and functional commonsense — that limit the development of practical embodied AI.
PhysTool-Bench is introduced as the first benchmark specifically targeting MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use — a capability described as central to embodied AI systems that instruct robots to interact with the physical world. The benchmark comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains such as manufacturing, electrical work, agriculture, and healthcare. Models are evaluated along two primary dimensions: recognizing all physical tools present in a scene, and planning the tool selection and use sequence based on instruction and visual context.
Testing across 13 leading MLLMs revealed stark performance gaps. The strongest model, Gemini-3.1-Pro, identified only 58.7% of tools in a scene and completed merely 21.0% of queries end-to-end. The analysis identifies a two-level deficit: MLLMs first struggle to perceive tools in realistic scenes, and then suffer a much larger performance drop at the planning stage, indicating a lack of functional commonsense for mapping perceived tools onto task semantics. The paper frames this as a critical bottleneck for the development of practical embodied AI.
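To make the two reported numbers concrete, here is a minimal sketch of how a tool-identification rate and an end-to-end completion rate could be computed. It is an illustration only, not the benchmark's actual scoring code: the `QueryResult` schema, the pooled-recall definition, and the rule that a query counts as completed only when every tool is recognized and the predicted plan matches the reference exactly are all assumptions, since the summary does not state the paper's precise metric definitions. The sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    """One benchmark query (hypothetical schema, not from the paper)."""
    tools_in_scene: set[str]    # tools annotated as present in the image
    tools_predicted: set[str]   # tools the model reported seeing
    plan_expected: list[str]    # reference tool-use sequence
    plan_predicted: list[str]   # tool-use sequence the model proposed


def tool_recall(results: list[QueryResult]) -> float:
    """Share of annotated tools the model identified, pooled over all queries."""
    found = sum(len(r.tools_in_scene & r.tools_predicted) for r in results)
    total = sum(len(r.tools_in_scene) for r in results)
    return found / total if total else 0.0


def end_to_end_rate(results: list[QueryResult]) -> float:
    """Share of queries with every tool recognized AND an exactly matching plan.

    This strict definition is an assumption made for illustration.
    """
    if not results:
        return 0.0
    solved = sum(
        r.tools_in_scene <= r.tools_predicted and r.plan_predicted == r.plan_expected
        for r in results
    )
    return solved / len(results)


# Invented example data: one fully solved query, one with a missed tool.
results = [
    QueryResult({"wrench", "pliers"}, {"wrench", "pliers"}, ["wrench"], ["wrench"]),
    QueryResult(
        {"multimeter", "wire stripper"},
        {"multimeter"},
        ["wire stripper", "multimeter"],
        ["multimeter"],
    ),
]

print(f"tool recall: {tool_recall(results):.2f}")        # 3 of 4 tools found
print(f"end-to-end rate: {end_to_end_rate(results):.2f}")  # 1 of 2 queries solved
```

Under these assumptions a missed tool lowers recall only slightly but fails the whole query, which is one way an end-to-end rate can sit far below the identification rate.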
Key facts
- PhysTool-Bench is described as the first benchmark for evaluating MLLMs on physical tool use.
- The benchmark contains 2,510 queries covering 2,678 real-world physical tools.
- Domains covered include manufacturing, electrical work, agriculture, and healthcare.
- 13 leading MLLMs were evaluated on the benchmark.
- The top-performing model, Gemini-3.1-Pro, identified only 58.7% of tools in a scene.
- Gemini-3.1-Pro completed just 21.0% of queries end-to-end.
- A two-level deficit was identified: tool perception failures and a larger drop at the planning stage due to lack of functional commonsense.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 10, 2026 · 15:34 UTC.