ToolMenuBench benchmarks how tool visibility shapes LLM agent reliability
Researchers introduce ToolMenuBench, a benchmark evaluating how the set of tools shown to LLM agents affects task success, error rates, token usage, and risky-tool exposure across seven model backends and six filtering methods.
Score breakdown
The benchmark demonstrates that tool-menu composition, not just model capability, is a primary driver of agent task success, error rates, and safety-relevant risk exposure. Causal minimal tool filtering (CMTF) cut average token usage by roughly 98% and more than doubled task success relative to the unfiltered all-tools baseline.
Rahul Suresh Babu and Laxmipriya Ganesh Iyer present ToolMenuBench, a benchmark designed to address a gap in existing LLM agent evaluations: most prior work asks whether a model can call a tool correctly, but largely ignores how the composition of the visible tool menu influences downstream agent behavior. ToolMenuBench systematically varies tool-menu size, distractor type, state-dependent task structure, and risk exposure, and tracks both filter-level metrics (visible-tool count, risky-tool exposure) and downstream agent metrics (task success, wrong-tool calls, premature actions, and token usage).
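To make the two metric families concrete, here is a minimal sketch of how they might be computed from a single logged episode. Everything in it is an illustrative assumption rather than the benchmark's actual harness: the `Tool` and `Episode` types, the example tool names, and in particular the definitions used for a wrong-tool call (a call to a tool outside the required sequence) and a premature action (a required tool called before its predecessors have completed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    risky: bool = False


@dataclass
class Episode:
    visible: list    # Tool objects shown to the agent (the tool menu)
    calls: list      # tool names the agent called, in order
    required: list   # hypothetical gold sequence of tool names
    succeeded: bool
    tokens: int


def episode_metrics(ep):
    """Return filter-level and agent-level metrics for one episode."""
    needed = set(ep.required)
    completed = set()
    wrong = premature = 0
    for call in ep.calls:
        if call not in needed:
            wrong += 1          # assumed definition: tool is not part of the task
            continue
        position = ep.required.index(call)
        if all(step in completed for step in ep.required[:position]):
            completed.add(call)
        else:
            premature += 1      # assumed definition: called before its prerequisites
    return {
        # filter-level metrics depend only on the menu
        "visible_tools": len(ep.visible),
        "risky_exposed": sum(1 for t in ep.visible if t.risky),
        # agent-level metrics depend on what the agent did with the menu
        "task_success": ep.succeeded,
        "wrong_tool_calls": wrong,
        "premature_actions": premature,
        "tokens": ep.tokens,
    }


example = Episode(
    visible=[
        Tool("search_orders"),
        Tool("refund_order", risky=True),
        Tool("delete_account", risky=True),
        Tool("send_email"),
    ],
    calls=["refund_order", "send_email", "search_orders", "refund_order"],
    required=["search_orders", "refund_order"],
    succeeded=True,
    tokens=1200,
)
metrics = episode_metrics(example)
```

In this made-up episode the agent tries the refund before looking up the order (one premature action) and sends an unneeded email (one wrong-tool call), even though the task eventually succeeds.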
The benchmark spans a controlled evaluation across seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings. The standout result is for causal minimal tool filtering (CMTF), which improved task success from 32.1% under all-tools exposure to 85.7%, while reducing average token usage by roughly 98%. CMTF also achieved the strongest overall tradeoff across all tracked dimensions — reducing visible tools, wrong-tool calls, premature actions, and risky-tool exposure — relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines. The authors frame ToolMenuBench as a reusable evaluation framework for the "agent-interface problem": determining which tools should be visible, when they should be visible, and under what cost or risk constraints.
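The summary does not describe how CMTF is implemented, so the following is not the authors' method. It is a generic sketch of one way to answer "which tools should be visible, and when": backward-chain from the goal to find the tools that contribute to it, then expose only those whose preconditions already hold in the current state. The `ToolSpec` type, the `requires`/`provides` fields, and the tool names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    requires: frozenset = frozenset()   # state facts that must hold before the call
    provides: frozenset = frozenset()   # state facts the call establishes
    risky: bool = False


def needed_tools(tools, state, goal):
    """Backward-chain from unmet goal facts to the tools that can establish them."""
    provider = {}
    for tool in tools:
        for fact in tool.provides:
            provider.setdefault(fact, tool)   # first listed provider wins
    needed, seen = {}, set()
    pending = [fact for fact in goal if fact not in state]
    while pending:
        fact = pending.pop()
        if fact in seen or fact in state:
            continue
        seen.add(fact)
        tool = provider.get(fact)
        if tool is None:
            continue                          # nothing provides this fact
        needed[tool.name] = tool
        pending.extend(tool.requires)
    return needed


def minimal_menu(tools, state, goal):
    """Show only goal-relevant tools that are callable in the current state."""
    relevant = needed_tools(tools, state, goal).values()
    return sorted(t.name for t in relevant if t.requires <= state)


TOOLS = [
    ToolSpec("lookup_order", provides=frozenset({"order_found"})),
    ToolSpec("issue_refund", requires=frozenset({"order_found"}),
             provides=frozenset({"refund_issued"}), risky=True),
    ToolSpec("delete_account", provides=frozenset({"account_deleted"}), risky=True),
    ToolSpec("send_email", provides=frozenset({"email_sent"})),
]
GOAL = {"refund_issued"}

menu_at_start = minimal_menu(TOOLS, set(), GOAL)
menu_after_lookup = minimal_menu(TOOLS, {"order_found"}, GOAL)
```

Under these assumptions the agent initially sees only `lookup_order`; the risky `issue_refund` tool becomes visible only once the order has been found, and the unrelated `delete_account` and `send_email` tools are never shown.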
Key facts
- 01 ToolMenuBench is a new benchmark for evaluating tool-menu construction in multi-step LLM agents, introduced by Rahul Suresh Babu and Laxmipriya Ganesh Iyer.
- 02 The benchmark varies tool-menu size, distractor type, state-dependent task structure, and risk exposure.
- 03 Metrics tracked include visible-tool count, risky-tool exposure, task success, wrong-tool calls, premature actions, and token usage.
- 04 Evaluation spans seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings.
- 05 Causal minimal tool filtering (CMTF) improved task success from 32.1% (all-tools exposure) to 85.7%.
- 06 CMTF reduced average token usage by roughly 98% compared to unfiltered exposure.
- 07 CMTF achieved the strongest overall tradeoff relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines.
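The headline figures can be restated as a quick calculation. The sketch below uses only the numbers reported above; because the token reduction is given as "roughly 98%", the resulting factor is approximate.

```python
baseline_success = 32.1   # % task success with all tools visible
cmtf_success = 85.7       # % task success with CMTF
token_reduction = 0.98    # approximate fraction of tokens saved

# Absolute gain in percentage points and relative improvement factor
gain_pp = round(cmtf_success - baseline_success, 1)
success_ratio = round(cmtf_success / baseline_success, 2)

# A 98% reduction leaves about 2% of the original tokens
token_factor = round(1 / (1 - token_reduction))

print(f"+{gain_pp} points, {success_ratio}x success, ~{token_factor}x fewer tokens")
```

That is a gain of 53.6 percentage points (about 2.67 times the baseline success rate) while using roughly one fiftieth of the tokens.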
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.