Jun 13, 2026 · 1 min read · Research Papers

ToolMenuBench benchmarks how tool visibility shapes LLM agent reliability

Researchers introduce ToolMenuBench, a benchmark evaluating how the set of tools shown to LLM agents affects task success, error rates, token usage, and risky-tool exposure across seven model backends and six filtering methods.

arXiv · Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Composite: 6.4 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The benchmark shows that tool-menu composition, not just model capability, is a primary driver of agent task success, error rates, and safety-relevant risk exposure. Causal minimal tool filtering (CMTF) cut average token usage by roughly 98% and raised task success from 32.1% under all-tools exposure to 85.7%, more than doubling it.

Summary: our read of the original

Rahul Suresh Babu and Laxmipriya Ganesh Iyer present ToolMenuBench, a benchmark designed to address a gap in existing LLM agent evaluations: most prior work asks whether a model can call a tool correctly, but largely ignores how the composition of the visible tool menu influences downstream agent behavior. ToolMenuBench systematically varies tool-menu size, distractor type, state-dependent task structure, and risk exposure, and tracks both filter-level metrics (visible-tool count, risky-tool exposure) and downstream agent metrics (task success, wrong-tool calls, premature actions, and token usage).
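The two metric families named above can be illustrated with a small sketch. It is not the benchmark's own code: the `Tool` and `Call` records, the `prerequisites_met` flag, and the working definitions (a wrong-tool call is a call to a tool the task does not need; a premature action is a call to a needed tool before its prerequisites hold) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    risky: bool = False  # hypothetical flag marking a safety-relevant tool


@dataclass(frozen=True)
class Call:
    tool: str
    prerequisites_met: bool  # hypothetical: were the tool's preconditions satisfied?


def menu_metrics(menu):
    """Filter-level metrics: what the agent is shown, before it acts."""
    return {
        "visible_tools": len(menu),
        "risky_exposed": sum(1 for t in menu if t.risky),
    }


def episode_metrics(calls, required, succeeded):
    """Agent-level metrics: what the agent did with the menu it was shown."""
    wrong = sum(1 for c in calls if c.tool not in required)
    premature = sum(
        1 for c in calls if c.tool in required and not c.prerequisites_met
    )
    return {
        "success": succeeded,
        "wrong_tool_calls": wrong,
        "premature_actions": premature,
    }


menu = [Tool("search"), Tool("refund"), Tool("delete_account", risky=True)]
calls = [
    Call("refund", prerequisites_met=False),  # needed tool, called too early
    Call("search", prerequisites_met=True),
    Call("weather", prerequisites_met=True),  # not needed for this task
    Call("refund", prerequisites_met=True),
]
print(menu_metrics(menu))
print(episode_metrics(calls, required={"search", "refund"}, succeeded=True))
```

Keeping the two families separate mirrors the distinction the summary draws: a filter can be scored on the menu it produces even before any model is run against it.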

The controlled evaluation spans seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings. The standout result is for causal minimal tool filtering (CMTF), which raised task success from 32.1% under all-tools exposure to 85.7% while reducing average token usage by roughly 98%. CMTF also achieved the strongest overall tradeoff across the tracked dimensions (visible tools, wrong-tool calls, premature actions, and risky-tool exposure) relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines. The authors frame ToolMenuBench as a reusable evaluation framework for the "agent-interface problem": determining which tools should be visible, when they should be visible, and under what cost or risk constraints.
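To make the contrast between all-tools exposure and a minimal, state-dependent menu concrete, here is a rough sketch. It is not the paper's CMTF algorithm, which this summary does not describe. The `requires`/`provides` fields, the `minimal_filter` function, and the example tools are all hypothetical; the sketch simply shows tools that lie on a dependency path to the goal and are callable in the current state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    requires: frozenset = frozenset()  # hypothetical: facts needed before calling
    provides: frozenset = frozenset()  # hypothetical: facts produced by calling
    risky: bool = False


def all_tools(tools, state, goal):
    """Unfiltered baseline: every tool is visible at every step."""
    return list(tools)


def minimal_filter(tools, state, goal):
    """Illustrative filter (an assumption, not the paper's method):
    walk backwards from the goal to find tools on a dependency path,
    then keep only those whose requirements already hold."""
    state = set(state)
    frontier = [goal] if goal not in state else []
    seen = set(frontier)
    relevant = []
    while frontier:
        fact = frontier.pop()
        for tool in tools:
            if fact in tool.provides and tool not in relevant:
                relevant.append(tool)
                for req in tool.requires:
                    if req not in state and req not in seen:
                        seen.add(req)
                        frontier.append(req)
    return [t for t in relevant if t.requires <= state]


TOOLS = [
    Tool("lookup_order", provides=frozenset({"order"})),
    Tool("issue_refund", requires=frozenset({"order"}),
         provides=frozenset({"refund"}), risky=True),
    Tool("get_weather", provides=frozenset({"weather"})),
    Tool("delete_account", provides=frozenset({"account_deleted"}), risky=True),
]

for state in (set(), {"order"}):
    menu = minimal_filter(TOOLS, state, goal="refund")
    print(sorted(state), "->", [t.name for t in menu])
```

In this toy setup the risky `issue_refund` tool only becomes visible once the order has been looked up, while the unfiltered menu exposes both risky tools from the first step. That is the kind of "when should a tool be visible" question the authors describe.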

Key facts

01 ToolMenuBench is a new benchmark for evaluating tool-menu construction in multi-step LLM agents, introduced by Rahul Suresh Babu and Laxmipriya Ganesh Iyer.
02 The benchmark varies tool-menu size, distractor type, state-dependent task structure, and risk exposure.
03 Metrics tracked include visible-tool count, risky-tool exposure, task success, wrong-tool calls, premature actions, and token usage.
04 Evaluation spans seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings.
05 Causal minimal tool filtering (CMTF) improved task success from 32.1% (all-tools exposure) to 85.7%.
06 CMTF reduced average token usage by roughly 98% compared to unfiltered exposure.
07 CMTF achieved the strongest overall tradeoff across the tracked metrics, ahead of unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines.
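For a sense of scale, the reported headline figures can be restated as a percentage-point gain, a relative gain, and a token ratio. The snippet below uses only the numbers quoted above; the variable names are ours, and the token ratio treats "roughly 98%" as exactly 0.98, so it is approximate.

```python
baseline_success = 32.1  # % task success under all-tools exposure (reported)
cmtf_success = 85.7      # % task success under CMTF (reported)
token_reduction = 0.98   # "roughly 98%" taken as exact for illustration

point_gain = round(cmtf_success - baseline_success, 1)      # percentage points
relative_gain = round(cmtf_success / baseline_success, 2)   # multiple of baseline
token_ratio = round(1 / (1 - token_reduction), 1)           # baseline tokens per CMTF token

print(f"gain: +{point_gain} points ({relative_gain}x the baseline)")
print(f"tokens: about {token_ratio}x fewer than unfiltered exposure")
```

This works out to a gain of 53.6 percentage points, about 2.67 times the baseline success rate, using roughly one fiftieth of the tokens.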

Topics

#benchmarks #tool-use #agent-framework #safety #multi-agent

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.
