LLM agents defer blindly to GNN tools, and stronger models defer more
A study by Zhongyuan Wang and Pratyusha Vemuri finds that LLM agents equipped with GNN tools do not exercise independent judgment — they echo the tool's output up to 99.2% of the time, and this blind deference actually increases as model capability scales up.
The findings show that agent+tool evaluations cannot assume the agent adds judgment on top of the tool — and that the gap between parrot behavior and optimal action widens, not shrinks, as LLM capability scales.
Zhongyuan Wang and Pratyusha Vemuri expose a frozen GNN to a ReAct-style LLM agent as an explicit callable tool and measure whether the agent exercises judgment over when and how much to rely on it, using node classification on text-attributed graphs (ogbn-arxiv, replicated on WikiCS). The core finding is stark: agents agree with the raw GNN's output 97.6–99.2% of the time across 5 seeds, collapsing into what the authors call a "GNN parrot" that adopts the tool's output wholesale and bypasses its own reasoning. Sweeping backbone capability across Qwen2.5 models from 0.5B to 7B, the study shows deference is not a weak-model artifact — among models able to invoke the tool at all, agreement rises with capability, from 0.60 at 1.5B to 0.98 at 7B.
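The agreement figures above are, in essence, the share of nodes on which the agent's final label matches the raw GNN's label. A minimal sketch of that metric follows. The function name and the toy label lists are illustrative assumptions, not code or data from the study.

```python
from typing import Sequence


def agreement_rate(agent_preds: Sequence[int], gnn_preds: Sequence[int]) -> float:
    """Fraction of nodes where the agent's final label equals the GNN tool's label."""
    if len(agent_preds) != len(gnn_preds):
        raise ValueError("prediction lists must have the same length")
    if len(agent_preds) == 0:
        raise ValueError("need at least one prediction")
    matches = sum(a == g for a, g in zip(agent_preds, gnn_preds))
    return matches / len(agent_preds)


# Toy labels, made up for illustration (not results from the paper).
gnn_labels = [3, 1, 4, 1, 5, 9, 2, 6]
agent_labels = [3, 1, 4, 1, 5, 9, 2, 7]  # differs from the GNN on one node

rate = agreement_rate(agent_labels, gnn_labels)
print(f"agreement with GNN: {rate:.3f}")
```

Note that agreement is measured against the tool, not against ground truth, so a high value says nothing about accuracy on its own.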
The cost of this blind deference is significant and grows as model capability improves. A per-node oracle over available actions beats the parrot by 0.09–0.18 at 3B and 0.12–0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent's own alternatives improve with scale. At 7B, a simple neighbour-label tool actually outperforms the GNN at high homophily (0.81 vs. 0.71), yet the agent still defers to the GNN. A selective-invocation gate recovers roughly half of the high-homophily gap (0.71 to 0.83) but yields no net global gain, and held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom — suggesting the ceiling on selective invocation is limited by available information, not just router design. The authors conclude that evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and that selective invocation must be deliberately designed in rather than expected to emerge from scale.
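One way to read the oracle comparison: for each node, the oracle counts as correct if any available action would have produced the right label, while the parrot is only as good as the GNN. The sketch below assumes three hypothetical actions (`gnn`, `neighbour`, `llm_only`) and uses made-up labels. The study's actual action set and scoring may differ.

```python
from typing import Dict, List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Plain classification accuracy."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(action_preds: Dict[str, List[int]], labels: List[int]) -> float:
    """Accuracy of a per-node oracle that picks a correct action whenever one exists."""
    hits = 0
    for i, y in enumerate(labels):
        if any(preds[i] == y for preds in action_preds.values()):
            hits += 1
    return hits / len(labels)


# Toy data, invented for illustration only.
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
actions = {
    "gnn":       [0, 1, 2, 0, 1, 0, 1, 2, 2, 0],  # what the parrot always returns
    "neighbour": [0, 1, 0, 0, 1, 2, 0, 2, 2, 1],  # hypothetical neighbour-label tool
    "llm_only":  [0, 2, 2, 1, 1, 2, 0, 0, 1, 0],  # hypothetical tool-free guess
}

parrot_acc = accuracy(actions["gnn"], labels)
oracle_acc = oracle_accuracy(actions, labels)
headroom = oracle_acc - parrot_acc  # what perfect per-node selection would add

print(f"parrot={parrot_acc:.2f} oracle={oracle_acc:.2f} headroom={headroom:.2f}")
```

Because the oracle uses the true label to choose, its headroom is an upper bound on what any real selection rule could recover.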
Key facts
- Agents agreed with the raw GNN's predictions 97.6–99.2% of the time across 5 seeds, a behavior the authors call "GNN parrot".
- Deference is not a weak-model artifact: agreement rises with backbone capability, from 0.60 at 1.5B to 0.98 at 7B (Qwen2.5).
- Experiments used a ReAct-style agent on node classification over ogbn-arxiv and WikiCS text-attributed graphs.
- A per-node oracle beats the parrot by 0.09–0.18 at 3B and 0.12–0.22 at 7B, with the gap roughly doubling at high homophily.
- At 7B, a simple neighbour-label tool outperforms the GNN at high homophily (0.81 vs. 0.71), yet the agent still defers to the GNN.
- A selective-invocation gate recovers about half the high-homophily gap (0.71 to 0.83) but yields no net global gain.
- Held-out estimates bound the best achievable gate to at most a third of the oracle headroom, suggesting the limit is informational, not just a router design problem.
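To make the idea of a selective-invocation gate concrete, here is a minimal sketch of one possible rule: trust the neighbour-label vote when a node's labelled neighbours mostly agree, and fall back to the GNN otherwise. The summary does not describe the authors' gate, so the feature (majority share among neighbour labels), the `threshold` value, and all function names are assumptions for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def neighbour_vote(neighbour_labels: List[int]) -> Tuple[Optional[int], float]:
    """Return (majority label, share of neighbours holding it).

    Returns (None, 0.0) when the node has no labelled neighbours.
    """
    if not neighbour_labels:
        return None, 0.0
    label, count = Counter(neighbour_labels).most_common(1)[0]
    return label, count / len(neighbour_labels)


def gated_prediction(
    gnn_pred: int,
    neighbour_labels: List[int],
    threshold: float = 0.8,  # hypothetical cut-off, not a value from the study
) -> Tuple[int, str]:
    """Pick the neighbour vote when local label agreement is high, else the GNN."""
    vote, share = neighbour_vote(neighbour_labels)
    if vote is not None and share >= threshold:
        return vote, "neighbour"
    return gnn_pred, "gnn"


# Toy nodes, invented for illustration.
print(gated_prediction(gnn_pred=2, neighbour_labels=[1, 1, 1, 1, 0]))
print(gated_prediction(gnn_pred=2, neighbour_labels=[1, 0, 3]))
print(gated_prediction(gnn_pred=2, neighbour_labels=[]))
```

A rule like this can only help where its feature separates nodes on which the GNN fails, which fits the point above that the limit on gating is informational and not only a matter of router design.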
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.