LLM agents struggle to infer hidden world models via interaction
A new benchmark called agentic automata learning tests whether tool-calling LLM agents can uncover hidden deterministic finite automata through membership and equivalence queries, finding that performance degrades sharply with DFA size and that all tested models fall well short of classic algorithms.
Score breakdown
The benchmark exposes concrete, measurable gaps in LLM agents' ability to infer hidden world models through interaction, providing a rigorous testbed with classical algorithm baselines that quantifies how far current agents fall short of robust interactive discovery.
Menaged, Lior, and Ravfogel propose agentic automata learning as a framework for evaluating how well tool-calling LLM agents can uncover hidden environments through interaction. In the benchmark, an agent must reverse-engineer a hidden deterministic finite automaton (DFA) by querying an oracle in two ways: membership queries, which ask whether a given string belongs to the target language, and equivalence queries, which ask whether a proposed DFA matches the target. This design yields a scalable testbed with precisely controlled task complexity and measurable interaction efficiency, and it comes with strong algorithmic baselines drawn from the classical automata-learning literature.
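To make the two query types concrete, here is a minimal sketch of an oracle over a hidden DFA. It is an illustration only, not the authors' implementation: the `DFA` and `Oracle` classes, their method names, and the query counters are assumptions introduced for this example, and symbols are assumed to be single characters so that a word is an ordinary string. The equivalence query returns a shortest distinguishing word as a counterexample, which is one common convention; the source does not say what feedback the benchmark's oracle gives.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DFA:
    """A complete DFA whose states are integers (hypothetical representation)."""
    alphabet: tuple      # e.g. ("a", "b")
    transitions: dict    # (state, symbol) -> next state
    accepting: frozenset # accepting states
    start: int = 0

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Hypothetical oracle: hides a target DFA and counts the queries it answers."""

    def __init__(self, target):
        self._target = target
        self.membership_calls = 0
        self.equivalence_calls = 0

    def membership(self, word):
        """Membership query: is `word` in the target language?"""
        self.membership_calls += 1
        return self._target.accepts(word)

    def equivalence(self, hypothesis):
        """Equivalence query: (True, None) if the languages match,
        otherwise (False, shortest word on which the two DFAs disagree)."""
        self.equivalence_calls += 1
        t, h = self._target, hypothesis
        start = (t.start, h.start)
        seen = {start}
        queue = deque([(start, "")])
        # Breadth-first search over pairs of states reached by the same word.
        while queue:
            (ts, hs), word = queue.popleft()
            if (ts in t.accepting) != (hs in h.accepting):
                return False, word
            for symbol in t.alphabet:
                nxt = (t.transitions[(ts, symbol)], h.transitions[(hs, symbol)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return True, None


# Hidden target: words over {a, b} containing an even number of "a"s.
target = DFA(
    alphabet=("a", "b"),
    transitions={(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
    accepting=frozenset({0}),
)
oracle = Oracle(target)

in_language = oracle.membership("abab")  # two "a"s, so the word is accepted

# A deliberately wrong one-state guess that accepts every word.
guess = DFA(
    alphabet=("a", "b"),
    transitions={(0, "a"): 0, (0, "b"): 0},
    accepting=frozenset({0}),
)
is_equal, witness = oracle.equivalence(guess)
```

Counting calls on the oracle is one simple way to measure interaction efficiency: an agent that identifies the target with fewer membership and equivalence queries is, by this measure, more efficient.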
Evaluating state-of-the-art LLMs on this benchmark, the authors find that performance degrades sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, but trajectory analyses reveal that even the best agents exhibit recurring failures across three dimensions: query planning, evidence integration, and hypothesis construction. The overall finding is that current LLM agents can sometimes accomplish non-trivial interactive discovery, but they remain far less robust and efficient than classic automata-learning algorithms for the same task, suggesting meaningful gaps in the world-model inference capabilities of today's agents.
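For a sense of what a classical learner looks like, the sketch below implements a compact L*-style algorithm. The summary does not name which classical baselines the authors use; Angluin's L* is one well-known algorithm that learns a DFA from exactly these two query types, and this version uses the common variant that adds every suffix of a counterexample as a new table column. All function names and the tuple-based DFA representation `(transitions, accepting)` with start state 0 are assumptions made for the example.

```python
from collections import deque


def accepts(dfa, word):
    """Run a DFA given as (transitions, accepting); the start state is 0."""
    transitions, accepting = dfa
    state = 0
    for symbol in word:
        state = transitions[(state, symbol)]
    return state in accepting


def counterexample(target, hypothesis, alphabet):
    """Shortest word on which two complete DFAs disagree, or None if equivalent."""
    seen = {(0, 0)}
    queue = deque([((0, 0), "")])
    while queue:
        (p, q), word = queue.popleft()
        if (p in target[1]) != (q in hypothesis[1]):
            return word
        for a in alphabet:
            nxt = (target[0][(p, a)], hypothesis[0][(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + a))
    return None


def lstar(alphabet, member, equiv):
    """L*-style learner. `member(word)` answers membership queries;
    `equiv(dfa)` returns None if the hypothesis is correct, else a counterexample."""
    S, E, cache = [""], [""], {}  # access words, distinguishing suffixes, answers

    def mq(word):
        if word not in cache:
            cache[word] = member(word)
        return cache[word]

    def row(prefix):
        return tuple(mq(prefix + e) for e in E)

    while True:
        # Close the table: each one-symbol extension of a word in S
        # must have the same row as some word already in S.
        known = {row(s) for s in S}
        i = 0
        while i < len(S):
            for a in alphabet:
                r = row(S[i] + a)
                if r not in known:
                    known.add(r)
                    S.append(S[i] + a)
            i += 1
        # Build a hypothesis DFA: one state per (distinct) row of S.
        index = {row(s): k for k, s in enumerate(S)}
        transitions = {(k, a): index[row(s + a)]
                       for k, s in enumerate(S) for a in alphabet}
        accepting = {k for k, s in enumerate(S) if mq(s)}
        hypothesis = (transitions, accepting)
        cex = equiv(hypothesis)
        if cex is None:
            return hypothesis
        # Refine: add every suffix of the counterexample as a new column.
        for j in range(len(cex) + 1):
            if cex[j:] not in E:
                E.append(cex[j:])


# Hidden target: words over {a, b} whose number of "a"s is divisible by 3.
alphabet = ("a", "b")
target = (
    {(0, "a"): 1, (1, "a"): 2, (2, "a"): 0,
     (0, "b"): 0, (1, "b"): 1, (2, "b"): 2},
    {0},
)
learned = lstar(
    alphabet,
    member=lambda w: accepts(target, w),
    equiv=lambda h: counterexample(target, h, alphabet),
)
```

On this small target the learner's first hypothesis is wrong, the counterexample adds new columns, and the second hypothesis is accepted. The query pattern is fully systematic, which is the kind of robustness the article contrasts with LLM agents' behavior.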
Key facts
- 01 The paper introduces 'agentic automata learning', a benchmark for testing whether tool-calling LLM agents can uncover hidden deterministic finite automata (DFAs).
- 02 Agents interact with an oracle via two query types: membership queries and equivalence queries.
- 03 The testbed offers controlled task complexity, measurable interaction efficiency, and classical automata-learning algorithms as baselines.
- 04 Performance drops sharply as DFA size increases across all evaluated models.
- 05 Reasoning models are markedly stronger than non-reasoning models on the benchmark.
- 06 Trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction.
- 07 Current LLM agents remain far less robust and efficient than classic algorithms for automata learning.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.