AdaPlanBench tests LLM agents on adaptive planning under dual constraints
Researchers introduce AdaPlanBench, a dynamic benchmark built on 307 household tasks that evaluates whether LLM agents can adaptively re-plan as world and user constraints are progressively revealed through interaction.
Score breakdown
AdaPlanBench fills a gap in LLM evaluation by providing a structured testbed for dual-constrained interactive planning. Its results, with the best model reaching only 67.75% accuracy, show how far current LLM agents are from reliably adapting to dynamically revealed constraints.
Jiayu Liu, Cheng Qian, and Zhenhailong Wang present AdaPlanBench, a benchmark targeting a gap in existing LLM evaluation: most benchmarks do not test adaptive planning when both world and user constraints are revealed progressively rather than specified upfront. AdaPlanBench is built on 307 household tasks and uses a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents engage in a multi-turn protocol where hidden constraints surface only after the agent proposes a plan that violates them, forcing iterative revision under accumulating feedback.
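The reveal-on-violation protocol described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Constraint` and `Episode` classes, the `run_episode` loop, the `max_turns` limit, the toy agent, and the example constraints are all hypothetical, chosen only to show how hidden constraints might surface one at a time and feed back into re-planning.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Plan = List[str]  # a plan is an ordered list of action strings


@dataclass
class Constraint:
    """Hypothetical representation of one hidden constraint."""
    kind: str                             # "world" or "user"
    message: str                          # feedback shown once the constraint is violated
    violated_by: Callable[[Plan], bool]   # True if the plan breaks this constraint


@dataclass
class Episode:
    """One task with its constraints; all start hidden from the agent."""
    constraints: List[Constraint]
    revealed: List[Constraint] = field(default_factory=list)

    def check(self, plan: Plan) -> Optional[Constraint]:
        """Return the first violated constraint (revealing it), or None if the plan is valid."""
        for c in self.constraints:
            if c.violated_by(plan):
                if c not in self.revealed:
                    self.revealed.append(c)
                return c
        return None


def run_episode(agent: Callable[[List[str]], Plan],
                episode: Episode,
                max_turns: int = 5) -> Tuple[bool, int]:
    """Multi-turn loop: the agent re-plans using all feedback revealed so far.

    max_turns is an assumed cap, not a value taken from the paper.
    """
    for turn in range(1, max_turns + 1):
        feedback = [f"[{c.kind}] {c.message}" for c in episode.revealed]
        plan = agent(feedback)
        if episode.check(plan) is None:
            return True, turn   # plan satisfies every constraint
    return False, max_turns


def toy_agent(feedback: List[str]) -> Plan:
    """Stand-in for an LLM: rule-based re-planning from accumulated feedback."""
    plan = ["pick up knife", "slice bread with knife", "toast bread in toaster"]
    if any("knife" in f for f in feedback):
        plan = ["tear bread by hand", "toast bread in toaster"]
    if any("toaster" in f for f in feedback):
        plan[-1] = "toast bread in pan"
    return plan


# Made-up household task with one world and one user constraint.
episode = Episode(constraints=[
    Constraint("world", "The knife is unavailable.",
               lambda p: any("knife" in step for step in p)),
    Constraint("user", "The user prefers not to use the toaster.",
               lambda p: any("toaster" in step for step in p)),
])

solved, turns = run_episode(toy_agent, episode)
```

In this toy run the agent needs three turns: the first plan reveals the world constraint, the second reveals the user constraint, and the third satisfies both. A real evaluation would replace `toy_agent` with a call to an LLM and would need a far richer notion of plan validity than substring checks.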
Experiments on ten leading LLMs reveal that dual-constrained adaptive planning remains a significant challenge. The best-performing model reached only 67.75% accuracy, and performance consistently degraded as the number of accumulated constraints grew. User constraints posed a particularly large challenge compared to world constraints, and failure analysis points to weaker physical grounding and reduced re-planning effectiveness as primary causes. The authors frame AdaPlanBench as a testbed for studying reliable adaptation to dynamically revealed constraints in LLM agents.
Key facts
- 01 AdaPlanBench is a dynamic interactive benchmark for evaluating LLM agents on adaptive planning under dual (world and user) constraints.
- 02 The benchmark is built on 307 household tasks with a scalable constraint construction pipeline.
- 03 Hidden constraints are revealed only when an agent proposes a plan that violates them, requiring iterative re-planning.
- 04 Ten leading LLMs were evaluated on the benchmark.
- 05 The best-performing model achieved only 67.75% accuracy.
- 06 Performance degrades as more constraints accumulate over the course of interaction.
- 07 User constraints posed a particularly large challenge, with failures often linked to weaker physical grounding.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.