EvoArena benchmark exposes LLM agents' struggle with dynamic environments
Researchers introduce EvoArena, a benchmark for evaluating LLM agents in dynamic environments, and EvoMem, a patch-based memory paradigm that improves agent performance on evolving tasks.
Score breakdown
The paper demonstrates that static-environment benchmarks fail to capture real-world agent deployment challenges, and that EvoMem's structured update histories directly improve agent accuracy on both the new EvoArena benchmark and established benchmarks like GAIA and LoCoMo.
Jundong Xu, Qingchuan Li, and Jiaying Wu identify a critical blind spot in LLM agent evaluation: most benchmarks assume static environments, while real-world deployment requires agents to continuously adapt to changing conditions. To close this gap, they introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates spanning terminal, software, and social-preference domains. Experiments reveal that current agents struggle significantly, achieving an average accuracy of only 39.6% across these evolving domains.
To improve agent performance in dynamic settings, the authors propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories rather than static snapshots. This design enables agents to reason about environmental change by tracking how their memory has shifted over time. EvoMem consistently improves results: it yields an average gain of 1.5% on EvoArena, 6.1% on the GAIA benchmark, and 4.8% on LoCoMo. It also improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis indicates that EvoMem improves evidence capture in memory, suggesting better preservation of complete evolving environment states.
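The difference between a static snapshot and a structured update history can be illustrated with a minimal sketch. This is not EvoMem's implementation, which the summary does not describe; the `Patch` and `PatchMemory` classes, their fields, and the example keys are all hypothetical names chosen for illustration. The idea shown is only that each change is stored as a patch (old value, new value, and the update step), so earlier states can be rebuilt and changes can be inspected.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Patch:
    """One recorded change to a memory key (hypothetical structure)."""
    step: int            # index of the environment update where the change was seen
    key: str
    old: Optional[Any]   # None if the key was not in memory before this patch
    new: Any


@dataclass
class PatchMemory:
    """Holds the current state plus the ordered patches that produced it."""
    state: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, step: int, key: str, value: Any) -> None:
        old = self.state.get(key)
        if key in self.state and old == value:
            return  # value unchanged, so no patch is recorded
        self.history.append(Patch(step, key, old, value))
        self.state[key] = value

    def changes_for(self, key: str) -> list:
        """Return every patch that touched `key`, in recording order."""
        return [p for p in self.history if p.key == key]

    def state_at(self, step: int) -> dict:
        """Replay patches with step <= `step` to rebuild an earlier state."""
        rebuilt = {}
        for p in self.history:
            if p.step <= step:
                rebuilt[p.key] = p.new
        return rebuilt


# Illustrative usage with made-up keys and values.
mem = PatchMemory()
mem.update(0, "config_path", "/etc/app/v1.yaml")
mem.update(1, "config_path", "/etc/app/v2.yaml")
mem.update(1, "log_level", "debug")
mem.update(2, "log_level", "debug")  # no change, so nothing is appended

print(mem.state)                       # latest values only, like a snapshot
print(mem.changes_for("config_path"))  # the full change history for one key
print(mem.state_at(0))                 # the state as it was after step 0
```

A snapshot-only memory would keep just `mem.state` and lose the fact that `config_path` ever pointed elsewhere. Keeping the patch list is what lets a reader of the memory ask how a value changed, which is the kind of reasoning about environmental change the paragraph above attributes to EvoMem.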
Key facts
- EvoArena is a new benchmark suite modeling environment changes as progressive updates across terminal, software, and social domains.
- Current LLM agents achieve an average accuracy of only 39.6% on EvoArena.
- EvoMem is a patch-based memory paradigm that records memory evolution as structured update histories.
- EvoMem yields an average gain of 1.5% on EvoArena over baseline agents.
- EvoMem improves performance on GAIA by 6.1% and on LoCoMo by 4.8%.
- EvoMem improves chain-level accuracy by 3.7% on EvoArena, where agents must complete consecutive sequences of related evolutionary subtasks.
- Mechanistic analysis shows EvoMem improves evidence capture, indicating better preservation of complete evolving environment states.
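To make the distinction between per-task accuracy and chain-level accuracy concrete, here is a small sketch. It assumes a chain counts as correct only when every subtask in it succeeds; the summary does not give the paper's exact metric definition, so treat that as an assumption. The function names and the example outcomes are hypothetical and are not results from the paper.

```python
def step_accuracy(chains):
    """Fraction of individual subtasks solved, pooled across all chains."""
    total = sum(len(chain) for chain in chains)
    correct = sum(sum(chain) for chain in chains)  # True counts as 1
    return correct / total if total else 0.0


def chain_accuracy(chains):
    """Fraction of chains in which every subtask was solved.

    Assumes each chain is a non-empty list of booleans.
    """
    if not chains:
        return 0.0
    return sum(all(chain) for chain in chains) / len(chains)


# Made-up outcomes: each inner list is one chain of related subtasks.
outcomes = [
    [True, True, True],   # whole chain solved
    [True, False, True],  # one failure breaks the chain
    [True, True],         # whole chain solved
]

print(step_accuracy(outcomes))   # 7 of 8 subtasks solved
print(chain_accuracy(outcomes))  # 2 of 3 chains fully solved
```

Under this assumed definition a single failed subtask zeroes out its whole chain, so chain-level accuracy is never higher than the per-subtask figure and can move by a different amount when an agent improves.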
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.