RetailBench tests LLM agents on 180-day supermarket management
RetailBench is a new simulation benchmark that evaluates tool-using LLM agents on long-horizon retail management tasks, finding that most models fail to survive a full 180-day evaluation and all fall well short of an oracle policy.
Score breakdown
RetailBench shows that current LLMs cannot sustain coherent long-horizon decision-making in economically grounded environments: most models fail to complete the 180-day simulation, and all fall substantially short of an oracle policy on net worth and sales.
RetailBench, introduced by Linghua Zhang, Jun Wang, and Jingtong Wu, is a data-grounded simulation benchmark targeting a gap in current LLM agent evaluation: while agents have shown strong performance on short-horizon, well-scoped tasks, their ability to sustain coherent decisions across dynamic, long-horizon environments has remained unclear. The benchmark simulates single-store supermarket operations, framing retail management as a partially observable decision process capable of supporting thousand-day-scale simulations. Agents must juggle a broad set of interdependent responsibilities including pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints.
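To make the "partially observable decision process" framing concrete, here is a minimal Python sketch of such a loop. It is not the authors' simulator: every name (`StoreState`, `Action`, `observe`, `step`, `run_episode`, `restock_policy`), the linear demand curve, and all numeric constants are hypothetical. The only ideas taken from the summary are that the agent sees part of the state, sets things like price and replenishment under a cash-flow constraint, and may fail to survive the full horizon.

```python
from dataclasses import dataclass
import random


@dataclass
class StoreState:
    cash: float
    stock: int
    base_demand: float  # hidden from the agent (assumed for illustration)


@dataclass
class Action:
    price: float
    order_qty: int


UNIT_COST = 2.0    # hypothetical wholesale cost per unit
DAILY_RENT = 20.0  # hypothetical fixed daily cost


def observe(state: StoreState) -> dict:
    """Return only what the agent can see: cash and stock, not true demand."""
    return {"cash": state.cash, "stock": state.stock}


def step(state: StoreState, action: Action, rng: random.Random) -> int:
    """Advance the store by one day in place and return units sold."""
    # Cash-flow constraint: the order is capped by what the store can pay for.
    affordable = int(max(state.cash, 0.0) // UNIT_COST)
    qty = max(0, min(action.order_qty, affordable))
    state.cash -= qty * UNIT_COST
    state.stock += qty
    # Toy linear demand: a higher price means fewer customers, plus noise.
    demand = max(0, round(state.base_demand - 5.0 * action.price + rng.gauss(0, 2)))
    sold = min(demand, state.stock)
    state.stock -= sold
    state.cash += sold * action.price - DAILY_RENT
    return sold


def run_episode(policy, days: int = 180, seed: int = 0) -> tuple[int, StoreState]:
    """Run until the horizon or bankruptcy; return (days survived, final state)."""
    rng = random.Random(seed)
    state = StoreState(cash=500.0, stock=50, base_demand=40.0)
    for day in range(days):
        step(state, policy(observe(state)), rng)
        if state.cash < 0:  # the store did not survive this day
            return day + 1, state
    return days, state


def restock_policy(obs: dict) -> Action:
    """Naive fixed-price policy that tops stock up to 60 units each day."""
    return Action(price=4.0, order_qty=max(0, 60 - obs["stock"]))
```

Calling `run_episode(restock_policy)` runs the toy store for up to 180 days. A policy that gives goods away at a price of zero goes bankrupt well before the horizon, which is the kind of failure the benchmark's "survival" criterion is meant to capture.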
Seven contemporary LLMs were tested under representative agent frameworks across a 180-day evaluation horizon, with results compared against a privileged oracle policy. The findings reveal substantial variation across models: only a small subset survived the full 180-day horizon, and even the best-performing LLM runs remained substantially behind the oracle in final net worth and sales outcomes. Behavioral analysis identified three root causes for these gaps: incomplete evidence acquisition, surface-level decision making, and the absence of a consistent long-horizon policy. The authors position RetailBench as a controlled testbed for studying reliable autonomy in economically grounded, long-horizon decision-making.
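The summary does not say how survival and the oracle gap are computed, so the following Python sketch shows only one plausible way to summarize such runs. The `RunResult` record, both function names, and every number below are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str
    days_survived: int
    final_net_worth: float


def survival_rate(runs: list[RunResult], horizon: int = 180) -> float:
    """Fraction of runs that reached the full evaluation horizon."""
    if not runs:
        raise ValueError("no runs to summarize")
    return sum(r.days_survived >= horizon for r in runs) / len(runs)


def oracle_ratio(run: RunResult, oracle_net_worth: float) -> float:
    """Final net worth as a fraction of the oracle's (assumes a positive oracle value)."""
    if oracle_net_worth <= 0:
        raise ValueError("oracle net worth must be positive")
    return run.final_net_worth / oracle_net_worth


# Placeholder numbers for illustration only; these are not results from the paper.
runs = [
    RunResult("model_a", 180, 1200.0),
    RunResult("model_b", 45, -50.0),
    RunResult("model_c", 90, -10.0),
    RunResult("model_d", 180, 800.0),
]
```

With these placeholder runs, `survival_rate(runs)` counts the two runs that reached day 180, and `oracle_ratio(runs[0], 2400.0)` expresses one run's final net worth relative to an assumed oracle value.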
Key facts
- 01 RetailBench is a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operations.
- 02 The benchmark models retail management as a partially observable decision process and supports thousand-day-scale simulations.
- 03 Agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints.
- 04 Seven contemporary LLMs were evaluated under representative agent frameworks over a 180-day evaluation horizon.
- 05 Only a small subset of models survived the full 180-day evaluation horizon.
- 06 Even the strongest LLM runs remained substantially behind the oracle policy in final net worth and sales outcomes.
- 07 Behavioral analysis attributed performance gaps to incomplete evidence acquisition, surface-level decision making, and lack of a consistent long-horizon policy.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.