Every processed story in reverse chronological order, newest coverage first. Filter by tag, source, or score to drill in.
RetailBench shows that current LLMs cannot sustain coherent long-horizon decision-making in economically grounded environments: most models fail to complete even a 180-day simulation, and all fall substantially short of an oracle policy on net worth and sales.
The paper describes Emergence World as the first platform purpose-built to make long-horizon multi-agent dynamics (behavioral drift, cross-vendor influence, and emergent governance) measurable, filling a gap left by short-horizon benchmarks that cannot observe these phenomena.