Alem benchmark exposes coordination gap in frontier LLMs
Researchers introduce Alem, a JAX-based open-ended multi-agent benchmark built on Craftax-like dynamics, revealing that 13 modern LLMs average only ~6% normalised return and that individual task competence does not imply coordination competence.
Alem makes multi-agent coordination a measurable, distinct bottleneck — separate from single-agent capabilities — for the first time in a long-horizon, open-ended setting, providing a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans.
Kale-ab Abebe Tessera, Andras Szecsenyi, and Cameron Barker present Alem, a JAX-based benchmark for open-ended multi-agent coordination that addresses a gap in existing evaluations, which tend to focus on single-agent tasks, short interactions, or highly structured multi-agent settings. Alem is built on Craftax-like dynamics and features a long-horizon survival world with exploration, crafting, trading, and combat. It incorporates procedurally generated coordination tasks, soft specialisation, communication channels, and controllable coordination difficulty, making it possible to isolate and measure coordination as a distinct capability.
The authors evaluate 13 modern LLMs zero-shot within homogeneous teams, using trained MARL agents as reference points. Current LLM agents average only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward — a contrast the authors use to argue that individual task competence does not imply coordination competence.
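This summary does not say how Alem normalises return or how it separates base-task reward from coordination reward. The following is a minimal sketch, assuming a min-max normalisation against a lower and an upper reference score. The function names, the reference scores, and all numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def normalised_return(raw: float, low: float, high: float) -> float:
    """Min-max normalise a raw episode return to the range [0, 1].

    Assumption: `low` and `high` are reference scores (for example a random
    policy and the maximum achievable return). The paper's actual
    normalisation may differ.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    # Clip so that scores outside the reference range stay within [0, 1].
    return min(1.0, max(0.0, (raw - low) / (high - low)))


def team_summary(base_returns, coord_returns, low, high):
    """Average normalised base-task and coordination return separately."""
    return {
        "base": mean(normalised_return(r, low, high) for r in base_returns),
        "coordination": mean(
            normalised_return(r, low, high) for r in coord_returns
        ),
    }


# Placeholder numbers for illustration only, not results from the paper.
summary = team_summary(
    base_returns=[40.0, 60.0],
    coord_returns=[5.0, 15.0],
    low=0.0,
    high=100.0,
)
print(summary)
```

Reporting the two averages side by side, rather than one combined score, is what would make a gap like the one described for GPT-5.4-High visible: a team can score well on base tasks while its coordination average stays low.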
Ablation studies identify communication as the largest single contributor to coordination performance, with memory and reasoning providing additional benefit when used to maintain multi-step plans. The paper concludes that coordination is a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities, and that Alem provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.
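One common way to read an ablation like this is to compare the full agent's score with the score obtained when a single component is removed, then rank components by the size of the drop. The sketch below assumes that setup. The function name and the scores are hypothetical placeholders, not the paper's protocol or its results.

```python
def ablation_contributions(
    full_score: float, ablated_scores: dict[str, float]
) -> list[tuple[str, float]]:
    """Rank components by how much the score drops when each is removed.

    `ablated_scores` maps a component name to the score measured with
    that component disabled (hypothetical inputs).
    """
    drops = {
        name: full_score - score for name, score in ablated_scores.items()
    }
    # Largest drop first: that component contributes the most.
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores for illustration only, not results from the paper.
ranking = ablation_contributions(
    full_score=0.30,
    ablated_scores={"communication": 0.10, "memory": 0.22, "reasoning": 0.25},
)
for name, drop in ranking:
    print(f"{name}: score drop {drop:.2f}")
```

This attributes a contribution to each component in isolation, so it would not capture interactions, such as memory helping only when communication is also available.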
Key facts
- Alem is a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics.
- 13 modern LLMs were evaluated zero-shot in homogeneous teams, with trained MARL agents as reference points.
- LLM agents average only ~6% normalised return on Alem, far from solving the benchmark.
- On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps.
- GPT-5.4-High achieves strong base-task reward but much lower coordination reward.
- Ablations identify communication as the largest contributor to coordination performance.
- Memory and reasoning help coordination when used to maintain multi-step plans.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.