Jun 6, 2026 · 1 min read · Research Papers

Alem benchmark exposes coordination gap in frontier LLMs

Researchers introduce Alem, a JAX-based open-ended multi-agent benchmark built on Craftax-like dynamics, revealing that 13 modern LLMs average only ~6% normalised return and that individual task competence does not imply coordination competence.

arXiv · Kale-ab Abebe Tessera, Andras Szecsenyi, Cameron Barker

Composite score: 6.7 out of 10

Weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Alem makes multi-agent coordination measurable as a distinct bottleneck, separate from single-agent capabilities, for the first time in a long-horizon, open-ended setting. It also provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans.

Summary: our read of the original

Kale-ab Abebe Tessera, Andras Szecsenyi, and Cameron Barker present Alem, a JAX-based benchmark for open-ended multi-agent coordination that addresses a gap in existing evaluations, which tend to focus on single-agent tasks, short interactions, or highly structured multi-agent settings. Alem is built on Craftax-like dynamics and features a long-horizon survival world with exploration, crafting, trading, and combat. It incorporates procedurally generated coordination tasks, soft specialisation, communication channels, and controllable coordination difficulty, making it possible to isolate and measure coordination as a distinct capability.

The authors evaluate 13 modern LLMs zero-shot within homogeneous teams, using trained MARL agents as reference points. Current LLM agents average only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward — a contrast the authors use to argue that individual task competence does not imply coordination competence.
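To make the base-task versus coordination contrast concrete, here is a minimal sketch of how a normalised return could be computed separately for each reward stream. The normalisation formula (min-max against a baseline and a reference score), the function name, and every number below are assumptions for illustration; none of them come from the paper or from the Alem codebase.

```python
from statistics import mean


def normalised_return(raw, baseline, reference):
    """Rescale a raw return so that baseline maps to 0 and reference maps to 1.

    ASSUMPTION: min-max normalisation; this summary does not state the
    exact scheme the paper uses.
    """
    if reference == baseline:
        raise ValueError("reference and baseline must differ")
    return (raw - baseline) / (reference - baseline)


# Placeholder per-episode returns for one hypothetical team (NOT paper data).
team_returns = {
    "base_task": [12.0, 15.0, 9.0],
    "coordination": [1.0, 0.0, 2.0],
}
BASELINE = 0.0  # hypothetical: return of a do-nothing policy
REFERENCE = {"base_task": 20.0, "coordination": 10.0}  # hypothetical ceilings

# Average the normalised return per reward stream.
scores = {
    name: mean(normalised_return(r, BASELINE, REFERENCE[name]) for r in returns)
    for name, returns in team_returns.items()
}

for name, value in scores.items():
    print(f"{name}: {value:.0%} normalised return")
```

Reporting the two streams side by side is what lets a team that scores well on base tasks still show up as weak on coordination, which is the pattern the summary attributes to GPT-5.4-High.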

Ablation studies identify communication as the largest single contributor to coordination performance, with memory and reasoning providing additional benefit when used to maintain multi-step plans. The paper concludes that coordination is a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities, and that Alem provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.
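The ablation logic can be sketched as follows: remove one component at a time, re-measure the coordination score, and rank components by how much the score drops. The function name and all scores are hypothetical placeholders chosen only to mirror the qualitative finding that communication matters most; the relative order of memory and reasoning here is arbitrary and not taken from the paper.

```python
def ablation_contributions(full_score, ablated_scores):
    """Rank components by the score drop observed when each one is removed."""
    drops = {name: full_score - score for name, score in ablated_scores.items()}
    # Largest drop first: that component contributed the most.
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


# Placeholder coordination scores (NOT paper data).
full = 0.30  # all components enabled
without = {"communication": 0.10, "memory": 0.22, "reasoning": 0.25}

ranking = ablation_contributions(full, without)
for component, drop in ranking:
    print(f"removing {component} costs {drop:.2f}")
```

This leave-one-out view measures each component in isolation, so it does not capture interactions such as memory helping mainly when communication is also available.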

Key facts

01 Alem is a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics.
02 13 modern LLMs were evaluated zero-shot in homogeneous teams, with trained MARL agents as reference points.
03 LLM agents average only ~6% normalised return on Alem, far from solving the benchmark.
04 On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps.
05 GPT-5.4-High achieves strong base-task reward but much lower coordination reward.
06 Ablations identify communication as the largest contributor to coordination performance.
07 Memory and reasoning help coordination when used to maintain multi-step plans.

Topics

#benchmarks #multi-agent #agent-framework #reasoning #open-source

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.
