Cooperative game profiles predict multi-agent LLM team performance
A study benchmarking 35 open-weight LLMs across six behavioral economics games finds that a model's cooperative profile reliably predicts how well it performs in real AI-for-Science multi-agent workflows.
Teams building multi-agent LLM pipelines can use behavioral economics game benchmarks as a cheap pre-screening tool to identify which open-weight models will cooperate effectively before investing in full-scale deployments.
The paper investigates whether stylized behavioral economics games can serve as predictive diagnostics for multi-agent LLM team performance. The researchers benchmarked 35 open-weight LLMs across six such games, each isolating a distinct cooperation mechanism, and derived a "cooperative profile" for each model. They then deployed teams of these LLMs on AI-for-Science workflows: collaborative tasks involving data analysis, model building, and scientific report generation, all conducted under shared resource constraints such as GPU or credit budgets.
The central finding is that game-derived cooperative profiles robustly predict downstream performance across three scientific report outcomes: accuracy, quality, and completion. Models that invested in multiplicative team production rather than pursuing greedy individual strategies consistently produced superior results. Critically, these associations remained significant after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs that cannot be reduced to general capability alone.
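To make the contrast between multiplicative team production and a greedy strategy concrete, here is a minimal sketch of one possible payoff rule. It is not the game used in the study: the function name `team_payoffs`, the endowment of 10, and the multiplier of 0.5 are all illustrative assumptions. Each agent keeps whatever it does not invest, and the team output, which is the product of all investments scaled by the multiplier, is split equally.

```python
from math import prod


def team_payoffs(investments, endowment=10.0, multiplier=0.5):
    """Payoffs in a toy multiplicative team-production game.

    Hypothetical rule (an assumption, not taken from the paper):
    team output = multiplier * product of all investments, shared equally.
    Each agent also keeps the part of its endowment it did not invest.
    """
    if any(x < 0 or x > endowment for x in investments):
        raise ValueError("each investment must lie in [0, endowment]")
    output = multiplier * prod(investments)
    share = output / len(investments)
    return [endowment - x + share for x in investments]


# Both agents invest everything: the product is large and both benefit.
both_invest = team_payoffs([10, 10])
# Both agents keep everything: nothing is produced.
both_greedy = team_payoffs([0, 0])
# One agent invests, the other withholds: the product collapses to zero.
mixed = team_payoffs([10, 0])
```

Under these assumed numbers, full mutual investment pays each agent 25, mutual withholding pays 10, and a lone investor is left with 0 while the withholding partner keeps 10. Because the output is a product, a single zero contribution wipes out the team's return, which is why this kind of game separates models that commit to joint production from those that hedge.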
For practitioners building multi-agent systems, the implication is practical: the behavioral games framework offers a fast and inexpensive way to screen LLMs for cooperative fitness before committing to expensive multi-agent deployments. Rather than running full end-to-end evaluations, teams could use these lightweight game-based diagnostics to identify which models are likely to collaborate effectively under shared constraints.
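A screening step of this kind could look roughly like the sketch below. Everything in it is hypothetical: the model names, the game labels, the scores (invented placeholders, not results from the study), and the choice to aggregate a profile with an unweighted mean. The study may weight or combine game scores quite differently.

```python
from statistics import mean


def shortlist(profiles, k=2):
    """Return the k models with the highest mean cooperation score.

    `profiles` maps a model name to a dict of per-game scores in [0, 1].
    Aggregating by unweighted mean is an assumption made for illustration.
    """
    aggregate = {model: mean(scores.values()) for model, scores in profiles.items()}
    return sorted(aggregate, key=aggregate.get, reverse=True)[:k]


# Placeholder profiles: invented numbers for three hypothetical models.
example_profiles = {
    "model_a": {"game_1": 0.6, "game_2": 0.7, "game_3": 0.5},
    "model_b": {"game_1": 0.9, "game_2": 0.8, "game_3": 0.7},
    "model_c": {"game_1": 0.2, "game_2": 0.3, "game_3": 0.1},
}

candidates = shortlist(example_profiles, k=2)
```

Here `candidates` holds the two models with the highest aggregate score, which would then go on to the more expensive end-to-end evaluation. A single aggregate discards the per-game structure of the profile, so a real pipeline might keep the full vector and filter on the games most relevant to the target workflow.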
Key facts
- 35 open-weight LLMs were benchmarked across six behavioral economics games.
- Game-derived cooperative profiles robustly predict performance in AI-for-Science multi-agent tasks.
- AI-for-Science tasks involved collaborative data analysis, model building, and scientific report generation under shared budget constraints.
- Performance was measured across three outcomes: accuracy, quality, and completion.
- Models investing in multiplicative team production outperformed those using greedy strategies.
- Cooperative disposition held as a distinct LLM property after controlling for multiple confounding factors.
- The framework is proposed as a fast, inexpensive diagnostic for screening LLMs before costly multi-agent deployment.