SocialGrid benchmark tests LLM agents on social reasoning
SocialGrid is a new embodied multi-agent benchmark inspired by Among Us. It shows that even the strongest open model tested (GPT-OSS-120B) achieves below 60% accuracy on task completion and planning, while deception detection remains near random chance.
Use SocialGrid's Planning Oracle and fine-grained metrics to pinpoint whether your agent's failures stem from navigation deficits or genuine social reasoning gaps — a critical distinction when building multi-agent systems that must detect or model deceptive behavior.
SocialGrid, introduced by Hikaru Shindo, Hanzhao Lin, and Lukas Helff, is an embodied multi-agent environment designed to stress-test LLM agents across three dimensions: planning, task execution, and social reasoning. Inspired by the social deduction game Among Us, the benchmark places agents in scenarios that require not only navigating a physical environment and completing tasks, but also detecting and reasoning about deceptive behavior from other agents. Evaluations reveal that even GPT-OSS-120B — described as the strongest open model tested — falls below 60% accuracy on task completion and planning, with agents frequently exhibiting repetitive behaviors or failing to clear basic navigational obstacles.
A key methodological contribution is the optional Planning Oracle, which removes navigation as a confounding variable so that social reasoning can be evaluated in isolation. Even with this assistance, agents detect deception at only near-random chance levels regardless of model scale, relying on shallow heuristics rather than accumulating and reasoning over behavioral evidence across time. This finding suggests that current LLMs lack the deeper theory-of-mind capabilities needed for genuine social intelligence in multi-agent settings.
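To make "accumulating behavioral evidence across time" concrete, here is a minimal sketch of one way an observer could do it: a Bayesian belief over which agent is the impostor, updated after each observation. This is an illustration only, not SocialGrid's method or metric. It assumes exactly one impostor, and the agent names, the likelihood ratios, and the functions `update_suspicion` and `accumulate` are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def update_suspicion(
    belief: Dict[str, float], agent: str, likelihood_ratio: float
) -> Dict[str, float]:
    """One Bayesian update under the assumption of exactly one impostor.

    likelihood_ratio = P(observation | agent is impostor)
                     / P(observation | agent is not impostor).
    Values above 1 raise suspicion of `agent`; values below 1 lower it.
    """
    weights = dict(belief)
    weights[agent] *= likelihood_ratio
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def accumulate(
    agents: List[str], observations: Iterable[Tuple[str, float]]
) -> Dict[str, float]:
    """Start from a uniform prior and fold in every observation in order."""
    belief = {name: 1.0 / len(agents) for name in agents}
    for agent, likelihood_ratio in observations:
        belief = update_suspicion(belief, agent, likelihood_ratio)
    return belief


# Hypothetical episode: two suspicious observations of "red",
# one reassuring observation of "blue".
belief = accumulate(
    ["red", "blue", "green"],
    [("red", 2.0), ("red", 2.0), ("blue", 0.5)],
)
most_suspicious = max(belief, key=belief.get)
```

A shallow heuristic would, by contrast, react only to the most recent observation. In this sketch each observation shifts the belief a little and the effects compound, so weak evidence can still add up.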
To support iterative improvement, SocialGrid provides automatic failure analysis and fine-grained metrics that help developers diagnose specific weaknesses in their agents. A competitive leaderboard using Elo ratings derived from adversarial league play is also established, enabling ongoing comparison across models and agent designs.
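For readers unfamiliar with Elo, the sketch below shows the standard two-player update rule. The source does not say how SocialGrid maps multi-agent league outcomes to ratings, so treat this as background on Elo in general, not as the benchmark's implementation. The K-factor of 32, the 400-point scale, and the function names are conventional defaults chosen here for illustration.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Ratings move by equal and opposite amounts, so the total is conserved.
    return rating_a + delta, rating_b - delta


# Two equally rated agents; A wins.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
```

An upset win against a higher-rated opponent yields a larger gain than a win against an equal, which is why ratings from repeated adversarial play can rank agents even when they have not all faced each other equally often.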
Key facts
- SocialGrid is an embodied multi-agent benchmark inspired by Among Us, evaluating LLM agents on planning, task execution, and social reasoning.
- GPT-OSS-120B, the strongest open model tested, achieves below 60% accuracy in task completion and planning.
- Agents frequently get stuck in repetitive behaviors or fail to navigate basic obstacles.
- An optional Planning Oracle isolates social reasoning from planning deficits by handling navigation.
- Even with planning assistance, agents detect deception at near-random chance regardless of model scale.
- Agents rely on shallow heuristics rather than accumulating behavioral evidence over time.