Apr 17, 2026 · 1 min read · Research Papers

SocialGrid benchmark tests LLM agents on social reasoning

SocialGrid is a new embodied multi-agent benchmark inspired by Among Us that reveals even the strongest tested LLM agent (GPT-OSS-120B) achieves below 60% accuracy on task completion and planning, while social deception detection remains near random chance.

ArXiv · Hikaru Shindo, Hanzhao Lin, Lukas Helff

Composite score: 6.5 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Use SocialGrid's Planning Oracle and fine-grained metrics to pinpoint whether your agent's failures stem from navigation deficits or genuine social reasoning gaps — a critical distinction when building multi-agent systems that must detect or model deceptive behavior.

Summary: our read of the original

SocialGrid, introduced by Hikaru Shindo, Hanzhao Lin, and Lukas Helff, is an embodied multi-agent environment designed to stress-test LLM agents across three dimensions: planning, task execution, and social reasoning. Inspired by the social deduction game Among Us, the benchmark places agents in scenarios that require not only navigating a physical environment and completing tasks, but also detecting and reasoning about deceptive behavior from other agents. Evaluations reveal that even GPT-OSS-120B — described as the strongest open model tested — falls below 60% accuracy on task completion and planning, with agents frequently exhibiting repetitive behaviors or failing to clear basic navigational obstacles.
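The repetitive-behavior failure described above can be flagged mechanically from an action trace. The sketch below is only an illustration: the function name, the action strings, and the `max_period` and `min_repeats` thresholds are assumptions, not part of SocialGrid's own failure analysis.

```python
from typing import Optional, Sequence


def find_repeated_cycle(
    actions: Sequence[str], max_period: int = 4, min_repeats: int = 3
) -> Optional[int]:
    """Return the shortest cycle length that the trace ends with, or None.

    A cycle of length `period` counts as a loop when the last
    `period * min_repeats` actions repeat the same pattern.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    n = len(actions)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if n < needed:
            continue
        tail = actions[-needed:]
        # Every action in the tail must match the action one cycle earlier.
        if all(tail[i] == tail[i % period] for i in range(needed)):
            return period
    return None


# Hypothetical trace: the agent oscillates between two moves.
trace = ["up", "left", "right", "left", "right", "left", "right"]
loop_period = find_repeated_cycle(trace)
```

A non-`None` result means the agent ended the episode in a loop of that length, which is one plausible way to separate "stuck" episodes from ordinary task failures.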

A key methodological contribution is the optional Planning Oracle, which handles navigation so that social reasoning can be evaluated in isolation from planning deficits. Even with this assistance, agents detect deception at only near-chance levels regardless of model scale, relying on shallow heuristics rather than accumulating and reasoning over behavioral evidence across time. This finding suggests that current LLMs lack the deeper theory-of-mind capabilities needed for genuine social intelligence in multi-agent settings.
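To make "accumulating behavioral evidence over time" concrete, here is a minimal sketch of a running log-odds update. It is not the paper's method: the observation names and likelihood ratios are invented for illustration, and the update assumes observations are independent, which real game events rarely are.

```python
import math

# Hypothetical likelihood ratios:
# P(observation | deceptive agent) / P(observation | honest agent).
# These values are illustrative assumptions, not numbers from the paper.
LIKELIHOOD_RATIOS = {
    "seen_doing_task": 0.5,
    "near_incident": 3.0,
    "claim_contradicted": 4.0,
}


def suspicion(prior: float, observations: list[str]) -> float:
    """Return the posterior probability that an agent is deceptive.

    Starts from the prior in log-odds form and adds the log likelihood
    ratio of each observation, so evidence accumulates across time steps.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for obs in observations:
        log_odds += math.log(LIKELIHOOD_RATIOS[obs])
    return 1.0 / (1.0 + math.exp(-log_odds))


# One ambiguous event versus a history of several events.
single = suspicion(0.25, ["near_incident"])
history = suspicion(0.25, ["near_incident", "seen_doing_task", "claim_contradicted"])
```

A shallow heuristic would react to the latest event alone. The accumulated estimate instead weighs every observation, including ones that lower suspicion.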

To support iterative improvement, SocialGrid provides automatic failure analysis and fine-grained metrics that help developers diagnose specific weaknesses in their agents. A competitive leaderboard using Elo ratings derived from adversarial league play is also established, enabling ongoing comparison across models and agent designs.
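For readers unfamiliar with Elo, the sketch below shows the standard two-player update. The summary does not say how SocialGrid adapts Elo to its multi-agent league, so the pairwise form, the K-factor of 32, and the starting rating of 1500 are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return updated ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    The K-factor default is a conventional value, assumed here.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated agents; A wins.
new_a, new_b = update_elo(1500.0, 1500.0, 1.0)
```

Because the update is zero-sum, the total rating across a league stays constant, which keeps ratings comparable as new models join.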

Key facts

01 SocialGrid is an embodied multi-agent benchmark inspired by Among Us, evaluating LLM agents on planning, task execution, and social reasoning.
02 GPT-OSS-120B, the strongest open model tested, achieves below 60% accuracy in task completion and planning.
03 Agents frequently get stuck in repetitive behaviors or fail to navigate basic obstacles.
04 An optional Planning Oracle isolates social reasoning from planning deficits by handling navigation.
05 Even with planning assistance, agents detect deception at near-random chance regardless of model scale.
06 Agents rely on shallow heuristics rather than accumulating behavioral evidence over time.
07 SocialGrid includes automatic failure analysis, fine-grained metrics, and an Elo-rated leaderboard from adversarial league play.

Topics

#benchmarks #multi-agent #agent-framework #reasoning #safety

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 20, 2026 · 13:29 UTC.
