ALMANAC dataset benchmarks LLMs on human collaborative mental models
Researchers introduce ALMANAC, a dataset of 2,987 action-level mental model annotations built from the Map Task, designed to evaluate how well LLMs can simulate human collaborative behavior and infer underlying mental models.
ALMANAC is presented as the first dataset with action-level mental model annotations grounded in authentic human collaboration. It offers a concrete benchmark for evaluating whether LLM agents can simulate the alignment of reasoning that effective human collaboration requires.
Jiaju Chen, Yuxuan Lu, and Jiayi Su argue that while LLM agents have grown capable of multi-step reasoning, planning, and tool use, they remain primarily optimized for task completion rather than genuine collaboration. Effective human collaboration requires continuously maintaining and aligning mental models — tracking one's own reasoning, a partner's intentions, and shared goals throughout an interaction — and the research community has lacked authentic human data annotated at this level of granularity.
To address this, the authors introduce ALMANAC (Action-Level Mental model ANnotations for Agent Collaboration), constructed from the Map Task, a well-established dyadic routing task from social science. The dataset contains 2,987 collaboration actions, each paired with theory-informed annotations recording three dimensions: self-reasoning, perceived partner intent, and perceived team goal. Six LLMs are benchmarked on two tasks — predicting humans' next-turn behavior and inferring their mental models — with results framing ALMANAC as an evaluation resource for measuring process-level collaborative competence in agents.
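As a rough illustration of the structure described above, one annotated action and the next-turn prediction task might be represented as follows. This is a minimal sketch only: the class names, field names, speaker labels, and the sample utterances are hypothetical and are not taken from the ALMANAC release, whose actual schema may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MentalModel:
    """The three annotation dimensions named in the summary (field names assumed)."""
    self_reasoning: str
    perceived_partner_intent: str
    perceived_team_goal: str


@dataclass(frozen=True)
class AnnotatedAction:
    """One collaboration action paired with its mental model annotation."""
    speaker: str        # hypothetical role label, e.g. "giver" or "follower"
    utterance: str
    mental_model: MentalModel


def next_turn_examples(
    dialogue: List[AnnotatedAction],
) -> List[Tuple[List[str], str]]:
    """Build (history, target utterance) pairs for next-turn prediction."""
    examples = []
    for i in range(1, len(dialogue)):
        history = [f"{a.speaker}: {a.utterance}" for a in dialogue[:i]]
        examples.append((history, dialogue[i].utterance))
    return examples


# Invented two-turn exchange, for illustration only.
dialogue = [
    AnnotatedAction(
        speaker="giver",
        utterance="Start at the top left, near the old mill.",
        mental_model=MentalModel(
            self_reasoning="Anchor the route on a landmark we likely share.",
            perceived_partner_intent="Partner is waiting for a starting point.",
            perceived_team_goal="Agree on where the route begins.",
        ),
    ),
    AnnotatedAction(
        speaker="follower",
        utterance="Okay, I see the mill.",
        mental_model=MentalModel(
            self_reasoning="Confirm the landmark before drawing.",
            perceived_partner_intent="Partner wants confirmation to continue.",
            perceived_team_goal="Agree on where the route begins.",
        ),
    ),
]

examples = next_turn_examples(dialogue)
```

Under these assumptions, the second benchmark task (inferring mental models) would use the same history as input and the `mental_model` fields of the target action as the prediction target.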
Key facts
- ALMANAC stands for Action-Level Mental model ANnotations for Agent Collaboration.
- The dataset contains 2,987 collaboration actions drawn from the Map Task, a classic dyadic routing task from social science.
- Each action is paired with theory-informed mental model annotations covering self-reasoning, perceived partner intent, and perceived team goal.
- The authors argue today's LLM agents are primarily optimized for task completion and lack process-level collaborative competence.
- Six LLMs are benchmarked on predicting humans' next-turn behavior and mental models.
- The paper is authored by Jiaju Chen, Yuxuan Lu, and Jiayi Su.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.