SocSci-Repro-Bench tests AI coding agents on 221 social science reproduction tasks
A new benchmark called SocSci-Repro-Bench, comprising 221 tasks across four disciplines, finds that Claude Code and Codex can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex.
The benchmark suggests that at least some frontier coding agents can reliably execute computational social science workflows. It also exposes prompt-framing vulnerabilities that could introduce bias into AI-assisted scientific production.
Alizadeh, Mosleh, and Gilardi introduce SocSci-Repro-Bench, a benchmark designed to systematically evaluate AI coding agents on scientific reproducibility tasks in the social sciences. The benchmark contains 221 tasks drawn from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data. This design lets the authors separate the agents' reproduction capacity from problems in the source materials themselves, a confound they identify in prior benchmarks.
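The two-group design described above can be illustrated with a minimal sketch that scores agent outcomes separately for reproducible and non-reproducible tasks. Everything in it is an assumption made for illustration: the `TaskResult` record, its field names, the `is_correct` scoring rule, and the demo records are hypothetical and are not taken from the benchmark or the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One agent attempt on one task (hypothetical record layout)."""
    task_id: str
    reproducible: bool      # ground truth: can the study be reproduced from the materials?
    agent_reported: bool    # did the agent claim to have reproduced the finding?
    matched_original: bool  # did the agent's numbers match the published result?


def is_correct(r: TaskResult) -> bool:
    # Assumed scoring rule: on a reproducible task the agent must report a
    # reproduction that matches the published result; on a non-reproducible
    # task the correct behaviour is to not report a reproduction at all.
    if r.reproducible:
        return r.agent_reported and r.matched_original
    return not r.agent_reported


def accuracy_by_stratum(results):
    """Return accuracy separately for reproducible and non-reproducible tasks."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = "reproducible" if r.reproducible else "non_reproducible"
        totals[key] += 1
        hits[key] += is_correct(r)
    return {key: hits[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Made-up illustrative records, not data from the benchmark.
    demo = [
        TaskResult("t1", True, True, True),
        TaskResult("t2", True, True, False),
        TaskResult("t3", False, False, False),
        TaskResult("t4", False, True, False),
    ]
    print(accuracy_by_stratum(demo))
```

Keeping the two groups apart means a failure on a reproducible task (the agent could not recover the result) is never pooled with a failure on a non-reproducible task (the agent claimed a reproduction that the materials cannot support).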
Evaluating Claude Code and Codex on this benchmark, the authors find that both agents can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest results are not primarily driven by memorization.
The study surfaces two important caveats for deploying agents in scientific workflows. First, providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. Second, agents can be nudged toward confirmatory specification search through subtle prompt framing. The authors conclude that at least some frontier coding agents can serve as reliable executors of computational workflows, while underscoring the need for careful benchmarking and prompt design as AI systems take on larger roles in scientific production.
Key facts
- SocSci-Repro-Bench contains 221 tasks spanning four disciplines and 13 substantive domains in the social sciences.
- The benchmark distinguishes fully reproducible studies from demonstrably non-reproducible ones (due to missing data) to isolate agent performance.
- Both Claude Code and Codex were evaluated; Claude Code substantially outperformed Codex.
- Reproduction rates for both agents considerably exceed those previously reported for general-purpose LLM-based agents on comparable benchmarks.
- Both agents performed strongly on a reasoning task requiring identification of underlying research questions.
- Additional analyses suggest results are not primarily driven by memorization.
- Providing the original paper PDF modestly improves performance but introduces bias on impossible-to-reproduce tasks; subtle prompt framing can nudge agents toward confirmatory specification search.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.