Jun 9, 2026 · 1 min read · Research Papers

SocSci-Repro-Bench tests AI coding agents on 221 social science reproduction tasks

A new benchmark called SocSci-Repro-Bench, comprising 221 tasks across four disciplines, finds that Claude Code and Codex can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex.

arXiv · Meysam Alizadeh, Mohsen Mosleh, Fabrizio Gilardi

Composite score: 6.7 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The benchmark suggests that at least some frontier coding agents can reliably execute computational social science workflows. It also exposes prompt-framing vulnerabilities that could introduce bias into AI-assisted scientific production.

Summary: our read of the original

Alizadeh, Mosleh, and Gilardi introduce SocSci-Repro-Bench, a benchmark designed to systematically evaluate AI coding agents on scientific reproducibility tasks in the social sciences. The benchmark contains 221 tasks drawn from studies whose results are either fully reproducible with the available materials or demonstrably non-reproducible due to missing data. This design lets the authors separate an agent's reproduction capacity from problems in the source materials themselves, a confound they identify in prior benchmarks.
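To make the two-bucket design concrete, here is a minimal Python sketch of how a harness might score an agent separately on reproducible and non-reproducible tasks. It is an illustration under stated assumptions, not the authors' evaluation code: the `Task` and `AgentResult` types, the convention that an agent signals "cannot reproduce" by reporting no value, and the 1% relative tolerance are all hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    reproducible: bool                 # False when required data are missing
    published_value: Optional[float]   # target estimate; None if not reproducible


@dataclass
class AgentResult:
    task_id: str
    reported_value: Optional[float]    # None = agent flagged the task as impossible


def is_correct(task: Task, result: AgentResult, rel_tol: float = 0.01) -> bool:
    """Assumed rule: match the published value, or decline when data are missing."""
    if not task.reproducible:
        # On a non-reproducible task the only correct answer is to decline.
        return result.reported_value is None
    if result.reported_value is None or task.published_value is None:
        return False
    return math.isclose(result.reported_value, task.published_value, rel_tol=rel_tol)


def score(tasks: list[Task], results: list[AgentResult]) -> dict[str, Optional[float]]:
    """Return the share of correct outcomes per bucket (None for an empty bucket)."""
    by_id = {r.task_id: r for r in results}
    buckets: dict[str, list[bool]] = {"reproducible": [], "non_reproducible": []}
    for task in tasks:
        result = by_id.get(task.task_id)
        ok = result is not None and is_correct(task, result)
        key = "reproducible" if task.reproducible else "non_reproducible"
        buckets[key].append(ok)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    tasks = [
        Task("a", True, 0.50),
        Task("b", True, 1.20),
        Task("c", False, None),
    ]
    results = [
        AgentResult("a", 0.501),  # within tolerance of the published value
        AgentResult("b", 2.0),    # wrong estimate
        AgentResult("c", 0.3),    # reported a number despite missing data
    ]
    print(score(tasks, results))
```

Reporting the two buckets separately keeps a confidently wrong answer on an impossible task from being hidden inside a single aggregate rate.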

Evaluating Claude Code and Codex on this benchmark, the authors find that both agents can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest results are not primarily driven by memorization.

The study surfaces two important caveats for deploying agents in scientific workflows. First, providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. Second, agents can be nudged toward confirmatory specification search through subtle prompt framing. The authors conclude that at least some frontier coding agents can serve as reliable executors of computational workflows, while underscoring the need for careful benchmarking and prompt design as AI systems take on larger roles in scientific production.

Key facts

01 SocSci-Repro-Bench contains 221 tasks spanning four disciplines and 13 substantive domains in the social sciences.
02 The benchmark distinguishes fully reproducible studies from demonstrably non-reproducible ones (due to missing data) to isolate agent performance.
03 Both Claude Code and Codex were evaluated; Claude Code substantially outperformed Codex.
04 Reproduction rates for both agents considerably exceed those previously reported for general-purpose LLM-based agents on comparable benchmarks.
05 Both agents performed strongly on a reasoning task requiring identification of underlying research questions.
06 Additional analyses suggest results are not primarily driven by memorization.
07 Providing the original paper PDF modestly improves performance but introduces bias on impossible-to-reproduce tasks; subtle prompt framing can nudge agents toward confirmatory specification search.

Topics

#benchmarks #coding-assistant #agent-framework #research #reproducibility

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.
