ReproRepo benchmarks LLM agents on real-world reproducibility audits
ReproRepo is a scalable framework that uses human-raised GitHub issues as natural supervision to evaluate whether LLM agents can identify reproducibility problems in ML research papers.
ReproRepo replaces the manual curation bottleneck of prior reproducibility benchmarks with a scalable, naturally occurring signal from GitHub issues, enabling ongoing large-scale evaluation of LLM agents on real-world ML paper auditing.
Shanda Li, Qiuhong Anna Wei, and Jingwu Tang present ReproRepo, a framework designed to scale reproducibility auditing of ML research by replacing expensive manual evaluation with human-raised GitHub issues as naturally occurring supervision. Prior benchmarks for assessing LLM agents on reproducibility tasks have struggled to scale because they depend on substantial manual effort for both data curation and evaluation; ReproRepo addresses this by treating real GitHub issues filed against paper repositories as ground-truth signals for reproduction blockers.
The framework is instantiated on 1,149 recent machine learning papers drawn from major conferences, and four frontier model-agent configurations are evaluated against it. The top-performing configuration, Codex with GPT-5.5, identifies at least one semantically related human-reported blocker for approximately 90% of papers in the study, and it does so without executing any code. Further analysis shows that agents are particularly effective at surfacing visible failures and pinpointing the correct semantic region of a problem, but still fall short on exact localization of issues. The authors release their code at https://github.com/LithiumDA/ReproRepo, positioning ReproRepo as a reusable, scalable platform for future evaluations of LLM agents on real-world reproducibility auditing tasks.
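The headline number is a per-paper hit rate: a paper counts as a hit when at least one agent finding relates to at least one human-reported issue. A minimal sketch of that computation follows. The `PaperAudit` record, its field names, and the keyword-overlap matcher `is_related` are all hypothetical stand-ins for illustration; the summary does not describe how the framework judges semantic relatedness, and this is not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PaperAudit:
    """Hypothetical record for one paper; field names are illustrative only."""
    paper_id: str
    human_issues: list[str] = field(default_factory=list)    # blockers taken from GitHub issues
    agent_findings: list[str] = field(default_factory=list)  # blockers reported by the agent


def is_related(finding: str, issue: str) -> bool:
    """Placeholder matcher: true when the two texts share at least two tokens.

    Assumption: this crude overlap check stands in for whatever semantic
    judgment the real framework applies.
    """
    shared = set(finding.lower().split()) & set(issue.lower().split())
    return len(shared) >= 2


def paper_hit(audit: PaperAudit) -> bool:
    """A paper is a hit if any finding relates to any human-reported issue."""
    return any(
        is_related(finding, issue)
        for finding in audit.agent_findings
        for issue in audit.human_issues
    )


def hit_rate(audits: list[PaperAudit]) -> float:
    """Fraction of papers with at least one related finding (0.0 if empty)."""
    if not audits:
        return 0.0
    return sum(paper_hit(a) for a in audits) / len(audits)


# Made-up example data, not drawn from the benchmark.
example_audits = [
    PaperAudit(
        "paper-1",
        human_issues=["missing requirements.txt file"],
        agent_findings=["requirements.txt file is missing from repo"],
    ),
    PaperAudit(
        "paper-2",
        human_issues=["checkpoint download link broken"],
        agent_findings=["undefined config key in train script"],
    ),
]
print(hit_rate(example_audits))
```

Because the metric only asks for one related match per paper, it rewards landing in the right semantic region and says nothing about whether the agent localized the exact file or line, which is consistent with the gap the analysis reports.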
Key facts
- ReproRepo uses human-raised GitHub issues as naturally occurring supervision for reproducibility blockers, reducing reliance on manual curation.
- The framework is instantiated on 1,149 recent ML papers from major conferences.
- Four frontier model-agent configurations are evaluated.
- The best agent, Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers.
- Agents achieve this without executing any code.
- Agents are effective at identifying visible failures and the correct semantic region, but fall short on exact localization.
- Code is publicly released at https://github.com/LithiumDA/ReproRepo.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.