Every processed story in reverse chronological order, with the newest coverage first. Filter by tag, source, or score to drill in.
ReproRepo replaces the manual curation bottleneck of prior reproducibility benchmarks with a scalable, naturally occurring signal from GitHub issues, enabling ongoing large-scale evaluation of LLM agents on real-world ML paper auditing.
AAA's single-interface design separates assessment logic from agent implementation, removing the heavy integration burden of existing LLM-centric harnesses and enabling reproducible, cross-agent comparisons that current fragmented benchmarks cannot support.
The benchmark reveals that frontier coding agents can reliably execute computational social science workflows, while also exposing prompt-framing vulnerabilities that could introduce bias into AI-assisted scientific production.