Archive · 3 stories · Jun 2026 · Updated 11:49 UTC

Archive

Every processed story in reverse chronological order, newest coverage first. Filter by tag, source, or score to drill in.

Total · all-time: 3

Avg score: 6.8 (▲ 1.1 vs all tags)

Verdict: Surging

Stories / month: peak 3 (chart axis: Jul 25 to Jun 26)

W25 · 1 story · Jun 15–21

6.7
Jun 16, 2026·Shanda Li, Qiuhong Anna Wei, Jingwu Tang·Research Papers·1 min read

ReproRepo benchmarks LLM agents on real-world reproducibility audits

ReproRepo replaces the manual curation bottleneck of prior reproducibility benchmarks with a scalable, naturally occurring signal from GitHub issues, enabling ongoing large-scale evaluation of LLM agents on real-world ML paper auditing.


W24 · 2 stories · Jun 8–14

7.0
Jun 11, 2026·Xiaoyuan Liu, Jianhong Tu, Yuqi Chen·Research Papers·1 min read
AgentBeats proposes agent-run benchmarking via A2A and MCP protocols
AAA's single-interface design separates assessment logic from agent implementation, removing the heavy integration burden of existing LLM-centric harnesses and enabling reproducible, cross-agent comparisons that current fragmented benchmarks cannot support.
6.7
Jun 9, 2026·Meysam Alizadeh, Mohsen Mosleh, Fabrizio Gilardi·Research Papers·1 min read
SocSci-Repro-Bench tests AI coding agents on 221 social science reproduction tasks
The benchmark reveals that frontier coding agents can reliably execute computational social science workflows, while also exposing prompt-framing vulnerabilities that could introduce bias into AI-assisted scientific production.
