Every processed story in chronological order, with the newest coverage first. Filter by tag, source, or score to drill in.
AAA's single-interface design separates assessment logic from agent implementation, removing the heavy integration burden of existing LLM-centric harnesses and enabling reproducible, cross-agent comparisons that current fragmented benchmarks cannot support.
The benchmark reveals that frontier coding agents can reliably execute computational social science workflows, while also exposing prompt-framing vulnerabilities that could introduce bias into AI-assisted scientific production.