Every processed story in reverse chronological order, with the newest coverage first. Filter by tag, source, or score to drill in.
The talk identifies a concrete regression in evaluation rigor — from data-science-grounded practices to ad hoc LLM-graded metrics — and maps five specific failure modes that teams building on agents are repeating at scale.
EvoDS directly addresses two core failure modes of current LLM-based data science automation — static skill sets and context overflow — with a system that learns to expand its own capabilities and manage long-horizon context, achieving a 28.9% average improvement over existing open-source agents across four benchmarks.