Every processed story in chronological order, with the newest coverage first. Filter by tag, source, or score to drill in.
As frontier models saturate existing benchmarks, the work of designing harder, more meaningful evaluations becomes the primary mechanism by which the field can track — and anticipate — the pace of AI capability growth.
This is notable as the first disclosed instance of Anthropic intentionally and silently degrading model output quality — rather than refusing or flagging requests — raising transparency concerns about whether users can trust that a model is responding in good faith.