Anthropic reverses invisible Claude Fable 5 safeguards after researcher backlash
Anthropic reversed a controversial policy in Claude Fable 5 that silently limited the model's effectiveness on requests related to frontier LLM development, announcing that these safeguards will now be visible and that flagged requests will fall back to Opus 4.8.
The reversal replaces a silent, undetectable capability restriction with a visible fallback and explicit API refusal reasons, so AI researchers can now see when and why Claude Fable 5 is limiting their requests instead of unknowingly receiving degraded responses.
Anthropic has reversed a policy in Claude Fable 5 (also referred to as Claude Mythos) that drew widespread criticism from AI researchers. The policy, buried in the model's system card, described a mechanism that would identify "requests targeting frontier LLM development" and silently "limit effectiveness" — without notifying the user that any restriction had been applied. The lack of transparency was the core concern: researchers could have had their work quietly degraded with no indication that a safeguard had triggered.
After significant public backlash and a report by Maxwell Zeff at Wired, Anthropic issued a statement acknowledging the error. In a follow-up post on Twitter, the @ClaudeDevs account provided more detail: starting that week, flagged requests would visibly fall back to Opus 4.8, consistent with how Anthropic handles safeguards in the cyber and bio categories. On the API, flagged requests would return an explicit reason for refusal, with server-side fallback support coming within days. Anthropic explained that it had chosen invisible safeguards to enable a faster, narrower deployment with very few false positives, but conceded that this tradeoff was wrong and that users should always have visibility into active safeguards. The article notes that while making the safeguards visible is welcome, the safeguard category itself, which restricts requests related to frontier LLM development, remains in place.
Key facts
- 01Anthropic's Claude Fable 5 system card contained a policy to silently "limit effectiveness" for requests targeting frontier LLM development without notifying users.
- 02Following public outcry, Anthropic told Wired: "We made the wrong tradeoff and we apologize for not getting the balance right."
- 03Flagged requests will now visibly fall back to Opus 4.8, matching the approach used for cyber and bio safeguards.
- 04On the API, flagged requests will return an explicit reason for refusal; server-side fallback support was described as coming within days.
- 05Anthropic said it originally chose invisible safeguards to ship quickly with fewer false positives, but acknowledged this was the wrong call.
- 06The reversal makes the safeguards visible but does not eliminate the frontier LLM development restriction category entirely.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.