Fable 5 jailbroken in under 24 hours, leaking Anthropic's full system prompt
Less than 24 hours after Anthropic launched Claude Fable 5 — its first publicly available Mythos-class model — researcher Pliny the Liberator (@elder_plinius) claimed to have broken through its safety classifiers and published what he alleged was the model's full ~120,000-character system prompt on GitHub.
The breach exposed Anthropic's full behavioral scaffolding for a model the company had previously deemed too dangerous for public access, turning the system prompt into a publicly available blueprint of its alignment strategy less than a day after launch.
Anthropic launched Claude Fable 5 on June 9, 2026, positioning it as the first Mythos-class model made safe for general public use. The model shares underlying weights with the restricted Claude Mythos 5 — accessible only to approved organizations through Project Glasswing — but ships with a classifier layer that intercepts queries in four domains (cybersecurity, biology, chemistry, and model distillation) and silently reroutes them to Claude Opus 4.8. Anthropic had spent two months restricting Mythos-class access due to concerns about the model's vulnerability-identification capabilities and the possibility of approaching recursive self-improvement. After more than 1,000 hours of internal and external red-teaming with no universal jailbreaks found, the company concluded a public release was achievable. The launch also coincided with Anthropic quietly filing IPO paperwork.
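The routing behavior described above can be pictured with a toy sketch. Nothing here reflects Anthropic's actual implementation, which the article does not describe: the model identifiers, the `classify` and `route` names, and the keyword lists are all hypothetical, and a real safety classifier would be a trained model rather than a keyword match. The sketch only shows the control flow: a query flagged in one of the four domains goes to the fallback model, everything else goes to the primary one.

```python
from enum import Enum
from typing import Optional


class Domain(Enum):
    """The four restricted domains named in the article."""
    CYBERSECURITY = "cybersecurity"
    BIOLOGY = "biology"
    CHEMISTRY = "chemistry"
    MODEL_DISTILLATION = "model distillation"


# Hypothetical identifiers, used only for illustration.
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"

# Placeholder keyword lists standing in for a real trained classifier.
_KEYWORDS = {
    Domain.CYBERSECURITY: ("exploit", "malware"),
    Domain.BIOLOGY: ("pathogen",),
    Domain.CHEMISTRY: ("synthesis route",),
    Domain.MODEL_DISTILLATION: ("distill",),
}


def classify(query: str) -> Optional[Domain]:
    """Return the first restricted domain the query matches, or None."""
    text = query.lower()
    for domain, words in _KEYWORDS.items():
        if any(word in text for word in words):
            return domain
    return None


def route(query: str) -> str:
    """Pick the model that should answer: fallback if flagged, else primary."""
    return FALLBACK_MODEL if classify(query) is not None else PRIMARY_MODEL


if __name__ == "__main__":
    for q in ("Write a haiku about autumn", "Analyze this malware sample"):
        print(q, "->", route(q))
```

Because the caller only ever sees the returned answer, a reroute of this kind is invisible from outside, which matches the article's description of queries being rerouted silently.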
Fable 5's benchmarks are notable: it scores 80.3% on SWE-Bench Pro (compared to 69.2% for Opus 4.8 and 58.6% for GPT-5.5), 64.5% on Humanity's Last Exam with tools (versus 52.2% for GPT-5.5), ranks #1 on Cognition's FrontierCode evaluation, and sits second overall across 123 models on BenchLM. Pricing is $10 per million input tokens and $50 per million output tokens, with a 1M token context window and 128K output ceiling. The model was designed for long-horizon agentic tasks running hours or days.
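To make the pricing concrete, here is a minimal cost calculation using only the figures quoted above ($10 and $50 per million input and output tokens, a 1M-token context window, a 128K output ceiling). The function and constant names are illustrative, 128K is taken as 128,000 tokens, and the sketch assumes simple linear per-token billing with no caching or batch discounts. How input and output tokens share the context window is not stated in the source, so the two limits are checked separately.

```python
INPUT_USD_PER_MILLION = 10.0    # quoted input price
OUTPUT_USD_PER_MILLION = 50.0   # quoted output price
MAX_CONTEXT_TOKENS = 1_000_000  # quoted context window
MAX_OUTPUT_TOKENS = 128_000     # quoted output ceiling (128K read as 128,000)


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request under linear per-token pricing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > MAX_CONTEXT_TOKENS:
        raise ValueError("input exceeds the quoted context window")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the quoted output ceiling")
    input_cost = input_tokens * INPUT_USD_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # A mid-sized agentic step: 200K tokens in, 20K tokens out.
    print(f"${request_cost_usd(200_000, 20_000):.2f}")
    # The largest single request the quoted limits allow.
    print(f"${request_cost_usd(MAX_CONTEXT_TOKENS, MAX_OUTPUT_TOKENS):.2f}")
```

Under these assumptions the first call costs $3.00 and the largest possible single request costs $16.40, so a long-horizon task that makes hundreds of such calls can run into the thousands of dollars.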
That confidence lasted roughly 24 hours. On June 10, Pliny the Liberator (@elder_plinius) announced on X that he had broken through the classifiers, and he posted a GitHub link to what he claimed was Fable 5's full system prompt: approximately 120,000 characters of internal instructions defining the model's behavior, refusals, and reasoning. The article, by Syed Ahmer Shah, argues that the system prompt leak is the most consequential element of the story because it functions as a reverse-engineered map of Anthropic's alignment strategy, one now accessible to safety researchers, adversarial researchers, and others. The source text is truncated before its account of subsequent events concludes.
Key facts
- 01 Claude Fable 5 launched June 9, 2026 as Anthropic's first publicly available Mythos-class model.
- 02 Fable 5 and the restricted Claude Mythos 5 share the same underlying weights; the difference is a classifier safety layer on top.
- 03 The classifier layer intercepts queries in four domains (cybersecurity, biology, chemistry, and model distillation) and reroutes them to Claude Opus 4.8.
- 04 Anthropic conducted over 1,000 hours of internal and external red-teaming and reported no universal jailbreaks found before launch.
- 05 Within 24 hours of launch, Pliny the Liberator (@elder_plinius) claimed to have bypassed all safety classifiers and posted an alleged full system prompt (~120,000 characters) to GitHub.
- 06 Fable 5 scores 80.3% on SWE-Bench Pro, versus 69.2% for Opus 4.8 and 58.6% for GPT-5.5.
- 07 Pricing is $10 per million input tokens and $50 per million output tokens, with a 1M token context window and 128K output ceiling.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.