Fable 5 tops Ship-Bench with a 96.49 average, beating best-in-class composite
Jason Agostoni ran Anthropic's Fable 5 through Ship-Bench against a composite "Best-in-Class" lineup and found it averaged 96.49 across all five SDLC roles, edging out the previous top average of 94.18 set by DeepSeek v4 Pro — though the run cost nearly $180 in API spend.
Score breakdown
Fable 5 is the first model to outscore a cherry-picked composite of best-in-class specialists across a full multi-turn SDLC workflow on Ship-Bench. The run cost nearly $180 in API spend, however, so the article leaves its practical viability as an open cost-versus-reliability question.
Jason Agostoni ran Anthropic's Fable 5 through Ship-Bench `v1` (commit `0e7cc28`) on a Windows 11 machine running Node `v24`, using Claude Code `v2.1.177` as the harness and Opus 4.8 medium as the judge in LLM-plus-human-review evaluation mode. The benchmark task was a simplified knowledge base app, the same task used in all prior runs so that scores stay comparable. The composite "Best-in-Class" opponent was not a single model but an aggregate of the best single-role scores recorded to date: Sonnet 4.6 for Architect, DeepSeek v4 Pro for UX Designer, Developer, and Reviewer, and Gemini 3.5 Flash for Planner.
Fable 5 scored 100 in Architect (the first perfect score recorded in that phase), 98.57 in UX Designer, 97.20 in Planner, 97.37 in Developer, and 89.29 in Reviewer, for an overall average of 96.49 against the composite's 95.87. It narrowly lost the UX Designer, Planner, and Developer rounds to their respective specialists, but its perfect Architect score and its Reviewer score of 89.29, a new high in a phase where every other model has struggled, pushed its average above the composite's.
The article attributes Fable 5's Architect result to precise dependency version pinning and a local-first, zero-service SQLite design with WAL and `busy_timeout`, and notes that this level of upfront planning likely underpins its reputation for one-shotting application builds.
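The article does not reproduce the code Fable 5 generated, so the following is only a minimal sketch of what a SQLite connection with WAL and `busy_timeout` can look like, written in Python with the standard-library `sqlite3` module. The function names, the file path, and the 5000 ms default are illustrative assumptions, not details from the benchmark run.

```python
import sqlite3


def open_local_db(path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open a file-backed SQLite database configured for local-first use.

    Hypothetical helper: the name and the 5000 ms default are assumptions.
    """
    conn = sqlite3.connect(path)
    # WAL (write-ahead logging) lets readers keep reading while a single
    # writer appends to the log. The mode is stored in the database file.
    conn.execute("PRAGMA journal_mode=WAL")
    # busy_timeout makes a blocked connection wait up to N milliseconds
    # for a lock instead of failing at once with "database is locked".
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    return conn


def current_settings(conn: sqlite3.Connection) -> tuple:
    """Return (journal_mode, busy_timeout_ms) as reported by SQLite."""
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    busy_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
    return journal_mode, busy_timeout
```

A design like this needs no separate database service: the application opens a local file, WAL keeps reads from blocking behind a write, and `busy_timeout` absorbs brief lock contention rather than surfacing it as an error.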
The primary practical concern raised is cost: the run totaled nearly $180, driven by massive cache token volumes during the implementation phase. The article frames the viability question as whether near-flawless reliability justifies that level of API spend.
Key facts
- Fable 5 scored an overall average of 96.49 on Ship-Bench, beating the composite Best-in-Class average of 95.87
- It achieved a perfect 100 in the Architect phase, the first model to do so in this benchmark
- Fable 5 scored 89.29 in the Reviewer role, a new high bar; the composite Best-in-Class (DeepSeek v4 Pro) scored 85.00 in that phase
- The previous top overall average was 94.18, held by DeepSeek v4 Pro
- The composite Best-in-Class lineup drew from Sonnet 4.6, DeepSeek v4 Pro, and Gemini 3.5 Flash across different roles
- The run cost nearly $180, driven by large cache token volumes during the implementation phase
- The judge model used was Opus 4.8 medium, with evaluation combining LLM scoring and human review
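The headline averages can be checked against the five per-role scores reported for Fable 5 (100, 98.57, 97.20, 97.37, 89.29). The sketch below assumes the overall figure is an unweighted mean of the five roles, which is consistent with the published 96.49 but is not stated explicitly in the summary. The composite's per-role scores other than Reviewer are not given, so only its average (95.87) and its Reviewer score (85.00) are used, and the results are approximate because the inputs are rounded.

```python
# Per-role scores for Fable 5 as reported in the summary.
FABLE_SCORES = {
    "Architect": 100.00,
    "UX Designer": 98.57,
    "Planner": 97.20,
    "Developer": 97.37,
    "Reviewer": 89.29,
}

# Composite Best-in-Class figures reported in the summary.
COMPOSITE_AVERAGE = 95.87
COMPOSITE_REVIEWER = 85.00


def unweighted_mean(scores: dict) -> float:
    """Assumed aggregation: plain mean across the five roles."""
    return sum(scores.values()) / len(scores)


fable_average = unweighted_mean(FABLE_SCORES)
roles = len(FABLE_SCORES)

# Fable 5's lead over the composite, expressed in total points across roles.
total_margin = (fable_average - COMPOSITE_AVERAGE) * roles

# Points gained in the Reviewer role alone.
reviewer_gain = FABLE_SCORES["Reviewer"] - COMPOSITE_REVIEWER

# Net result across the other four roles combined (negative means the
# composite was ahead there, even counting Fable 5's Architect win).
other_roles_net = total_margin - reviewer_gain

print(f"Fable 5 average: {fable_average:.2f}")             # 96.49
print(f"Total margin over composite: {total_margin:.2f}")
print(f"Reviewer gain: {reviewer_gain:.2f}")
print(f"Net across other four roles: {other_roles_net:.2f}")
```

Under these assumptions the Reviewer gain is larger than the overall margin, which matches the summary's account: Fable 5 trailed slightly across the remaining roles taken together, and the Reviewer phase is what carried it past the composite.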
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 15, 2026 · 11:57 UTC.