Theo reviews Opus 4.7: impressive but inconsistent
Theo (t3.gg) spent a full day testing Anthropic's newly released Opus 4.7 and found it a genuinely strange model — strong on agentic coding benchmarks but weaker than Opus 4.6 on others, with performance that seemed to degrade noticeably during extended use.
Developers considering Opus 4.7 for agentic coding pipelines should note its benchmark regressions on search tasks and reported in-session performance degradation before routing long-running or search-heavy workloads to it.
Anthropic released Opus 4.7 as a generally available model, positioning it as a meaningful step up from Opus 4.6 for advanced software engineering — particularly on the hardest coding tasks. According to Anthropic's own framing, users can hand off complex, long-running work to Opus 4.7 with confidence, as the model pays close attention to instructions and attempts to verify its own outputs before reporting back. Vision capabilities also received an upgrade, with the model able to process images at greater resolution and produce higher-quality interfaces, slides, and documents.
However, Theo's review complicates that picture considerably. He notes that Opus 4.7 is not Anthropic's best model (Claude Mythos preview holds that position) and that the benchmark chart for Opus 4.7 contains fewer bold (best-in-class) numbers than he has seen for any comparable release. The only two categories where Opus 4.7 leads are ones where Mythos has no listed score. More critically, Opus 4.7 scores *below* Opus 4.6 on several benchmarks, including the Agentic Search bench, which Theo says matches his lived experience: the model made strange and questionable search decisions during testing. He also observed what he described as real-time performance regression, with the model appearing to get worse the more it was used in a session. In his assessment, that makes it one of the "weirdest models ever released," even as he acknowledged that he does see himself using it going forward.
Key facts
- Opus 4.7 is now generally available from Anthropic.
- Anthropic describes it as a notable improvement over Opus 4.6 for advanced software engineering, with particular gains on the most difficult tasks.
- Opus 4.7 is explicitly described as less broadly capable than Claude Mythos preview, Anthropic's most powerful model.
- Opus 4.7 scores worse than Opus 4.6 on several benchmarks, including the Agentic Search bench.
- Opus 4.7 leads benchmark scores only in two categories where Mythos has no listed score; it does show better results on SWE-bench Pro and SWE-bench Verified.
- Vision capabilities improved, with the model processing images at greater resolution.
- Theo observed apparent performance degradation during extended use, describing the model as getting "dumber the more you do it."