Theo reviews Opus 4.7: impressive but inconsistent
Theo (t3.gg) spent a full day testing Anthropic's newly released Opus 4.7 and found it a genuinely strange model — better at agentic coding than its predecessor but worse on some benchmarks and prone to odd, regressive behavior the longer it runs.
Developers considering Opus 4.7 for agentic coding pipelines should be aware of its uneven benchmark profile — strong on SWE-bench tasks but weaker on agentic search — and watch for potential quality degradation in long-running sessions before committing it to unsupervised workflows.
Anthropic has released Opus 4.7 as a generally available model, positioning it as a meaningful step up from Opus 4.6 for advanced software engineering — especially on the hardest, most complex long-running tasks. According to Anthropic's own framing (as read by Theo), the model pays precise attention to instructions, devises ways to verify its own outputs, and brings substantially improved vision capabilities with higher image resolution support. It also produces higher-quality creative professional outputs like interfaces, slides, and docs.
However, Theo's day-long testing revealed a more complicated picture. Opus 4.7 is explicitly not Anthropic's most powerful model; that is Claude Mythos preview, which remains unreleased to the public. In the benchmark comparison charts, Opus 4.7 earns the top score in only two categories, and both happen to be ones where Mythos has no listed score. More tellingly, Opus 4.7 scores *below* Opus 4.6 on several benchmarks, including the Agentic Search bench, a regression Theo says he witnessed firsthand through strange search behavior during testing. He also noted that the model seemed to get "dumber" the longer a session ran, and described watching it regress in real time.
Despite these quirks, Theo says he does see himself using Opus 4.7 going forward, particularly for agentic coding workflows, where its SWE-bench Pro and Verified scores suggest genuine strength. He cautions, though, that those benchmarks have faced some contamination concerns.
Key facts
- Opus 4.7 is now generally available from Anthropic.
- Anthropic describes it as a notable improvement over Opus 4.6 for advanced software engineering, especially on difficult long-running tasks.
- Opus 4.7 is NOT Anthropic's most capable model. Claude Mythos preview holds that title but is not publicly released.
- The model's strongest results are on SWE-bench Pro and Verified, but it scores below Opus 4.6 on several other benchmarks, including the Agentic Search bench.
- Opus 4.7 features substantially improved vision with higher image resolution support.
- Theo observed the model appearing to degrade in quality the longer a session continued.
- Theo notes that the agentic coding benchmarks have faced some contamination concerns, which reduces confidence in those numbers.