Claude Opus 4.7 scores higher on OpenCode than on the Claude Code harness
Artificial Analysis's Coding Agent Leaderboard benchmarks real-world coding agent performance across harnesses, showing that Claude Opus 4.7 scores higher when run on OpenCode than on Claude Code.
Score breakdown
The harness comparison shows that the same model (Claude Opus 4.7) produces different benchmark scores depending on which coding-agent harness runs it. Harness choice, not just model choice, therefore affects real-world coding agent performance.
Artificial Analysis maintains a Coding Agent Leaderboard that evaluates coding agents on three benchmarks: DeepSWE (software engineering tasks, 113 tasks, by DataCurve), Terminal-Bench v2 (agentic terminal use, 84 tasks, by Laude Institute), and SWE-Atlas-QnA (technical Q&A, 124 tasks, by Scale AI). The composite Artificial Analysis Coding Agent Index represents the average pass@1 across three runs of each benchmark, and was recently updated to v1.1.
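The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not Artificial Analysis's actual scoring code. It assumes the three benchmarks are weighted equally (the article says only "average pass@1"), and the pass counts in `example` are hypothetical numbers chosen for the demonstration. Only the task counts come from the article.

```python
from statistics import mean

# Task counts per benchmark, as stated in the article.
TASK_COUNTS = {"DeepSWE": 113, "Terminal-Bench v2": 84, "SWE-Atlas-QnA": 124}


def pass_at_1(passed: int, total: int) -> float:
    """Fraction of tasks solved on the first attempt in a single run."""
    return passed / total


def coding_agent_index(passes_per_run: dict[str, list[int]]) -> float:
    """Average pass@1 over the runs of each benchmark, then over benchmarks.

    Assumption: each benchmark carries equal weight in the composite,
    regardless of how many tasks it contains.
    """
    per_benchmark = []
    for name, passes in passes_per_run.items():
        total = TASK_COUNTS[name]
        per_benchmark.append(mean(pass_at_1(p, total) for p in passes))
    return mean(per_benchmark)


# Hypothetical pass counts for three runs of each benchmark (not real results).
example = {
    "DeepSWE": [60, 62, 58],
    "Terminal-Bench v2": [42, 40, 44],
    "SWE-Atlas-QnA": [80, 82, 78],
}

index = coding_agent_index(example)
print(f"Composite index (fraction of tasks passed): {index:.3f}")
```

Under a task-weighted average, the larger benchmarks would count for more, so the two conventions can give slightly different composites for the same runs.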
A key feature of the leaderboard is a harness comparison that holds the underlying model constant and measures how it performs across different agent harnesses. For Claude Opus 4.7 specifically, this comparison spans Claude Code, Cursor CLI, and OpenCode, and the data shows OpenCode outperforming Claude Code on the composite index for that model. The leaderboard also tracks token usage in detail, breaking down input, cached input, and output tokens per task. It notes that prompt cache hit rates can vary significantly depending on provider routing, which can materially affect effective cost.
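To see why the cache hit rate matters for cost, consider a rough sketch of per-task cost built from the three token categories. Everything here is an assumption for illustration: the per-million-token prices are hypothetical, the token counts are invented, and the code treats "input" and "cached input" as disjoint counts, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    input_tokens: int         # input tokens NOT served from the prompt cache
    cached_input_tokens: int  # input tokens served from the prompt cache
    output_tokens: int


@dataclass
class Prices:
    # Hypothetical prices in USD per million tokens.
    input: float
    cached_input: float
    output: float


def cache_hit_rate(usage: TokenUsage) -> float:
    """Share of all input tokens that were served from the cache."""
    total_in = usage.input_tokens + usage.cached_input_tokens
    return usage.cached_input_tokens / total_in if total_in else 0.0


def task_cost(usage: TokenUsage, prices: Prices) -> float:
    """Effective cost in USD for one task."""
    return (
        usage.input_tokens * prices.input
        + usage.cached_input_tokens * prices.cached_input
        + usage.output_tokens * prices.output
    ) / 1_000_000


prices = Prices(input=10.0, cached_input=1.0, output=50.0)

# Same total input and output volume, different cache hit rates.
high_cache = TokenUsage(input_tokens=100_000, cached_input_tokens=900_000, output_tokens=20_000)
low_cache = TokenUsage(input_tokens=600_000, cached_input_tokens=400_000, output_tokens=20_000)

high_cost = task_cost(high_cache, prices)
low_cost = task_cost(low_cache, prices)
print(f"hit rate {cache_hit_rate(high_cache):.0%}: ${high_cost:.2f}")
print(f"hit rate {cache_hit_rate(low_cache):.0%}: ${low_cost:.2f}")
```

With these made-up numbers, the same workload costs more than twice as much at the lower hit rate, which is the kind of routing-dependent variation the leaderboard's note points to.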
Key facts
- The Artificial Analysis Coding Agent Index is a composite average pass@1 across three benchmarks: DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA.
- DeepSWE covers 113 software engineering tasks (by DataCurve); Terminal-Bench v2 covers 84 agentic terminal tasks (by Laude Institute); SWE-Atlas-QnA covers 124 technical Q&A tasks (by Scale AI).
- The index averages pass@1 across 3 runs of each benchmark and was recently updated to v1.1.
- A harness comparison holds Claude Opus 4.7 constant and compares its scores across Claude Code, Cursor CLI, and OpenCode.
- OpenCode produces a higher Coding Agent Index score for Claude Opus 4.7 than Claude Code does.
- Token usage tracking breaks down input, cached input, and output tokens, with cache hit rates noted as variable due to provider routing.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.