Claw-SWE-Bench adds harness and cost axes to coding-agent evaluation
Mengyu Zheng, Kai Han, and Boxun Li introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses fairly comparable on coding tasks, revealing that adapter design can swing Pass@1 from 19.1% to 73.4% on the same model.
Score breakdown
The benchmark demonstrates that adapter/harness design can swing Pass@1 by over 54 percentage points on the same model, showing that existing SWE-bench evaluations of general-purpose agents conflate harness quality with model capability — a gap Claw-SWE-Bench is designed to isolate.
Mengyu Zheng, Kai Han, and Boxun Li present Claw-SWE-Bench, a benchmark and adapter protocol that addresses a gap in coding-agent evaluation: general-purpose agents such as OpenClaw do not natively satisfy the Docker workspace, patch, and prediction contract required by SWE-bench scoring. The framework introduces a standardized set of constraints — fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator — so that heterogeneous agent harnesses (called "claws") can be compared on equal footing. The full benchmark draws 350 GitHub issue-resolution instances from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup, spanning 8 programming languages and 43 repositories. A companion Claw-SWE-Bench Lite subset of 80 instances is selected via a cost-aware, rank-aware procedure over 17 calibration columns for faster iteration.
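The fixed constraints described above can be pictured as a small shared contract that every harness receives, plus a common prediction record. The sketch below is illustrative only: `RunContract`, `ClawAdapter`, `make_prediction`, and all field values are hypothetical names chosen for this example, not the paper's actual interface. The three prediction keys follow the layout SWE-bench prediction files use.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RunContract:
    """Settings held identical for every harness (hypothetical field names)."""
    prompt_template: str   # the same task prompt for every claw
    runtime_budget_s: int  # wall-clock limit per instance, in seconds
    workspace_dir: str     # where the repository is mounted in the container


class ClawAdapter(Protocol):
    """Hypothetical interface: attempt an issue and leave edits in the workspace."""
    name: str

    def solve(self, issue_text: str, contract: RunContract) -> None: ...


def make_prediction(instance_id: str, adapter_name: str, patch: str) -> dict:
    """Package an extracted patch using the SWE-bench prediction keys."""
    return {
        "instance_id": instance_id,
        "model_name_or_path": adapter_name,
        "model_patch": patch,
    }


# Example values are invented for illustration.
contract = RunContract(
    prompt_template="Resolve this issue:\n{issue}",
    runtime_budget_s=1800,
    workspace_dir="/workspace",
)
prediction = make_prediction("example__repo-1", "demo-claw", "diff --git a/x b/x\n")
```

Freezing the contract makes it harder for an individual adapter to quietly change the prompt or budget, which is the kind of drift a shared protocol is meant to rule out.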
The paper's central empirical finding is that, within the OpenClaw framework, adapter design can matter more than model choice: a minimal direct-diff adapter achieves only 19.1% Pass@1, while the full adapter reaches 73.4% Pass@1, a gap of 54.3 percentage points on the same GLM 5.1 backbone. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice shifts Pass@1 by 29.4 percentage points, and harness choice shifts it by 27.4 percentage points with the model held fixed. The authors also highlight that systems with comparable accuracy can differ substantially in total API cost, motivating cost accounting as a first-class evaluation dimension alongside accuracy. The benchmark data and Lite subset are publicly available on GitHub and Hugging Face.
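To make the arithmetic concrete, here is a minimal sketch that computes the adapter gap from the two reported Pass@1 scores and shows one possible way to put cost next to accuracy. The helper names are hypothetical, and the dollar figures and resolved counts in the cost example are invented for illustration; they are not results from the paper.

```python
def pp_gap(low_pct: float, high_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(high_pct - low_pct, 1)


def cost_per_resolved(total_cost_usd: float, resolved: int) -> float:
    """Total API spend divided by the number of resolved instances."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return total_cost_usd / resolved


# Reported scores: minimal direct-diff adapter vs. full adapter, same model.
adapter_gap = pp_gap(19.1, 73.4)

# Invented numbers: two systems that resolve the same number of instances
# but spend very different amounts to get there.
cheap_system = cost_per_resolved(total_cost_usd=100.0, resolved=200)
costly_system = cost_per_resolved(total_cost_usd=400.0, resolved=200)

print(f"adapter gap: {adapter_gap} pp")
print(f"cost per resolved instance: ${cheap_system:.2f} vs ${costly_system:.2f}")
```

Dividing spend by resolved instances is only one way to summarize cost; the point is that two systems with identical Pass@1 can look very different once spend is reported alongside it.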
Key facts
- 01 Claw-SWE-Bench is a multilingual SWE-bench-style benchmark and adapter protocol for evaluating OpenClaw-style agent harnesses on coding tasks.
- 02 The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories.
- 03 Claw-SWE-Bench Lite is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns.
- 04 OpenClaw with a minimal direct-diff adapter scores 19.1% Pass@1; the full adapter scores 73.4% Pass@1, both using the same GLM 5.1 backbone.
- 05 Model choice changes Pass@1 by 29.4 percentage points; harness choice changes it by 27.4 percentage points under fixed models.
- 06 Systems with similar accuracy can differ substantially in total API cost, making cost a first-class evaluation axis.
- 07 Data is publicly available on GitHub and Hugging Face.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.