RHO lets AI agents self-optimize without labeled data
Researchers introduce Retrospective Harness Optimization (RHO), a self-supervised method that improves AI agent harnesses using only past trajectories, boosting SWE-Bench Pro pass rates from 59% to 78% in a single optimization round.
RHO demonstrates that AI agents can meaningfully self-improve their harness without any labeled validation data, removing a key bottleneck for deploying and continuously optimizing agents in practical settings.
Wenbo Pan, Shujie Liu, and Chin-Yew Lin present Retrospective Harness Optimization (RHO), a self-supervised method designed to continuously improve the skills, tools, and workflows — collectively called the "harness" — that AI agents rely on to solve complex problems. The core challenge RHO addresses is that existing optimization methods typically require ground-truth validation sets, which are difficult to obtain in real-world deployment. RHO sidesteps this requirement by operating entirely on past trajectories: it selects a diverse coreset of challenging tasks from prior runs, re-solves them in parallel, and then uses self-validation and self-consistency to analyze the resulting rollouts. The agent then generates candidate harness updates and selects the most effective one through pairwise self-preference — no external grader required.
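The three selection steps described above can be illustrated with a small Python sketch. This is not the authors' implementation: the `Trajectory` fields, the greedy tag-coverage heuristic for the coreset, and the `prefer` callback (a stand-in for the agent's own pairwise judgment) are all hypothetical choices made for illustration, since the summary does not specify how RHO scores diversity, difficulty, or preference.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one past agent run."""
    task_id: str
    tags: frozenset      # assumed task descriptors, used as a diversity signal
    difficulty: float    # assumed proxy for how challenging the task was


def select_coreset(trajectories: Sequence[Trajectory], k: int) -> List[Trajectory]:
    """Greedily pick up to k tasks, preferring uncovered tags, then difficulty."""
    chosen: List[Trajectory] = []
    covered: set = set()
    pool = list(trajectories)
    while pool and len(chosen) < k:
        best = max(pool, key=lambda t: (len(t.tags - covered), t.difficulty))
        chosen.append(best)
        covered |= best.tags
        pool.remove(best)
    return chosen


def self_consistency(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the majority answer across rollouts and its agreement rate."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def pick_by_pairwise_preference(
    candidates: Sequence[str], prefer: Callable[[str, str], str]
) -> str:
    """Round-robin comparison: keep the candidate with the most pairwise wins."""
    wins = Counter({c: 0 for c in candidates})
    for a, b in combinations(candidates, 2):
        wins[prefer(a, b)] += 1
    return max(candidates, key=lambda c: wins[c])


# Toy data, invented purely to exercise the functions.
past_runs = [
    Trajectory("t1", frozenset({"python", "refactor"}), 0.9),
    Trajectory("t2", frozenset({"python", "refactor"}), 0.8),
    Trajectory("t3", frozenset({"sql"}), 0.4),
]
coreset = select_coreset(past_runs, k=2)
majority, agreement = self_consistency(["patch_a", "patch_a", "patch_b"])
best_update = pick_by_pairwise_preference(
    ["add_lint_step", "add_test_runner", "no_change"],
    prefer=lambda a, b: max(a, b, key=len),  # placeholder for the agent's judgment
)
```

In this toy run the coreset skips `t2` because its tags are already covered by `t1`, which is one simple way to trade difficulty against diversity. In a real system the `prefer` callback would be the agent comparing two candidate harness updates, and a low agreement rate from `self_consistency` could flag a task where the rollouts disagree.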
The method is evaluated across three diverse domains spanning software engineering, technical work, and knowledge work. The headline result is a single optimization round lifting the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Beyond raw performance, the paper's analysis shows that RHO specifically targets prior failure modes, alters the agent's behavior patterns, and sustains higher accuracy during long-horizon sessions — suggesting the improvements are structural rather than incidental.
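To put the headline numbers in perspective, the short Python sketch below derives three views of the same change from the two reported pass rates. Only the 59% and 78% figures come from the summary; the function name and the derived quantities are illustrative additions.

```python
def pass_rate_deltas(before: float, after: float) -> dict:
    """Express a pass-rate change as absolute, relative, and failure-reduction terms."""
    gain = after - before
    return {
        "absolute_points": gain * 100,           # percentage-point gain
        "relative_gain": gain / before,          # gain relative to the old pass rate
        "failure_reduction": gain / (1 - before),  # share of prior failures removed
    }


deltas = pass_rate_deltas(0.59, 0.78)
```

With these inputs the gain is 19 percentage points, which corresponds to roughly a 32% relative increase in pass rate and roughly a 46% reduction in the failure rate (from 41% to 22%).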
Key facts
- RHO is a self-supervised method that optimizes an agent's harness using only past trajectories, requiring no ground-truth labeled data.
- It selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel.
- The agent uses self-validation, self-consistency, and pairwise self-preference to choose the best harness update.
- A single RHO optimization round improved the pass rate on SWE-Bench Pro from 59% to 78% with no external grading.
- RHO was evaluated across three domains: software engineering, technical work, and knowledge work.
- Analysis shows RHO targets prior failure modes and sustains higher accuracy during long-horizon sessions.
- Authors are Wenbo Pan, Shujie Liu, and Chin-Yew Lin.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.