Jun 4, 2026 · 1 min read · Research Papers

RHO lets AI agents self-optimize without labeled data

Researchers introduce Retrospective Harness Optimization (RHO), a self-supervised method that improves AI agent harnesses using only past trajectories, boosting SWE-Bench Pro pass rates from 59% to 78% in a single optimization round.

arXiv · Wenbo Pan, Shujie Liu, Chin-Yew Lin


Composite score: 6.8 out of 10

Weights applied: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%

Why it matters

RHO demonstrates that AI agents can meaningfully self-improve their harness without any labeled validation data, removing a key bottleneck for deploying and continuously optimizing agents in practical settings.


Summary: our read of the original

Wenbo Pan, Shujie Liu, and Chin-Yew Lin present Retrospective Harness Optimization (RHO), a self-supervised method designed to continuously improve the skills, tools, and workflows — collectively called the "harness" — that AI agents rely on to solve complex problems. The core challenge RHO addresses is that existing optimization methods typically require ground-truth validation sets, which are difficult to obtain in real-world deployment. RHO sidesteps this requirement by operating entirely on past trajectories: it selects a diverse coreset of challenging tasks from prior runs, re-solves them in parallel, and then uses self-validation and self-consistency to analyze the resulting rollouts. The agent then generates candidate harness updates and selects the most effective one through pairwise self-preference — no external grader required.
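The summary does not say how the "diverse coreset of challenging tasks" is chosen, so the following is only a minimal sketch of one plausible approach: filter past trajectories by a difficulty score, then pick tasks greedily by farthest-point distance in an embedding space. The `Trajectory` fields, the `difficulty` score, the `min_difficulty` threshold, and the use of Euclidean distance are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Trajectory:
    """One past run. All fields are hypothetical, not the paper's schema."""
    task_id: str
    embedding: tuple[float, ...]  # assumed task embedding
    difficulty: float             # assumed score in [0, 1]; higher means harder


def select_coreset(trajectories: list[Trajectory], k: int,
                   min_difficulty: float = 0.5) -> list[Trajectory]:
    """Pick up to k hard tasks that are spread out in embedding space."""
    hard = [t for t in trajectories if t.difficulty >= min_difficulty]
    if not hard or k <= 0:
        return []
    # Seed with the hardest task, then repeatedly add the task whose
    # nearest already-chosen neighbour is farthest away (greedy k-center).
    chosen = [max(hard, key=lambda t: t.difficulty)]
    remaining = [t for t in hard if t is not chosen[0]]
    while remaining and len(chosen) < k:
        nxt = max(
            remaining,
            key=lambda t: min(dist(t.embedding, c.embedding) for c in chosen),
        )
        chosen.append(nxt)
        remaining.remove(nxt)
    return chosen


past_runs = [
    Trajectory("a", (0.0, 0.0), 0.9),
    Trajectory("b", (0.1, 0.0), 0.8),  # hard, but nearly identical to "a"
    Trajectory("c", (5.0, 5.0), 0.6),
    Trajectory("d", (9.0, 9.0), 0.1),  # too easy, filtered out
]
picked = [t.task_id for t in select_coreset(past_runs, k=2)]
# picked == ["a", "c"]
```

The greedy rule skips task "b" because it sits next to the already chosen "a", which is the kind of redundancy a diversity criterion is meant to avoid.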

The method is evaluated across three diverse domains spanning software engineering, technical work, and knowledge work. The headline result is a single optimization round lifting the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Beyond raw performance, the paper's analysis shows that RHO specifically targets prior failure modes, alters the agent's behavior patterns, and sustains higher accuracy during long-horizon sessions — suggesting the improvements are structural rather than incidental.
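To put the headline numbers in perspective, the short calculation below restates the reported 59% to 78% change as an absolute gain, a relative gain, and a reduction in the failure rate. Only the two pass rates come from the article; the function name `pass_rate_gain` and the three derived metrics are illustrative choices.

```python
def pass_rate_gain(before: float, after: float) -> dict[str, float]:
    """Summarize a pass-rate change; inputs are percentages strictly between 0 and 100."""
    delta = after - before
    return {
        "absolute_points": delta,                          # percentage points
        "relative_gain_pct": 100 * delta / before,         # gain relative to the old pass rate
        "failure_reduction_pct": 100 * delta / (100 - before),  # share of old failures removed
    }


gain = pass_rate_gain(59, 78)
# absolute_points = 19, relative_gain_pct ≈ 32.2, failure_reduction_pct ≈ 46.3
```

Read this way, the reported round adds 19 percentage points, which corresponds to roughly a third more tasks passing and a little under half of the previous failures being resolved.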

Key facts

01 RHO is a self-supervised method that optimizes an agent's harness using only past trajectories, requiring no ground-truth labeled data.
02 It selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel.
03 The agent uses self-validation, self-consistency, and pairwise self-preference to choose the best harness update.
04 A single RHO optimization round improved the pass rate on SWE-Bench Pro from 59% to 78% with no external grading.
05 RHO was evaluated across three domains: software engineering, technical work, and knowledge work.
06 Analysis shows RHO targets prior failure modes and sustains higher accuracy during long-horizon sessions.
07 Authors are Wenbo Pan, Shujie Liu, and Chin-Yew Lin.
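The self-consistency and pairwise self-preference signals named in the key facts can be illustrated with a small sketch. It assumes self-consistency is a majority vote over repeated rollouts and that candidate harness updates are compared in a round-robin, with the most-preferred candidate winning. The article does not describe the actual procedure, so both functions, and especially the `prefers` callback (a stand-in for the agent's own judgment), are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of rollouts that agree with it."""
    if not answers:
        raise ValueError("need at least one rollout answer")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


def select_by_pairwise_preference(
    candidates: Sequence[str],
    prefers: Callable[[str, str], bool],
) -> str:
    """Round-robin comparison; the candidate with the most pairwise wins is returned.

    prefers(a, b) is a hypothetical judge that returns True when a beats b.
    Ties in win count are broken in favour of the earlier candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate update")
    wins = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            wins[a if prefers(a, b) else b] += 1
    return max(candidates, key=lambda c: wins[c])


answer, agreement = self_consistency(["42", "42", "41"])

# Toy judge for demonstration only: prefer the longer description.
updates = ["short", "a much longer update", "mid length"]
best = select_by_pairwise_preference(updates, lambda a, b: len(a) > len(b))
# best == "a much longer update"
```

In a real system the toy length-based judge would be replaced by a model call that compares two candidate updates; a round-robin costs a number of comparisons that grows quadratically with the candidate count, which is acceptable only for small candidate sets.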

Topics

#agent-framework #reasoning #benchmarks #self-supervised-learning #llm-optimization

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.
