Test-time scaling framework boosts agentic coding benchmarks
A new paper by Joongwon Kim and 15 co-authors proposes a test-time compute scaling framework for long-horizon coding agents that converts rollout trajectories into compact structured summaries, lifting Claude-4.5-Opus from 70.9% to 77.6% on SWE-Bench Verified and from 46.9% to 59.1% on Terminal-Bench v2.0.
Teams building or evaluating agentic coding systems can apply trajectory summarization with RTV and PDR at inference time to raise benchmark scores without retraining models.
Joongwon Kim and 15 co-authors present a test-time compute scaling framework specifically designed for long-horizon coding agents, addressing a gap left by existing test-time scaling methods. Those methods assume short, bounded outputs that can be directly compared or ranked — an assumption that breaks when an agent produces an extended trajectory of actions, observations, errors, and partial progress. The paper's core insight is that the bottleneck is not generating more attempts, but representing prior experience in a form that can be effectively selected from and reused.
The framework converts each rollout into a compact structured summary that retains salient hypotheses, progress indicators, and failure modes while discarding low-signal trace details. Built on this representation, two complementary scaling strategies are introduced: Recursive Tournament Voting (RTV) for parallel scaling, which recursively narrows a population of rollout summaries through small-group comparisons; and an agentic adaptation of Parallel-Distill-Refine (PDR) for sequential scaling, which conditions new rollouts on summaries distilled from prior attempts. Evaluated on SWE-Bench Verified and Terminal-Bench v2.0, the method consistently improves frontier coding agents — most notably lifting Claude-4.5-Opus from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). The paper spans 70 pages, 26 figures, and 12 tables, and is submitted under Computer Science — Software Engineering on arXiv.
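The parallel strategy described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `RolloutSummary` fields, the `judge` callback, and the default `group_size` are assumptions made for illustration, and the toy judge stands in for whatever comparison procedure the authors actually use.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class RolloutSummary:
    """Hypothetical compact summary of one agent rollout (field names are assumed)."""
    rollout_id: int
    hypotheses: str
    progress: str
    failure_modes: str


# Assumed interface: a judge sees a small group of summaries and
# returns the index (within that group) of the one it prefers.
Judge = Callable[[Sequence[RolloutSummary]], int]


def recursive_tournament_vote(
    summaries: List[RolloutSummary], judge: Judge, group_size: int = 4
) -> RolloutSummary:
    """Narrow a population of summaries to one winner via small-group comparisons."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if len(summaries) == 1:
        return summaries[0]

    winners = []
    for start in range(0, len(summaries), group_size):
        group = summaries[start:start + group_size]
        # A leftover group of one advances without a comparison.
        winners.append(group[0] if len(group) == 1 else group[judge(group)])

    # Each round shrinks the pool, so the recursion terminates.
    return recursive_tournament_vote(winners, judge, group_size)


def toy_judge(group: Sequence[RolloutSummary]) -> int:
    """Stand-in judge: prefers the summary with the longest progress note."""
    return max(range(len(group)), key=lambda i: len(group[i].progress))


population = [
    RolloutSummary(i, hypotheses="h", progress="x" * i, failure_modes="f")
    for i in range(10)
]
best = recursive_tournament_vote(population, toy_judge, group_size=4)
```

Because the judge only ever sees `group_size` summaries at a time, each comparison stays short even when the population of rollouts is large, which is the property the summary representation is meant to enable.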
Key facts
- 01 The paper is authored by Joongwon Kim and 15 co-authors and submitted to arXiv on April 16, 2026 (arXiv:2604.16529).
- 02 Existing test-time scaling methods fail for long-horizon agents because each attempt produces an extended trajectory, not a short comparable output.
- 03 The framework converts rollouts into compact structured summaries preserving hypotheses, progress, and failure modes.
- 04 Recursive Tournament Voting (RTV) enables parallel scaling by recursively narrowing rollout summaries through small-group comparisons.
- 05 Parallel-Distill-Refine (PDR) is adapted for sequential scaling by conditioning new rollouts on summaries from prior attempts.
- 06 Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) using the framework.
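The sequential side (conditioning new rollouts on summaries of earlier ones) can be sketched as a simple loop. This is an illustrative sketch only: `run_agent`, `summarize`, and the `rounds`, `width`, and `keep` parameters are hypothetical names, and the paper's actual PDR adaptation may select and carry forward summaries differently.

```python
from typing import Callable, List

# Assumed interfaces, not taken from the paper:
RunAgent = Callable[[str, List[str]], str]  # (task, prior summaries) -> raw trajectory
Summarize = Callable[[str], str]            # raw trajectory -> compact summary


def sequential_refine(
    task: str,
    run_agent: RunAgent,
    summarize: Summarize,
    rounds: int = 3,
    width: int = 2,
    keep: int = 4,
) -> List[str]:
    """Run several rounds of rollouts, each conditioned on distilled prior attempts."""
    memory: List[str] = []
    for _ in range(rounds):
        # Each rollout in this round sees the same snapshot of prior summaries.
        trajectories = [run_agent(task, list(memory)) for _ in range(width)]
        # Distill long trajectories into compact summaries before reuse.
        new_summaries = [summarize(t) for t in trajectories]
        # Keep only the most recent `keep` summaries to bound the context size.
        memory = (memory + new_summaries)[-keep:]
    return memory


# Toy stand-ins so the sketch runs without a real agent or model.
seen_counts: List[int] = []


def fake_agent(task: str, prior: List[str]) -> str:
    seen_counts.append(len(prior))
    return f"{task}: attempt after {len(prior)} prior summaries, with long trace ..."


def fake_summarize(trajectory: str) -> str:
    return trajectory.split(",")[0]


final_memory = sequential_refine("fix failing test", fake_agent, fake_summarize)
```

In this toy run the first round sees no prior summaries and later rounds see the distilled results of earlier ones, while the `keep` cap prevents the conditioning context from growing without bound.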