LoopCoder-v2 finds that two loops are the sweet spot for Parallel Loop Transformers
Jian Yang, Shawn Guo, and Wei Zhang introduce LoopCoder-v2, a family of 7B parallel loop Transformer coders trained on 18T tokens, finding that exactly two loops maximizes performance — boosting SWE-bench Verified from 43.0 to 64.4 points — while three or more loops cause regression.
Score breakdown
The paper establishes that PLT performance saturates at exactly two loops and provides a gain–cost diagnostic framework explaining why, giving practitioners a principled basis for loop-count selection rather than relying on monotonic scaling assumptions.
Jian Yang, Shawn Guo, and Wei Zhang study how loop count affects Parallel Loop Transformers (PLT), an architecture that scales latent computation by repeatedly applying shared blocks while mitigating the latency and KV-cache memory costs of sequential looping through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention. To investigate this, they train LoopCoder-v2 — a family of 7B PLT coders with varying loop counts — from scratch on 18T tokens, followed by matched instruction tuning and evaluation across code generation, code reasoning, agentic software engineering, and tool-use benchmarks.
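The core looping idea, repeatedly applying the same block so that extra computation adds no extra parameters, can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' architecture: the residual `tanh` block, the names `make_shared_block` and `looped_forward`, and all numbers are hypothetical, the loops run sequentially, and neither CLP nor shared-KV gated sliding-window attention is modeled.

```python
import math


def make_shared_block(weights, bias):
    """Build a toy residual block h -> h + tanh(W h + b).

    Hypothetical stand-in for a Transformer block; the weights are
    created once and reused by every loop.
    """
    def block(h):
        out = []
        for i, row in enumerate(weights):
            pre = sum(w * x for w, x in zip(row, h)) + bias[i]
            out.append(h[i] + math.tanh(pre))
        return out
    return block


def looped_forward(h, block, n_loops):
    """Apply the same block n_loops times and record each loop's output."""
    states = []
    for _ in range(n_loops):
        h = block(h)
        states.append(h)
    return states


# Illustrative values only (not taken from the paper).
weights = [[0.5, -0.2], [0.1, 0.3]]
bias = [0.0, 0.1]
block = make_shared_block(weights, bias)
states = looped_forward([1.0, -1.0], block, n_loops=3)
```

Because `block` closes over a single set of weights, raising `n_loops` increases compute but leaves the parameter count unchanged, which is the property that makes loop count a separate scaling knob.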
The central finding is a strongly non-monotonic loop-count effect: the two-loop variant improves SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points over the non-looped baseline, but variants with three or more loops regress. The authors' diagnostics explain this saturation: loop 2 provides the main productive representational refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced positional mismatch cost remains roughly fixed even as refinement gains shrink, the offset cost increasingly dominates beyond two loops, causing performance to decline. This gain–cost framework provides a principled basis for loop-count selection in PLT architectures.
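The gain–cost argument can be made concrete with a small toy model. This is only a sketch under assumptions that are not in the source: the refinement gain of each extra loop is assumed to decay geometrically, the offset cost per extra loop is assumed constant, and every number below is invented for illustration rather than taken from the paper.

```python
def net_benefit(n_loops, first_gain, decay, offset_cost):
    """Toy net benefit of n_loops relative to the non-looped model.

    Loop k (k >= 2) is assumed to add first_gain * decay**(k - 2) of
    refinement gain and to pay a fixed offset_cost.
    """
    total = 0.0
    for k in range(2, n_loops + 1):
        gain = first_gain * decay ** (k - 2)
        total += gain - offset_cost
    return total


def best_loop_count(max_loops, first_gain, decay, offset_cost):
    """Return the loop count in 1..max_loops with the highest toy net benefit."""
    return max(
        range(1, max_loops + 1),
        key=lambda n: net_benefit(n, first_gain, decay, offset_cost),
    )


# Hypothetical parameters chosen so that gains shrink quickly.
scores = {
    n: net_benefit(n, first_gain=10.0, decay=0.3, offset_cost=4.0)
    for n in range(1, 6)
}
best = best_loop_count(5, first_gain=10.0, decay=0.3, offset_cost=4.0)
```

With these made-up parameters, the second loop's gain exceeds the fixed cost, while the third loop's smaller gain does not, so the toy net benefit peaks at two loops and declines afterward. The pattern depends entirely on the assumed decay and cost values.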
Key facts
- 01 LoopCoder-v2 is a family of 7B Parallel Loop Transformer (PLT) coding models trained from scratch on 18T tokens.
- 02 The two-loop variant improves SWE-bench Verified from 43.0 to 64.4 points over the non-looped baseline.
- 03 Multi-SWE score improves from 14.0 to 31.0 points with the two-loop variant.
- 04 Models with three or more loops regress, revealing a strongly non-monotonic loop-count effect.
- 05 PLT uses cross-loop position offsets (CLP) and shared-KV gated sliding-window attention to reduce latency and KV-cache memory costs.
- 06 Diagnostics show loop 2 provides the main productive refinement; later loops yield diminishing, oscillatory updates and reduced representational diversity.
- 07 The CLP-induced positional mismatch cost remains roughly fixed as refinement gains shrink, causing the offset cost to increasingly dominate beyond two loops.
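One plausible way to check for "diminishing, oscillatory updates" is to compare the hidden state after each loop with the one before it. The sketch below is an assumption about how such a diagnostic could look, not the authors' method: `loop_update_diagnostics` is a hypothetical helper, and the hidden states are made-up two-dimensional vectors.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _cosine(a, b):
    na, nb = _norm(a), _norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0  # direction is undefined for a zero update
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def loop_update_diagnostics(states):
    """Return (update norms, cosines between consecutive updates).

    states[i] is the hidden state after loop i, with states[0] the input.
    Shrinking norms suggest diminishing updates; a negative cosine means
    a loop partly reverses the previous loop's update (oscillation).
    """
    updates = [_sub(states[i + 1], states[i]) for i in range(len(states) - 1)]
    norms = [_norm(u) for u in updates]
    cosines = [_cosine(updates[i], updates[i + 1]) for i in range(len(updates) - 1)]
    return norms, cosines


# Made-up states: input, then the state after loops 1, 2, and 3.
states = [[0.0, 0.0], [1.0, 1.0], [1.2, 0.9], [1.1, 1.0]]
norms, cosines = loop_update_diagnostics(states)
```

In this invented example the update norms shrink from loop to loop, and the cosine between the last two updates is negative, which is the kind of signature the summary describes for later loops.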
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.