Parallel domain-scoped agents outperform linear repo exploration for multi-file bug localization
A paper by Akeela Darryl Fattha, Kia Ying Chua, and Lingxiao Jiang argues that LLM agents exploring code repositories linearly are structurally mismatched for multi-file changes, and shows that domain-scoped parallel agent spawning achieves the highest micro F1 among Haiku-class models on a SWE-Bench Pro–based benchmark.
The paper demonstrates that replacing linear repository traversal with domain-scoped parallel agent spawning improves multi-file change localization for a small model, while also identifying that naive filesystem access and forced multi-agent consultation can actively harm performance or inflate costs.
Akeela Darryl Fattha, Kia Ying Chua, and Lingxiao Jiang argue that the dominant pattern of linear, sequential repository traversal — visiting one directory or file per step — is structurally ill-suited to software issues that span multiple subsystems. To test this hypothesis, they construct a persistent-session evaluation framework anchored at a single base commit using GitHub issues from the Ansible project, drawn from SWE-Bench Pro. They compare four systems: a base LLM with no direct repository access, a single-agent Recursive Language Model (RLM) baseline with a persistent Python REPL, an external CLI baseline using Codex 5.5 High, and their own domain-scoped parallel agent spawning approach.
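The fan-out idea can be illustrated with a minimal sketch. The summary does not specify how the paper defines a domain or how its agents are implemented, so everything here is an assumption made for illustration: a "domain" is taken to be a top-level directory, and `domain_agent` is a hypothetical keyword-matching stand-in for an LLM agent. The point is only the shape of the approach: partition the repository, run one scoped worker per partition concurrently, then merge the candidate files.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_domain(paths):
    """Partition file paths by top-level directory.

    Assumption: "domain" == top-level directory. The paper may scope
    domains differently.
    """
    domains = defaultdict(list)
    for path in paths:
        domain = path.split("/", 1)[0] if "/" in path else "."
        domains[domain].append(path)
    return dict(domains)


def domain_agent(domain, paths, issue_text):
    """Hypothetical stand-in for an LLM agent scoped to one domain.

    Flags a file when its name stem appears as a word in the issue text.
    A real agent would read and reason about the files instead.
    """
    words = set(issue_text.lower().replace(",", " ").split())
    hits = []
    for path in paths:
        stem = path.rsplit("/", 1)[-1].split(".", 1)[0].lower()
        if stem in words:
            hits.append(path)
    return hits


def localize(paths, issue_text, max_workers=4):
    """Spawn one scoped worker per domain in parallel and merge results."""
    domains = group_by_domain(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(domain_agent, name, files, issue_text)
            for name, files in domains.items()
        ]
        predicted = set()
        for future in futures:
            predicted.update(future.result())
    return sorted(predicted)
```

Because each worker sees only its own partition, a change that touches both library code and documentation can surface from two workers at once rather than depending on a single traversal reaching both directories.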
On their expanded benchmark — which includes more recent PRs from 2025 and 2026 — the domain-agent system using a small Haiku-class model achieves the highest micro F1 among Haiku-class models by a large margin, finishing second overall behind only the much larger Codex 5.5 High. On the original curated 2020 SWE-Bench Pro benchmark, a larger Sonnet plain LLM baseline attains a higher micro F1 by predicting fewer files, yielding higher precision but at significantly lower all-gold recall.
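The trade-off between micro F1 and all-gold recall can be made concrete with a small sketch. The definitions below are assumptions, since the summary does not spell them out: micro-averaging pools true positives, false positives, and false negatives over all issues before computing precision and recall, and "all-gold recall" is taken to mean the fraction of issues for which every gold file was predicted. The file names in the usage note are invented.

```python
def micro_prf(predictions, gold):
    """Micro-averaged precision, recall, and F1 over per-issue file sets."""
    tp = fp = fn = 0
    for pred, truth in zip(predictions, gold):
        pred, truth = set(pred), set(truth)
        tp += len(pred & truth)   # correctly predicted files
        fp += len(pred - truth)   # predicted but not in the gold set
        fn += len(truth - pred)   # gold files that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def all_gold_recall(predictions, gold):
    """Fraction of issues whose gold files are all predicted.

    Assumed reading of "all-gold recall"; the paper may define it differently.
    """
    if not gold:
        return 0.0
    covered = sum(
        1 for pred, truth in zip(predictions, gold) if set(truth) <= set(pred)
    )
    return covered / len(gold)
```

With gold sets `[{"a.py", "b.py"}, {"c.py"}]`, a conservative predictor returning `[{"a.py"}, {"c.py"}]` has perfect precision and a micro F1 of 0.8, but covers all gold files for only half of the issues. A broader predictor returning `[{"a.py", "b.py", "x.py", "y.py"}, {"c.py", "z.py"}]` covers every issue fully yet drops to a micro F1 of about 0.67. This is the same pattern as the reported result: predicting fewer files can raise micro F1 while lowering all-gold recall.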
Beyond the primary comparison, the paper reports three additional findings. First, documentation evolution represents a latent dependency that none of the evaluated approaches successfully resolves. Second, naive filesystem access can actually degrade localization quality by causing over-prediction of test files. Third, forcing multi-agent consultation does not produce measurable localization improvements and substantially increases token costs.
Key facts
- 01 Most LLM agents explore repositories linearly (one directory or file per step), which the authors argue is a structural mismatch for multi-file changes spanning subsystems.
- 02 The study uses SWE-Bench Pro as its initial benchmark, with Ansible as the primary exemplar, evaluated via persistent-session GitHub issue resolution anchored at a single base commit.
- 03 Domain-scoped parallel agent spawning with a Haiku-class model achieves the highest micro F1 among Haiku-class models by a large margin.
- 04 On an expanded benchmark including PRs from 2025 and 2026, the domain-agent system ranks second overall, behind only Codex 5.5 High.
- 05 On the original 2020 SWE-Bench Pro benchmark, a larger Sonnet plain LLM baseline achieves higher micro F1 via higher precision, but at significantly lower all-gold recall.
- 06 Naive filesystem access can degrade localization by causing over-prediction of test files.
- 07 Forced multi-agent consultation does not measurably improve localization and substantially raises token costs.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.