SWE-Explore benchmark isolates repository exploration in coding agents
Shaoqiu Zhang, Yuhang Wang, and Jialiang Liang introduce SWE-Explore, a benchmark that evaluates coding agents specifically on repository exploration. It covers 848 issues across 10 programming languages and 203 open-source repositories.
SWE-Explore provides a fine-grained diagnostic lens on coding agent capabilities that binary benchmarks like SWE-bench cannot offer, enabling targeted measurement of where exploration quality breaks down before the repair stage.
Shaoqiu Zhang, Yuhang Wang, and Jialiang Liang present SWE-Explore, a benchmark that addresses a gap in how coding agents are evaluated. Existing repository-level benchmarks like SWE-bench measure task success as a single binary outcome (resolved or unresolved) and therefore do not capture fine-grained capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. SWE-Explore isolates one of these capabilities, repository exploration: given a repository and an issue, the agent must return a ranked list of relevant code regions, subject to a fixed line budget.
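To make the task format concrete, here is a minimal sketch of what "a ranked list of code regions under a fixed line budget" could look like in practice. The `Region` type, the `apply_line_budget` helper, and the choice to truncate an overflowing region (rather than drop it) are all assumptions made for illustration; the summary does not specify how the benchmark enforces its budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous span of lines in one file (1-indexed, inclusive).

    Hypothetical type for illustration; not taken from the benchmark.
    """
    path: str
    start: int
    end: int

    def size(self) -> int:
        return self.end - self.start + 1


def apply_line_budget(ranked: list[Region], budget: int) -> list[Region]:
    """Keep regions in rank order until the line budget is spent.

    Assumption: a region that would overflow the budget is truncated,
    not dropped. Overlapping regions are not merged in this sketch.
    """
    kept: list[Region] = []
    remaining = budget
    for region in ranked:
        if remaining <= 0:
            break
        take = min(region.size(), remaining)
        kept.append(Region(region.path, region.start, region.start + take - 1))
        remaining -= take
    return kept
```

With a budget of 25 lines, a ranked list whose first two regions span 10 and 30 lines would keep the first region whole and only the first 15 lines of the second, so placing the most relevant regions first matters.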
The benchmark comprises 848 issues drawn from 203 open-source repositories spanning 10 programming languages. Ground truth is constructed at the line level by distilling the specific code regions consulted along the solution paths of independent agent trajectories that successfully resolved the same issues. Agents are then evaluated along three dimensions: coverage, ranking, and context-efficiency. The authors report that these metrics strongly track downstream repair behavior, validating exploration quality as a meaningful predictor of end-to-end coding performance.
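The three evaluation dimensions can be illustrated with simple stand-in formulas. The sketch below is not the paper's metric definitions, which this summary does not give: it assumes coverage is recall over ground-truth lines, ranking is the reciprocal rank of the first region that touches a ground-truth line, and context-efficiency is the fraction of returned lines that are ground truth. All function and type names are hypothetical.

```python
Line = tuple[str, int]        # (file path, line number)
Span = tuple[str, int, int]   # (file path, start, end), 1-indexed, inclusive


def lines_of(span: Span) -> set[Line]:
    """Expand a span into the set of individual lines it covers."""
    path, start, end = span
    return {(path, n) for n in range(start, end + 1)}


def returned_lines(ranked: list[Span]) -> set[Line]:
    """All distinct lines returned across the ranked spans."""
    return set().union(*(lines_of(s) for s in ranked))


def coverage(ranked: list[Span], gold: set[Line]) -> float:
    """Assumed definition: share of ground-truth lines that were returned."""
    if not gold:
        return 0.0
    return len(returned_lines(ranked) & gold) / len(gold)


def context_efficiency(ranked: list[Span], gold: set[Line]) -> float:
    """Assumed definition: share of returned lines that are ground truth."""
    returned = returned_lines(ranked)
    if not returned:
        return 0.0
    return len(returned & gold) / len(returned)


def first_hit_reciprocal_rank(ranked: list[Span], gold: set[Line]) -> float:
    """Assumed ranking proxy: 1 / rank of the first span touching ground truth."""
    for rank, span in enumerate(ranked, start=1):
        if lines_of(span) & gold:
            return 1.0 / rank
    return 0.0
```

Under these assumed definitions, an explorer can score well on one axis and poorly on another: returning whole files raises coverage but lowers context-efficiency, which is one way to read the claim that the dimensions capture different aspects of exploration quality.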
Across a broad comparison of classical retrieval methods, general coding agents, and specialized localizers, the study finds that agentic explorers occupy a clear tier above classical retrieval. File-level localization is already strong for modern methods, but line-level coverage and efficient ranking emerge as the key axes that differentiate state-of-the-art explorers from one another.
Key facts
- SWE-Explore is a new benchmark that isolates repository exploration as a standalone evaluation target for coding agents.
- It covers 848 issues across 203 open-source repositories and 10 programming languages.
- Agents must return a ranked list of relevant code regions under a fixed line budget.
- Line-level ground truth is derived from trajectories of agents that successfully solved each issue.
- Evaluation uses three dimensions: coverage, ranking, and context-efficiency.
- These metrics are shown to strongly track downstream repair behavior.
- Agentic explorers form a clear tier above classical retrieval; line-level coverage and efficient ranking are the key differentiators among top methods.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.