Dialogue SWE-Bench tests coding agents on interactive problem-solving
Brendan King and Jeffrey Flanigan introduce Dialogue SWE-Bench, a benchmark that evaluates coding agents on resolving real-world software engineering problems through dialogue with a user, rather than as fully autonomous systems.
The benchmark reveals that dialogue capability is a distinct dimension of coding agent performance not captured by existing autonomous-system evaluations, exposing a gap between how agents are benchmarked and how they are actually used.
Brendan King and Jeffrey Flanigan observe that while AI coding agents are widely deployed as interactive assistants, existing benchmarks, including the original SWE-Bench, evaluate them as fully autonomous systems and ignore the dialogue dimension of real-world use. To close this gap, they introduce Dialogue SWE-Bench, an automatic benchmark dataset that measures how well coding agents resolve real-world software engineering problems through back-and-forth conversation with a user.
The benchmark is supported by a novel persona-grounded user simulator designed specifically for task evaluation, and it augments standard task metrics with automatic evaluations of dialogue quality. The authors also propose a schema-guided agent aimed at improving the dialogue capabilities of off-the-shelf coding agents; this approach improves over strong baselines by 3–14%. Their results reveal that better coding models do not always correspond to better dialogue models, indicating that dialogue capability is a distinct and currently understudied dimension of coding agent performance.
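To make the setup concrete, here is a minimal sketch of what a dialogue-driven evaluation episode could look like. It is an illustration only, not the authors' implementation: `Persona`, `simulated_user`, `run_episode`, and `toy_agent` are hypothetical names, the keyword lookup stands in for an LLM-backed simulator, and the turn count stands in for the richer dialogue-quality metrics the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable

Transcript = list[tuple[str, str]]  # (speaker, text) pairs


@dataclass
class Persona:
    """Hypothetical persona: details the simulated user reveals only when asked."""
    name: str
    known_facts: dict[str, str]  # topic keyword -> detail


@dataclass
class DialogueResult:
    resolved: bool  # task metric: did the submitted patch pass the check?
    turns: int      # crude dialogue metric: agent turns used
    transcript: Transcript = field(default_factory=list)


def simulated_user(persona: Persona, question: str) -> str:
    """Toy stand-in for a persona-grounded simulator (assumption: keyword match)."""
    for topic, detail in persona.known_facts.items():
        if topic in question.lower():
            return detail
    return "I'm not sure."


def run_episode(
    agent_step: Callable[[Transcript], tuple[str, str]],
    persona: Persona,
    initial_request: str,
    check_patch: Callable[[str], bool],
    max_turns: int = 5,
) -> DialogueResult:
    """Alternate agent and user turns until the agent submits a patch."""
    transcript: Transcript = [("user", initial_request)]
    for turn in range(1, max_turns + 1):
        kind, text = agent_step(transcript)  # kind is "ask" or "patch"
        transcript.append(("agent", text))
        if kind == "patch":
            return DialogueResult(check_patch(text), turn, transcript)
        transcript.append(("user", simulated_user(persona, text)))
    return DialogueResult(False, max_turns, transcript)


def toy_agent(transcript: Transcript) -> tuple[str, str]:
    """Hypothetical agent: ask one clarifying question, then submit a patch."""
    user_msgs = [text for role, text in transcript if role == "user"]
    if len(user_msgs) == 1:
        return ("ask", "Which version are you running?")
    return ("patch", f"fix for {user_msgs[-1]}")


persona = Persona("terse maintainer", {"version": "2.1"})
result = run_episode(
    toy_agent, persona, "The parser crashes.", lambda patch: "2.1" in patch
)
```

In this toy setup the agent can only resolve the task by asking for a detail the user did not volunteer, which is the kind of behaviour a fully autonomous evaluation would never exercise.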
Key facts
- 01 Dialogue SWE-Bench is a new automatic benchmark for evaluating coding agents on dialogue-driven software engineering tasks.
- 02 Existing benchmarks evaluate coding agents as fully autonomous systems, despite their interactive real-world use.
- 03 The benchmark uses a persona-grounded user simulator to support task evaluation.
- 04 Automatic evaluations of dialogue quality are included alongside task-completion metrics.
- 05 A proposed schema-guided agent improves over strong baselines by 3–14%.
- 06 Better coding models do not always correspond to better dialogue models, per the paper's results.
- 07 The paper is authored by Brendan King and Jeffrey Flanigan.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 15, 2026 · 11:57 UTC.