Jun 12, 2026 · 1 min read · Research Papers

Dialogue SWE-Bench tests coding agents on interactive problem-solving

Brendan King and Jeffrey Flanigan introduce Dialogue SWE-Bench, a benchmark that evaluates coding agents on resolving real-world software engineering problems through dialogue with a user, rather than as fully autonomous systems.

arXiv · Brendan King, Jeffrey Flanigan

Composite score: 6.1 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The benchmark reveals that dialogue capability is a distinct dimension of coding agent performance not captured by existing autonomous-system evaluations, exposing a gap between how agents are benchmarked and how they are actually used.

Summary: our read of the original

Brendan King and Jeffrey Flanigan observe that while AI coding agents are widely deployed as interactive assistants, existing benchmarks — including the original SWE-Bench — evaluate them as fully autonomous systems, ignoring the dialogue dimension of real-world use. To close this gap, they introduce Dialogue SWE-Bench, an automatic benchmark dataset that measures how well coding agents can resolve real-world software engineering problems through back-and-forth conversation with a user.

The benchmark is supported by a novel persona-grounded user simulator designed specifically for task evaluation, and it augments standard task metrics with automatic evaluations of dialogue quality. The authors also propose a schema-guided agent aimed at improving the dialogue capabilities of off-the-shelf coding agents; this approach improves over strong baselines by 3–14%. Their results reveal that better coding models do not always correspond to better dialogue models, indicating that dialogue capability is a distinct and currently understudied dimension of coding agent performance.
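To make the evaluation setup above concrete, here is a minimal sketch of a dialogue-driven evaluation loop. It is an illustration under stated assumptions, not the paper's implementation: `Persona`, `SimulatedUser`, `ToyAgent`, `run_dialogue`, and `DialogueResult` are hypothetical names, the simulated user is a simple scripted stand-in for a persona-grounded simulator, and the "resolved" check (did the agent collect every required detail before submitting?) stands in for running a repository's tests.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Persona:
    """Hypothetical persona: who the user is and what they know but have not said."""
    name: str
    expertise: str
    hidden_details: List[str]


@dataclass
class DialogueResult:
    resolved: bool        # task metric (stand-in for passing tests)
    turns: int            # agent turns taken, including the final submission
    questions_asked: int  # crude dialogue-quality signal


class SimulatedUser:
    """Scripted stand-in: reveals one hidden detail per question asked."""

    def __init__(self, persona: Persona) -> None:
        self._remaining = list(persona.hidden_details)

    def reply(self, agent_message: str) -> str:
        if agent_message.strip().endswith("?") and self._remaining:
            return self._remaining.pop(0)
        return "I have nothing more to add."


class ToyAgent:
    """Toy agent: asks a fixed number of questions, then submits a patch."""

    def __init__(self, question_budget: int) -> None:
        self.budget = question_budget
        self.known: Set[str] = set()
        self._seen_issue = False

    def act(self, user_message: str) -> str:
        if self._seen_issue:
            self.known.add(user_message)  # remember what the user said
        self._seen_issue = True
        if self.budget > 0:
            self.budget -= 1
            return "Can you tell me more about the failure?"
        return "PATCH"


def run_dialogue(agent: ToyAgent, user: SimulatedUser, issue: str,
                 required: Set[str], max_turns: int = 10) -> DialogueResult:
    """Alternate agent and user turns until the agent submits or turns run out."""
    message = issue
    questions = 0
    for turn in range(1, max_turns + 1):
        action = agent.act(message)
        if action == "PATCH":
            # Assumption: the fix succeeds only if every required detail was gathered.
            return DialogueResult(required.issubset(agent.known), turn, questions)
        questions += 1
        message = user.reply(action)
    return DialogueResult(False, max_turns, questions)
```

In this toy setup, an agent that submits immediately fails the task, while one that asks enough questions succeeds. Reporting the question count alongside the task outcome mirrors the idea of pairing task metrics with dialogue-quality measures, though the paper's actual metrics are not specified here.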

Key facts

01 Dialogue SWE-Bench is a new automatic benchmark for evaluating coding agents on dialogue-driven software engineering tasks.
02 Existing benchmarks evaluate coding agents as fully autonomous systems, despite their interactive real-world use.
03 The benchmark uses a persona-grounded user simulator to support task evaluation.
04 Automatic evaluations of dialogue quality are included alongside task-completion metrics.
05 A proposed schema-guided agent improves over strong baselines by 3–14%.
06 Better coding models do not always correspond to better dialogue models, per the paper's results.
07 The paper is authored by Brendan King and Jeffrey Flanigan.

Topics

#benchmarks #agent-framework #coding-assistant #multi-agent #tool-use

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 15, 2026 · 11:57 UTC.
