Apr 22, 2026 · 1 min read · Research Papers

SWE-chat dataset reveals how developers really use AI coding agents

Researchers Joachim Baumann, Vishakh Padmakumar, and Xiang Li introduce SWE-chat, the first large-scale dataset of real-world coding agent sessions, containing 6,000 sessions, 63,000+ user prompts, and 355,000 agent tool calls collected from open-source developers.

arXiv · Joachim Baumann, Vishakh Padmakumar, Xiang Li

Composite score: 7.8 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Practitioners and researchers evaluating AI coding agents can use SWE-chat's real-world interaction traces to benchmark agent reliability, study failure modes, and design interventions that address the security and code-survival gaps that curated benchmarks miss.

Summary: our read of the original

Baumann, Padmakumar, and Li introduce SWE-chat, described as the first large-scale dataset of real coding agent sessions drawn from open-source developers working in natural settings. The dataset currently holds 6,000 sessions comprising more than 63,000 user prompts and 355,000 agent tool calls. Unlike curated benchmarks, SWE-chat is a living dataset — its collection pipeline automatically and continually discovers and processes sessions from public repositories, meaning the dataset grows over time without manual curation.

The paper's empirical analysis surfaces several notable findings about how coding agents actually perform in practice. Coding behavior is bimodal: in 41% of sessions, agents author virtually all committed code (a pattern the authors label "vibe coding"), while in 23% of sessions humans write all the code themselves. Despite rapidly improving model capabilities, just 44% of all agent-produced code ultimately survives into user commits. Agent-written code also introduces more security vulnerabilities than human-authored code. User friction is substantial — through corrections, failure reports, and interruptions, users push back against agent outputs in 44% of all turns.
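To make the three headline metrics concrete, here is a minimal Python sketch of how code survival, pushback rate, and per-session authorship mode could be computed from session records. Everything in it is an assumption for illustration: the `Session` fields, the `agent_threshold` cutoff for "virtually all" agent-authored code, and the toy numbers are hypothetical and do not reflect SWE-chat's actual schema, the authors' definitions, or their results.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical fields for illustration; not SWE-chat's real schema.
    agent_lines_written: int    # lines the agent produced during the session
    agent_lines_committed: int  # agent lines that survived into user commits
    human_lines_committed: int  # committed lines written by the human
    turns: int                  # total user turns
    pushback_turns: int         # turns with a correction, failure report, or interruption


def survival_rate(sessions):
    """Share of all agent-produced lines that end up in user commits."""
    written = sum(s.agent_lines_written for s in sessions)
    kept = sum(s.agent_lines_committed for s in sessions)
    return kept / written if written else 0.0


def pushback_rate(sessions):
    """Share of all turns in which the user pushes back on the agent."""
    turns = sum(s.turns for s in sessions)
    pushed = sum(s.pushback_turns for s in sessions)
    return pushed / turns if turns else 0.0


def authorship_mode(session, agent_threshold=0.95):
    """Label a session by who wrote its committed code.

    agent_threshold is an assumed cutoff for "virtually all" code.
    """
    committed = session.agent_lines_committed + session.human_lines_committed
    if committed == 0:
        return "no-commit"
    share = session.agent_lines_committed / committed
    if share >= agent_threshold:
        return "agent-authored"
    if share == 0:
        return "human-authored"
    return "mixed"


# Made-up toy data, only to exercise the functions.
sessions = [
    Session(200, 100, 0, 10, 5),
    Session(50, 0, 80, 4, 1),
    Session(100, 40, 60, 6, 2),
]

print(survival_rate(sessions))
print(pushback_rate(sessions))
print([authorship_mode(s) for s in sessions])
```

Note that the two rates here are pooled over all lines and all turns rather than averaged per session; the two choices can give different numbers when session sizes vary widely, so any comparison against the reported figures would need to follow the paper's own definitions.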

The authors position SWE-chat as an empirical foundation for moving beyond curated benchmarks toward an evidence-based understanding of AI agent performance in real developer workflows. By capturing complete interaction traces with human vs. agent code authorship attribution, the dataset enables researchers to study failure modes, efficiency gaps, and user behavior in ways that controlled benchmarks cannot.

Key facts

- SWE-chat contains 6,000 sessions, 63,000+ user prompts, and 355,000 agent tool calls from open-source developers.
- It is a living dataset: the collection pipeline automatically and continually discovers new sessions from public repositories.
- Coding patterns are bimodal: agents author virtually all committed code in 41% of sessions ("vibe coding"), while humans write all code in 23%.
- Only 44% of all agent-produced code survives into user commits.
- Agent-written code introduces more security vulnerabilities than human-authored code.
- Users push back against agent outputs (via corrections, failure reports, and interruptions) in 44% of all turns.
- The dataset captures complete interaction traces with human vs. agent code authorship attribution.

Topics

#benchmarks #coding-assistant #agent-framework #empirical-study #developer-tools

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 23, 2026 · 11:04 UTC.
