SWE-chat dataset reveals how developers really use AI coding agents
Researchers Joachim Baumann, Vishakh Padmakumar, and Xiang Li introduce SWE-chat, the first large-scale dataset of real-world coding agent sessions, containing 6,000 sessions, 63,000+ user prompts, and 355,000 agent tool calls collected from open-source developers.
Practitioners and researchers evaluating AI coding agents can use SWE-chat's real-world interaction traces to benchmark agent reliability, study failure modes, and design interventions that address the security and code-survival gaps that curated benchmarks miss.
Baumann, Padmakumar, and Li introduce SWE-chat, described as the first large-scale dataset of real coding agent sessions drawn from open-source developers working in natural settings. The dataset currently holds 6,000 sessions comprising more than 63,000 user prompts and 355,000 agent tool calls. Unlike curated benchmarks, SWE-chat is a living dataset — its collection pipeline automatically and continually discovers and processes sessions from public repositories, meaning the dataset grows over time without manual curation.
The paper's empirical analysis surfaces several notable findings about how coding agents actually perform in practice. Coding behavior is bimodal: in 41% of sessions, agents author virtually all committed code (a pattern the authors label "vibe coding"), while in 23% of sessions humans write all the code themselves. Despite rapidly improving model capabilities, just 44% of all agent-produced code ultimately survives into user commits. Agent-written code also introduces more security vulnerabilities than human-authored code. User friction is substantial — through corrections, failure reports, and interruptions, users push back against agent outputs in 44% of all turns.
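To make these metrics concrete, here is a minimal sketch of how figures like the survival rate, the pushback rate, and the bimodal session split could be computed from per-session records. The `Session` fields, the `summarize` helper, and the 0.95 cutoff for "virtually all" are assumptions made for illustration. They are not SWE-chat's actual schema or the authors' method.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Session:
    """Hypothetical per-session record; field names are illustrative only."""
    agent_lines_written: int    # lines the agent produced during the session
    agent_lines_committed: int  # agent lines that ended up in user commits
    human_lines_committed: int  # committed lines authored by the human
    turns: int                  # number of user turns in the session
    pushback_turns: int         # turns with a correction, failure report, or interruption


def agent_share(s: Session) -> Optional[float]:
    """Fraction of committed lines authored by the agent (None if nothing was committed)."""
    total = s.agent_lines_committed + s.human_lines_committed
    return s.agent_lines_committed / total if total else None


def summarize(sessions: List[Session], threshold: float = 0.95) -> Dict[str, float]:
    """Aggregate metrics; `threshold` is an assumed cutoff for 'virtually all' agent code."""
    written = sum(s.agent_lines_written for s in sessions)
    survived = sum(s.agent_lines_committed for s in sessions)
    turns = sum(s.turns for s in sessions)
    pushback = sum(s.pushback_turns for s in sessions)
    shares = [agent_share(s) for s in sessions]
    n = len(sessions)
    agent_dominated = sum(1 for x in shares if x is not None and x >= threshold)
    human_only = sum(1 for x in shares if x is not None and x == 0.0)
    return {
        "survival_rate": survived / written if written else 0.0,
        "pushback_rate": pushback / turns if turns else 0.0,
        "agent_dominated_sessions": agent_dominated / n if n else 0.0,
        "human_only_sessions": human_only / n if n else 0.0,
    }


if __name__ == "__main__":
    # Made-up toy data, not drawn from the dataset.
    demo = [
        Session(100, 40, 0, 10, 5),  # agent wrote all committed code
        Session(0, 0, 50, 5, 1),     # human wrote all committed code
        Session(100, 50, 50, 5, 2),  # mixed authorship
    ]
    print(summarize(demo))
```

Sessions with no committed code are excluded from both the agent-dominated and human-only counts in this sketch. The paper may handle such sessions differently.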
The authors position SWE-chat as an empirical foundation for moving beyond curated benchmarks toward an evidence-based understanding of AI agent performance in real developer workflows. By capturing complete interaction traces with human vs. agent code authorship attribution, the dataset enables researchers to study failure modes, efficiency gaps, and user behavior in ways that controlled benchmarks cannot.
Key facts
- SWE-chat contains 6,000 sessions, 63,000+ user prompts, and 355,000 agent tool calls from open-source developers.
- It is a living dataset: the collection pipeline automatically and continually discovers new sessions from public repositories.
- Coding patterns are bimodal: agents author virtually all committed code in 41% of sessions ("vibe coding"), while humans write all code in 23%.
- Only 44% of all agent-produced code survives into user commits.
- Agent-written code introduces more security vulnerabilities than human-authored code.
- Users push back against agent outputs (via corrections, failure reports, and interruptions) in 44% of all turns.
- The dataset captures complete interaction traces with human vs. agent code authorship attribution.