Cognition builds AI system to estimate Devin's engineering productivity
Cognition's team built an automated agent that reviews completed Devin sessions and estimates how many productive engineering hours each session is worth, achieving an r_log of 0.74 on a dataset of 258 sessions from 126 enterprise users.
The system gives organizations a concrete, automated way to convert AI coding sessions into estimated engineering hours and dollar equivalents — replacing guesswork about AI ROI with a validated, production-running measurement tool.
As AI token usage and spend have skyrocketed, engineering leaders have shifted from worrying about whether their teams use enough AI to trying to measure what that usage actually produces. Cognition's team set out to automate productivity measurement for Devin, their autonomous AI software engineer. Rather than attempting to measure dollar impact directly — which the post describes as an unsolved problem — or relying on raw activity metrics like lines of code or commits, they settled on "productive engineering hours": how long a human engineer would have taken to produce the same output. Hours are already how organizations value engineering work, are standardized across organizations, and are convertible to dollar amounts via engineering salaries and contractor rates.
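The hours-to-dollars conversion mentioned above is simple arithmetic. A minimal sketch follows; the function name, the 120-hour total, and the $100/hour rate are illustrative assumptions, not figures from the post.

```python
def hours_to_dollars(hours: float, hourly_rate: float) -> float:
    """Convert estimated engineering hours to a dollar figure at a given rate."""
    if hours < 0 or hourly_rate < 0:
        raise ValueError("hours and hourly_rate must be non-negative")
    return hours * hourly_rate


# Hypothetical inputs: 120 estimated hours at an assumed $100/hour rate.
estimated_hours = 120.0
assumed_hourly_rate = 100.0
print(hours_to_dollars(estimated_hours, assumed_hourly_rate))  # 12000.0
```

The rate is the organization's own input (a salary-derived or contractor rate), so the same hour estimate can be priced differently by different teams.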
To build the system, Cognition collected a ground-truth dataset of 258 sessions from 126 users across a diverse set of enterprise customers, gathered through live interviews and a survey. Each Devin session includes a full execution trace: the user's request, every action taken, the resulting code, and codebase context. An agent reviews each completed session in two steps, first classifying whether it produced useful output and then estimating the equivalent human hours.
For sessions with a pull request, the classifier uses merge status as the signal: merged PRs are included and unmerged PRs are discarded, which the post notes errs on the conservative side. Sessions without a PR require a more complex classification approach, but the source text is truncated before those details are described.
After careful tuning, the model achieved an r_log of 0.74 and appears unbiased, which makes it reliable for estimating aggregated totals even if individual predictions are imperfect. The post describes this as the first automated system measuring AI engineering productivity in production.
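To illustrate why an unbiased estimator can give reliable totals even when individual predictions are off, here is a small sketch. It assumes that r_log means the Pearson correlation between the logarithms of predicted and human-estimated hours; the summary does not define the metric, so that reading, the function names, and the sample numbers are all assumptions.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def r_log(predicted, actual):
    """Assumed definition: Pearson correlation of log-hours."""
    return pearson([math.log(p) for p in predicted],
                   [math.log(a) for a in actual])


def mean_log_error(predicted, actual):
    """Mean of log(predicted / actual); near zero means no systematic bias."""
    ratios = [math.log(p / a) for p, a in zip(predicted, actual)]
    return sum(ratios) / len(ratios)


# Made-up data: every prediction is off by a factor of two, half high, half low.
predicted_hours = [1.0, 2.0, 4.0, 8.0]
human_hours = [2.0, 1.0, 8.0, 4.0]

print(r_log(predicted_hours, human_hours))           # about 0.6
print(mean_log_error(predicted_hours, human_hours))  # about 0.0
print(sum(predicted_hours), sum(human_hours))        # 15.0 15.0
```

In this toy data no single prediction is right, yet the over- and under-estimates cancel and the totals match, which is the property the summary attributes to the model.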
Key facts
- 01 The system estimates how many productive engineering hours each Devin session is worth, validated against human engineer estimates.
- 02 The model achieved an r_log of 0.74 and appears unbiased, making it suitable for estimating aggregated totals.
- 03 The ground-truth dataset consists of 258 sessions from 126 users across diverse enterprise customers.
- 04 Data was collected via live interviews and a survey.
- 05 Each Devin session includes a full execution trace: the user's request, every action taken, resulting code, and codebase context.
- 06 For sessions with a PR, merge status is used as the productivity signal: merged PRs are included, unmerged PRs are discarded.
- 07 Cognition describes this as the first automated system measuring AI engineering productivity in production.
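The merge-status rule for PR sessions can be sketched as a simple filter. The `Session` type and its fields are hypothetical, and sessions without a PR are deliberately left undecided because the summary does not describe how they are classified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Hypothetical record of a completed session."""
    session_id: str
    pr_merged: Optional[bool]  # None means the session has no PR
    estimated_hours: float


def counts_as_productive(session: Session) -> Optional[bool]:
    """True for merged PRs, False for unmerged PRs, None when there is no PR."""
    if session.pr_merged is None:
        return None  # rule for non-PR sessions is not described in the summary
    return session.pr_merged


def total_productive_hours(sessions) -> float:
    """Sum estimated hours over sessions whose PR was merged."""
    return sum(s.estimated_hours for s in sessions
               if counts_as_productive(s) is True)


sessions = [
    Session("a", pr_merged=True, estimated_hours=3.0),
    Session("b", pr_merged=False, estimated_hours=5.0),  # discarded
    Session("c", pr_merged=None, estimated_hours=2.0),   # undecided here
]
print(total_productive_hours(sessions))  # 3.0
```

Discarding unmerged PRs outright can undercount useful work, which matches the summary's note that the rule errs on the conservative side.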
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 13, 2026 · 08:58 UTC.