ProcGrep fingerprints coding agents by behavioral habits, not just scores
Researchers introduce ProcGrep, a library that identifies coding agents by their procedural "fingerprints" — behavioral habits extracted from task trajectories — achieving 85.7% accuracy in attributing unseen trajectories to the correct agent across ten models.
As benchmark scores saturate, ProcGrep provides a concrete mechanism for distinguishing agents by how they solve problems — enabling procedural auditing, task-aware routing, and cost analysis that success-rate metrics alone cannot support.
Hamidah Oderinwale's paper argues that benchmark scores reveal what an agent got right but not how it got there, and proposes a framework for comparing agents procedurally across models, tasks, and approaches. The core contribution is the concept of agent "fingerprints": procedural signatures derived from behavioral habits. These are built with a vocabulary induction technique designed to be maximally compressive (suppressing surface-level variation) while remaining expressive enough to expose model-specific quirks. A probe trained over these signatures attributes an unseen trajectory to the correct agent at 85.7% accuracy across ten agents, with task leakage controlled.
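To make "attribution with task leakage controlled" concrete, here is a minimal sketch of the idea, not the paper's actual probe. It assumes a fingerprint is simply the relative frequency of consecutive action pairs and swaps in a nearest-centroid classifier. The agent names, task ids, action labels, and every function name are hypothetical. The point it illustrates is the split: whole tasks are held out, so the probe cannot succeed by recognizing a task it has already seen.

```python
from collections import Counter, defaultdict

def fingerprint(actions):
    """Relative frequencies of consecutive action pairs (an assumed stand-in
    for the paper's induced procedural vocabulary)."""
    bigrams = Counter(zip(actions, actions[1:]))
    total = sum(bigrams.values())
    return {bg: count / total for bg, count in bigrams.items()}

def centroid(fingerprints):
    """Average a list of sparse fingerprints key by key."""
    acc = defaultdict(float)
    for fp in fingerprints:
        for key, value in fp.items():
            acc[key] += value / len(fingerprints)
    return dict(acc)

def l1_distance(a, b):
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))

def fit(train):
    """One centroid fingerprint per agent."""
    by_agent = defaultdict(list)
    for rec in train:
        by_agent[rec["agent"]].append(fingerprint(rec["actions"]))
    return {agent: centroid(fps) for agent, fps in by_agent.items()}

def predict(centroids, actions):
    fp = fingerprint(actions)
    return min(centroids, key=lambda agent: l1_distance(centroids[agent], fp))

def split_by_task(records, held_out_tasks):
    """Hold out entire tasks so no task appears in both train and test."""
    train = [r for r in records if r["task"] not in held_out_tasks]
    test = [r for r in records if r["task"] in held_out_tasks]
    return train, test

# Toy, made-up trajectories: agent_a tends to read before editing,
# agent_b tends to alternate edit and test.
records = [
    {"agent": "agent_a", "task": "t1", "actions": ["read", "read", "edit", "test"]},
    {"agent": "agent_b", "task": "t1", "actions": ["edit", "test", "edit", "test"]},
    {"agent": "agent_a", "task": "t2", "actions": ["read", "read", "read", "edit", "test"]},
    {"agent": "agent_b", "task": "t2", "actions": ["edit", "test", "edit", "test", "edit", "test"]},
    {"agent": "agent_a", "task": "t3", "actions": ["read", "read", "edit", "test", "read"]},
    {"agent": "agent_b", "task": "t3", "actions": ["edit", "test", "edit", "edit", "test"]},
]

train, test = split_by_task(records, held_out_tasks={"t3"})
centroids = fit(train)
correct = sum(predict(centroids, r["actions"]) == r["agent"] for r in test)
accuracy = correct / len(test)
print(f"held-out-task attribution accuracy: {accuracy:.3f}")
```

The toy data is far too small to say anything about real agents; the sketch only shows why a task-grouped split is a stricter test than a random split over trajectories.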
The framework is applied to SWE-Bench, the software engineering evaluation dataset, to study the structural distinctness of agent trajectories. The analysis finds that behavioral similarity is highest among models from similar release periods and those in a distillation relationship — a distilled student model and its teacher exhibit a Jensen-Shannon divergence of 0.25, approximately half the divergence observed between other model pairs. The authors introduce ProcGrep, a library for top-down procedural auditing and evaluation of agents given their traces, and identify potential applications including task-aware model routing, agent monitoring, and finer-grained cost analysis.
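For readers unfamiliar with the metric, the Jensen-Shannon divergence between two distributions P and Q is the average of KL(P || M) and KL(Q || M), where M is the midpoint (P + Q) / 2. The sketch below computes it over action-frequency distributions. The action labels and traces are invented for illustration, and the base-2 logarithm (which bounds the value between 0 and 1) is an assumption, since the summary does not say which base the paper's 0.25 figure uses.

```python
import math
from collections import Counter

def action_distribution(actions, vocabulary):
    """Normalize action counts into probabilities over a fixed vocabulary."""
    counts = Counter(actions)
    total = sum(counts[a] for a in vocabulary)
    return [counts[a] / total for a in vocabulary]

def kl_divergence(p, q):
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded divergence: 0 for identical distributions,
    1 (in base 2) for distributions with no shared support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical action vocabulary and traces, for illustration only.
vocabulary = ["read", "edit", "test", "search"]
teacher = ["read", "edit", "test", "read", "edit", "test", "search", "read"]
student = ["read", "edit", "test", "read", "edit", "test", "read", "read"]
unrelated = ["search", "search", "edit", "search", "test", "search", "search", "edit"]

p_teacher = action_distribution(teacher, vocabulary)
p_student = action_distribution(student, vocabulary)
p_unrelated = action_distribution(unrelated, vocabulary)

print("teacher vs student:  ", js_divergence(p_teacher, p_student))
print("teacher vs unrelated:", js_divergence(p_teacher, p_unrelated))
```

Because the measure is symmetric, it does not matter which model of a pair is treated as the reference, which suits a comparison between a teacher and its distilled student.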
Key facts
- 01 A probe over procedural fingerprints attributes an unseen trajectory to the correct agent at 85.7% accuracy across ten agents, controlling for task leakage.
- 02 The paper introduces 'fingerprints' (procedural signatures derived from agents' behavioral habits) as a complement to benchmark success rates.
- 03 A vocabulary induction technique is used to build maximally compressive yet expressive procedural representations of agent problem-solving.
- 04 The framework is applied to SWE-Bench to study structural distinctness of agent trajectories.
- 05 Behavior is most similar between models from similar release periods and those in a distillation relationship.
- 06 A distilled student model and its teacher show a Jensen-Shannon divergence of 0.25, roughly half the distance between other model pairs.
- 07 The authors release ProcGrep, a library for top-down procedural auditing and evaluation of agents from their traces.
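The summary does not describe how the vocabulary induction technique works, so the following is only an analogy, not the paper's method or ProcGrep's API. It applies a byte-pair-encoding style merge loop to action sequences, one familiar way to obtain a compressive vocabulary: the most frequent adjacent pair of actions is repeatedly fused into a single higher-level token. All names and the toy trajectories are hypothetical.

```python
from collections import Counter

def merge_pair(seq, pair, new_token):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def induce_vocabulary(trajectories, num_merges):
    """BPE-style induction: fuse the most frequent adjacent pair, up to
    `num_merges` times, stopping early once no pair occurs at least twice."""
    seqs = [list(t) for t in trajectories]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for s in seqs:
            pair_counts.update(zip(s, s[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # merging a pair seen once would not compress anything
        merges.append(pair)
        seqs = [merge_pair(s, pair, pair[0] + "+" + pair[1]) for s in seqs]
    return merges, seqs

# Made-up trajectories in which read -> edit -> test recurs as a habit.
trajectories = [
    ["read", "edit", "test", "read", "edit", "test"],
    ["read", "edit", "test", "submit"],
]
merges, compressed = induce_vocabulary(trajectories, num_merges=5)
print(compressed)
```

In this toy case the recurring read, edit, test loop collapses into one token, which hints at how a compressed vocabulary can suppress surface variation while keeping habitual patterns visible.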
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.