AI coding agents tested on neuroscience data-to-discovery pipeline
A new empirical study by Horstmann, Lin, and Robie evaluates general-purpose coding agents on a fly optogenetics data-to-discovery pipeline, finding that agents can handle individual stages but fail to complete the full end-to-end workflow.
The study reveals that the gap between stage-level and end-to-end pipeline automation in real scientific workflows is a distinct, underexplored challenge not captured by existing coding agent benchmarks.
Horstmann, Lin, and Robie present an empirical study of general-purpose coding agents applied to a fly optogenetics data-to-discovery pipeline, a scientific workflow in which domain experts typically spend days to months on individual stages. Unlike standard coding benchmarks, the evaluation uses tasks substantially larger in scope, datasets orders of magnitude bigger, and evaluation criteria grounded in domain-expert standards rather than implementation correctness alone. The study finds that agents can solve several individual pipeline stages, indicating that stage-level automation is tractable, but that chaining those successes across all stages to complete the end-to-end pipeline correctly is beyond agents' current abilities.
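One way to see why stage-level success does not automatically carry over to the full pipeline is a back-of-the-envelope calculation. The sketch below is only an illustration, not the study's analysis: it assumes stages succeed independently, and the stage count and per-stage rates are hypothetical numbers chosen for the example, not figures reported by the authors.

```python
from math import prod


def end_to_end_success(stage_rates):
    """Probability that every stage succeeds.

    Assumes stage outcomes are independent, which is a simplification:
    in a real pipeline, an error in an early stage can also degrade
    the stages that follow it.
    """
    return prod(stage_rates)


# Hypothetical per-stage success rates (NOT taken from the study).
hypothetical_rates = [0.8, 0.8, 0.8, 0.8, 0.8]

overall = end_to_end_success(hypothetical_rates)
print(f"{overall:.3f}")  # prints 0.328
```

Under these assumed numbers, an agent that succeeds at each of five stages 80% of the time would complete the whole chain only about a third of the time.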
The paper identifies a critical open challenge: agents struggle most when there is no pre-defined criterion to iterate on and they must instead exercise scientific judgment to assess their own solutions. Mirroring how scientists work, agents sometimes attempt visual inspection of intermediate outputs for self-evaluation, but largely fail to interpret what they see or act on it appropriately. Beyond this judgment gap, the study surfaces challenges that are largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections. The authors conclude by distilling principles for constructing scientific tasks and rigorous evaluation criteria for open-ended problems.
Key facts
- The study evaluates general-purpose coding agents on a fly optogenetics data-to-discovery pipeline.
- Tasks are substantially larger than existing benchmarks, with datasets orders of magnitude bigger.
- Agents can solve several individual pipeline stages, suggesting stage-level automation is tractable.
- Completing the full end-to-end pipeline correctly is beyond agents' current abilities.
- Agents struggle most when no pre-defined iteration criterion exists and scientific judgment is required for self-evaluation.
- Agents sometimes attempt visual inspection of intermediate outputs but largely fail to interpret or act on what they see.
- The paper identifies computational resource management and generalization to large held-out data as challenges absent from existing benchmarks.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.