Curation-Bench tests if coding agents can automate AI data curation
Researchers introduce Curation-Bench, an agent-centric benchmark revealing that scaffolded coding agents can autonomously compose data-selection policies that outperform strong published baselines at one-tenth their data budget.
The paper demonstrates that scaffolded method adaptation — not open-ended prompting — is what enables generalist coding agents to reliably advance data-curation research, a finding with direct implications for how agentic systems are designed for AI development workflows.
Feiyang Kang, Hanze Li, and Adam Nguyen present Curation-Bench, a benchmark framing data curation as an agentic task. The benchmark fixes the model, training recipe, and evaluation suite, then gives agents command-line access to inspect data, implement curation policies, submit them to a training/evaluation pipeline, and revise based on noisy benchmark feedback — mirroring the iterative loop human practitioners follow. The paper instantiates the benchmark in a vision-language instruction-tuning setting.
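The submit-and-revise loop described above can be illustrated with a toy sketch. This is not the benchmark's code or API. Every name here (`make_top_k_policy`, `noisy_evaluate`, `curation_loop`) is hypothetical, the per-example quality scores are synthetic, and a fixed list of candidate policies stands in for the agent's proposals. The only idea it borrows from the article is that the evaluation pipeline stays fixed while policies are tried one at a time against noisy feedback.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# A policy maps per-example scores to the indices of the examples it keeps.
Policy = Callable[[Sequence[float]], List[int]]


def make_top_k_policy(fraction: float) -> Policy:
    """Hypothetical policy family: keep the top `fraction` of examples by score."""
    def policy(scores: Sequence[float]) -> List[int]:
        k = max(1, int(len(scores) * fraction))
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return ranked[:k]
    return policy


def noisy_evaluate(selected: List[int], scores: Sequence[float],
                   rng: random.Random, noise: float = 0.01) -> float:
    """Stand-in for the fixed train/eval pipeline (assumption):
    mean score of the selected subset plus Gaussian noise."""
    mean_quality = sum(scores[i] for i in selected) / len(selected)
    return mean_quality + rng.gauss(0.0, noise)


def curation_loop(scores: Sequence[float], fractions: Sequence[float],
                  seed: int = 0) -> Tuple[Tuple[float, Optional[float]],
                                          List[Tuple[float, float]]]:
    """Try each candidate policy once, record the noisy result, keep the best."""
    rng = random.Random(seed)
    best: Tuple[float, Optional[float]] = (float("-inf"), None)
    history: List[Tuple[float, float]] = []
    for fraction in fractions:
        selected = make_top_k_policy(fraction)(scores)
        result = noisy_evaluate(selected, scores, rng)
        history.append((fraction, result))
        if result > best[0]:
            best = (result, fraction)
    return best, history


scores = [i / 100 for i in range(100)]  # synthetic quality scores
best, history = curation_loop(scores, [0.1, 0.5, 1.0])
```

In this toy setup, all three candidates are variants of one policy family (top-k by score), which is the kind of local tuning the paper's trajectory analysis reports.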
Without special scaffolding, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent "execution-research gap": agents predominantly tune local variants of an initial policy rather than exploring fundamentally different policy families, even when provided with strategy guides and paper references. To address this, the authors introduce scaffolds that require each iteration to cite, instantiate, and adapt a prior method, shifting agent behavior toward method-guided exploration. Under this scaffolded regime, the agent autonomously composes — without human design input — a data-selection policy that outperforms strong published baselines while using only one-tenth of their data budget. The paper concludes that current agents can run the curation loop, but reliable data research requires scaffolded method adaptation rather than open-ended prompting alone. Code and benchmark are open-sourced.
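One way to picture the cite, instantiate, and adapt requirement is as a gate that each iteration's proposal must pass before it is submitted. The sketch below is an assumption for illustration only. The article does not describe how the authors implement their scaffolds, and the `Proposal` fields, the `scaffold_violations` function, and the placeholder method names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Proposal:
    """One iteration's plan under a hypothetical scaffold."""
    cited_method: str   # name of the prior method being built on
    instantiation: str  # how that method is implemented in this setting
    adaptation: str     # what is changed relative to the cited method


def scaffold_violations(proposal: Proposal, known_methods: Set[str]) -> List[str]:
    """Return the reasons a proposal fails the gate; an empty list means it passes."""
    problems: List[str] = []
    if proposal.cited_method not in known_methods:
        problems.append("cited method is not in the reference library")
    if not proposal.instantiation.strip():
        problems.append("missing instantiation")
    if not proposal.adaptation.strip():
        problems.append("missing adaptation")
    return problems


# Placeholder method names, not real published methods.
library = {"method-a", "method-b"}
accepted = Proposal("method-a", "score examples with method-a", "apply per task group")
rejected = Proposal("unlisted-method", "", "lower the threshold")

accepted_problems = scaffold_violations(accepted, library)
rejected_problems = scaffold_violations(rejected, library)
```

A gate like this rejects a proposal that only nudges a parameter without naming a prior method, which is one plausible way to push iterations toward different policy families.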
Key facts
- 01 Curation-Bench is a new agent-centric benchmark for automating the AI data-curation loop.
- 02 The benchmark fixes the model, training recipe, and evaluation suite, giving agents command-line access to inspect data, implement policies, and revise.
- 03 Out-of-the-box agents reach strong published data-selection baselines within ten iterations.
- 04 Trajectory analysis reveals an "execution-research gap": agents tune local policy variants rather than exploring new policy families.
- 05 Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration.
- 06 The scaffolded agent autonomously composes a data-selection policy that outperforms published baselines at one-tenth their data budget.
- 07 Code and benchmark are open-sourced.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.