Asuka-Bench tests code agents on vague prompts and iterative refinement
Asuka-Bench is a new benchmark that evaluates code agents on underspecified web development tasks requiring multi-round feedback loops, revealing large performance gaps among 8 tested LLMs.
Score breakdown
Asuka-Bench exposes a dimension of code-agent capability that existing one-shot benchmarks do not measure: iterative repair from vague, evolving requirements. Its results are far from saturated (the top model completes only 52% of projects after three rounds), so it remains a meaningful challenge for current LLMs.
Asuka-Bench, introduced by Xin Wang, Liangtai Sun, and Yaoming Zhu, targets a fundamental mismatch between how existing code-generation benchmarks work and how real web development actually unfolds. Standard benchmarks score a single mapping from a complete prompt to a one-shot output, but real users rarely provide full specifications upfront — requirements often emerge only after they see and react to an intermediate result. Asuka-Bench addresses this by pairing underspecified user intent with a multi-round refinement loop grounded in browser-rendered behavior.
The benchmark's evaluation pipeline forms a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM converts the evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2,402 expected outcomes. Across 8 LLMs and 2 agent frameworks, weighted Task Pass Rate spans 38 percentage points, and models also differ substantially in their ability to repair from feedback. The benchmark is far from saturated: even the top-performing model completes only 52% of projects after three rounds, which suggests it remains a challenging and meaningful evaluation target.
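The closed loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the names `run_refinement_loop`, `CriterionResult`, and the three `toy_*` functions are hypothetical, the toy agents are plain functions standing in for LLM-backed components, and stopping early once every criterion passes is an assumption. Only the three roles and the three-round budget come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CriterionResult:
    """Outcome of one evaluation criterion (hypothetical structure)."""
    criterion: str
    passed: bool


# Hypothetical signatures for the three roles in the loop.
CodeAgent = Callable[[str, str], str]              # (intent, feedback) -> project
UIAgent = Callable[[str], List[CriterionResult]]   # project -> per-criterion results
UserLLM = Callable[[List[CriterionResult]], str]   # results -> natural-language feedback


def run_refinement_loop(
    intent: str,
    code_agent: CodeAgent,
    ui_agent: UIAgent,
    user_llm: UserLLM,
    max_rounds: int = 3,
) -> List[List[CriterionResult]]:
    """Run up to max_rounds of generate -> test -> feedback; return results per round."""
    feedback = ""  # the first round sees only the underspecified intent
    history: List[List[CriterionResult]] = []
    for _ in range(max_rounds):
        project = code_agent(intent, feedback)
        results = ui_agent(project)
        history.append(results)
        if all(r.passed for r in results):
            break  # assumption: stop once every criterion passes
        feedback = user_llm(results)
    return history


# Toy stand-ins so the sketch runs without any LLM or browser.
def toy_code_agent(intent: str, feedback: str) -> str:
    features = {"login form"}
    if "dark mode" in feedback:
        features.add("dark mode")  # "repair" based on the feedback text
    return ",".join(sorted(features))


def toy_ui_agent(project: str) -> List[CriterionResult]:
    built = set(project.split(","))
    return [CriterionResult(c, c in built) for c in ("login form", "dark mode")]


def toy_user_llm(results: List[CriterionResult]) -> str:
    missing = [r.criterion for r in results if not r.passed]
    return "I expected to see: " + ", ".join(missing)


if __name__ == "__main__":
    rounds = run_refinement_loop(
        "Build me a site with a login", toy_code_agent, toy_ui_agent, toy_user_llm
    )
    for i, results in enumerate(rounds, start=1):
        print(i, [(r.criterion, r.passed) for r in results])
```

In the toy run, the first round misses a requirement that was never stated in the prompt, and the feedback brings it out for the second round. That is the kind of behavior a one-shot score cannot capture.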
Key facts
- 01 Asuka-Bench evaluates code agents on underspecified user intent and multi-round refinement, not one-shot complete-prompt tasks.
- 02 The benchmark comprises 50 web tasks, 784 evaluation criteria, and 2,402 expected outcomes.
- 03 Evaluation uses a closed loop: a Code Agent, a UI Agent running test cases on a deployed site, and a User LLM generating natural-language feedback.
- 04 Eight LLMs were benchmarked across 2 agent frameworks.
- 05 Weighted Task Pass Rate varies by 38 percentage points across tested models.
- 06 Even the strongest model completes only 52% of projects after three rounds.
- 07 Models differ substantially in their ability to repair from feedback, not just in initial generation quality.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.