GameCraft-Bench tests if agents can build full games in Godot
GameCraft-Bench is a new benchmark of 140 Godot tasks across 15 game families that tests whether frontier coding agents can generate complete, playable games end-to-end — and finds the best agent scores only 41.46%.
GameCraft-Bench exposes a concrete ceiling on current coding agents' ability to produce fully playable games: even the strongest frontier agent reaches only 41.46% on tasks that require integrated scripts, scenes, assets, and runtime interaction, a gap that partial code-generation benchmarks do not capture.
Tongxu Luo, Rongsheng Wang, and Jiaxi Bi present GameCraft-Bench, a benchmark that formalizes end-to-end game generation as the problem of producing a complete game artifact (scripts, scenes, assets, rendering, and runtime interactions) that realizes a natural-language specification through observable player-game interaction in a target environment. The authors argue that a sound evaluation of this setting must satisfy three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. To meet them, they propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging.
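To make the idea of rubric-guided judging concrete, the following is a minimal sketch of how per-rubric judgments on a replayed demonstration might be rolled up into a percentage. It is not the authors' implementation: the summary does not say how rubric scores are weighted or aggregated, so the weighted mean, the `RubricItem` fields, the `Judge` signature, and the string "observations" standing in for captured frames are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RubricItem:
    """One checkable criterion. Name and weight are hypothetical."""
    name: str
    weight: float


# Hypothetical judge interface: given a rubric item and the observations
# captured while replaying a demonstration, return a score in [0, 1].
# In a real setup this would be a multimodal model; here it is a plain callable.
Judge = Callable[[RubricItem, Sequence[str]], float]


def score_task(rubric: Sequence[RubricItem],
               observations: Sequence[str],
               judge: Judge) -> float:
    """Weighted mean of judge scores, as a percentage (assumed aggregation)."""
    total_weight = sum(item.weight for item in rubric)
    if total_weight <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = 0.0
    for item in rubric:
        raw = judge(item, observations)
        # Clamp so a misbehaving judge cannot push the score out of range.
        earned += item.weight * min(1.0, max(0.0, raw))
    return 100.0 * earned / total_weight


def benchmark_score(task_scores: Sequence[float]) -> float:
    """Unweighted mean over tasks (also an assumption)."""
    if not task_scores:
        raise ValueError("no task scores given")
    return sum(task_scores) / len(task_scores)


def keyword_judge(item: RubricItem, observations: Sequence[str]) -> float:
    """Stub judge: full credit if the item's name was observed, else none."""
    return 1.0 if item.name in observations else 0.0


rubric = [
    RubricItem("mechanics", 2.0),
    RubricItem("visual_feedback", 1.0),
    RubricItem("content", 1.0),
]
# A game that implements the core mechanic but nothing else.
task_pct = score_task(rubric, ["mechanics"], keyword_judge)
```

Under these assumed weights, a game that implements only the core mechanic earns partial credit, which mirrors the reported pattern of recognizable mechanics without complete content or visual feedback.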
GameCraft-Bench instantiates this framework as 140 Godot tasks distributed across 15 game families. Evaluations of frontier coding agents reveal that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis shows that while agents often implement recognizable game mechanics, they consistently struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation — pointing to a gap between partial mechanical implementation and fully realized, playable experiences.
Key facts
- GameCraft-Bench is a benchmark for evaluating end-to-end game generation by coding agents inside a real game engine.
- The benchmark comprises 140 tasks across 15 game families, all targeting the Godot engine.
- Evaluation uses replayed gameplay demonstrations and rubric-guided multimodal judging.
- The framework is defined by three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification.
- The strongest frontier coding agent achieves only 41.46% on the benchmark.
- Most evaluated agents score below 40%.
- Agents often implement recognizable mechanics but fail on complete content, visual feedback, and coherent presentation.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.