Automated prompt optimization lifts LLM game agents from 0% to 72.5% on PutNext
A new automated prompt optimization framework for LLM agents iteratively refines prompts using environment feedback, reaching up to 72.5% success on a task where the baseline agent scores 0%.
The framework shows that automated prompt optimization alone, with no fine-tuning, can turn an LLM agent that fails completely (0% on PutNext) into one that succeeds up to 72.5% of the time. This suggests that prompt engineering can be automated systematically rather than done by hand.
Fernandes, Fehring, and Eimer present an automated prompt optimization framework designed to address the high sensitivity of LLM agents to their prompts in interactive environments, where prompt engineering has traditionally been a manual, task-specific process. The framework decomposes the observation-to-action pipeline into two specialized modules: a goal-conditioned descriptor agent and an action selection agent. Each module's prompt is iteratively refined through an LLM-driven evolutionary loop that uses environment returns as its optimization signal, removing the need for human supervision or model fine-tuning.
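The two-module split described above can be pictured with a minimal Python sketch. This is not the authors' implementation: the class name `TwoStagePolicy`, the prompt layout, the fallback rule for unparseable replies, and the `stub_llm` stand-in are all assumptions made for illustration. The only idea taken from the summary is that one prompt drives a goal-conditioned description of the observation and a second prompt drives action selection.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class TwoStagePolicy:
    descriptor_prompt: str  # prompt for the goal-conditioned descriptor agent
    action_prompt: str      # prompt for the action selection agent
    llm: LLM

    def describe(self, goal: str, observation: str) -> str:
        # Stage 1: summarize the raw observation with respect to the goal.
        return self.llm(
            f"{self.descriptor_prompt}\nGoal: {goal}\nObservation: {observation}"
        )

    def act(self, goal: str, observation: str, actions: List[str]) -> str:
        # Stage 2: choose an action from the description, not the raw observation.
        description = self.describe(goal, observation)
        reply = self.llm(
            f"{self.action_prompt}\nGoal: {goal}\n"
            f"Description: {description}\nAllowed actions: {', '.join(actions)}"
        ).strip()
        # Assumption: fall back to the first action if the reply is not valid.
        return reply if reply in actions else actions[0]


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model so the sketch runs offline.
    if "Allowed actions:" in prompt:
        return "pick up"
    return "A red ball is directly ahead."


policy = TwoStagePolicy(
    descriptor_prompt="Describe only what matters for the goal.",
    action_prompt="Choose exactly one allowed action.",
    llm=stub_llm,
)
chosen = policy.act(
    "pick up the red ball", "raw grid text", ["turn left", "go forward", "pick up"]
)
print(chosen)
```

Keeping the two prompts as separate fields is what would let an optimizer revise one module without touching the other.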
Central to the approach are two components: a behavior analyzer that attributes episode outcomes to specific prompt components, and a mutator that proposes targeted revisions to those components. Candidate prompts are validated through environment rollouts before being adopted. The authors evaluate their pipeline on all five BabyAI tasks within the BALROG benchmark, comparing against BALROG's RobustCoTAgent under both plain and guided prompt initializations.
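The analyze, mutate, and validate cycle can be sketched as a simple loop. The sketch below is a toy under stated assumptions, not the paper's algorithm. In the framework the analyzer and mutator are LLM-driven, while here they are hard-coded stand-ins (`toy_analyze`, `toy_mutate`). The prompt-as-dictionary representation, the strict-improvement adoption rule, and the function names are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical representation: a prompt is a set of named components.
Prompt = Dict[str, str]


def mean_return(prompt: Prompt, rollout: Callable[[Prompt], float], episodes: int) -> float:
    # Average environment return over several rollouts with this prompt.
    return sum(rollout(prompt) for _ in range(episodes)) / episodes


def optimize(
    prompt: Prompt,
    rollout: Callable[[Prompt], float],
    analyze: Callable[[Prompt, float], str],
    mutate: Callable[[Prompt, str], Prompt],
    generations: int = 5,
    episodes: int = 4,
) -> Tuple[Prompt, float]:
    best, best_score = dict(prompt), mean_return(prompt, rollout, episodes)
    for _ in range(generations):
        component = analyze(best, best_score)   # attribute the outcome to a component
        candidate = mutate(best, component)     # propose a targeted revision
        score = mean_return(candidate, rollout, episodes)  # validate via rollouts
        if score > best_score:                  # assumption: adopt only on improvement
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins so the loop runs without an LLM or a real environment.
def toy_rollout(prompt: Prompt) -> float:
    return 1.0 if "step by step" in prompt["strategy"] else 0.0


def toy_analyze(prompt: Prompt, score: float) -> str:
    return "strategy"


def toy_mutate(prompt: Prompt, component: str) -> Prompt:
    revised = dict(prompt)
    revised[component] = prompt[component] + " Plan step by step."
    return revised


seed_prompt = {"role": "You are a grid-world agent.", "strategy": "Move toward the goal."}
best, score = optimize(seed_prompt, toy_rollout, toy_analyze, toy_mutate)
print(score, best["strategy"])
```

Because a candidate replaces the current prompt only after its rollouts score higher, a bad mutation costs some evaluation episodes but never degrades the adopted prompt.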
Results show consistent performance improvements across all tasks and initialization conditions. The most dramatic gain appears on PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success; the new framework reaches up to 72.5% success using the identical underlying LLM with optimized prompts. The authors conclude that combining a multi-agent framework with automatic prompt optimization can meaningfully enhance LLM capabilities without fine-tuning or extensive human oversight.
Key facts
- 01 The framework decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent.
- 02 Prompts are refined through an LLM-driven evolutionary loop guided by environment returns, with no model weight updates required.
- 03 A behavior analyzer attributes episode outcomes to specific prompt components; a mutator then proposes targeted revisions.
- 04 Candidate prompt revisions are validated through environment rollouts before adoption.
- 05 Evaluated on all five BabyAI tasks in the BALROG benchmark against the RobustCoTAgent baseline.
- 06 On PutNext, the RobustCoTAgent achieves 0% success; the new framework reaches up to 72.5% success using the same LLM.
- 07 Improvements are consistent across tasks under both plain and guided prompt initializations.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.