RefGRPO closes LLM agent reflection gap with a free calibration bonus
Yinglun Zhu proposes RefGRPO, an RL augmentation that targets LLM agents' tendency to mis-assess their own outputs. It adds a free calibration bonus that requires no reward model or external annotation, and it improves reflection accuracy and task performance at the same time.
Score breakdown
RefGRPO demonstrates that agent self-assessment can be substantially improved without any external annotation or reward model, enabling agents to act as their own verifiers grounded in environment feedback.
Yinglun Zhu identifies a persistent "reflection gap" in LLM agents: even when an agent answers correctly, it tends to mis-assess its own performance after observing concrete environment feedback such as execution results, error messages, and tool outputs. Standard RL training barely addresses this because of a credit-assignment mismatch — the RL signal does not directly target the quality of the agent's self-assessment.
To close this gap, the paper introduces RefGRPO, which augments standard RL algorithms with two components: a calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no reward model, LLM judge, or external annotation), and a dynamic schedule on the bonus coefficient. Evaluated on text-to-SQL tasks across five benchmarks, RefGRPO reduces the underconfidence rate from 44.4% to 7.7% and raises task accuracy from 75.1% to 76.5% relative to standard RL baselines.
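The two components can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not give the exact bonus formula or the form of the schedule, so the additive bonus, the linear decay, and every name here (`Rollout`, `bonus_coefficient`, `shaped_reward`, the `start` and `end` values) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    task_correct: bool       # outcome reported by the environment
    reflected_correct: bool  # the agent's own verdict on its output


def bonus_coefficient(step: int, total_steps: int,
                      start: float = 0.5, end: float = 0.0) -> float:
    """Hypothetical linear schedule for the bonus coefficient.

    Assumption: the source only says the schedule is dynamic; a linear
    interpolation from `start` to `end` is used here as a stand-in.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def shaped_reward(rollout: Rollout, coeff: float) -> float:
    """Task reward plus a bonus when the reflection matches the outcome.

    The bonus needs only the rollout's own outcome and reflection, so no
    reward model, LLM judge, or annotation is involved.
    """
    task_reward = 1.0 if rollout.task_correct else 0.0
    agrees = rollout.reflected_correct == rollout.task_correct
    return task_reward + (coeff if agrees else 0.0)
```

Under these assumptions, a correct answer that the agent wrongly flags as incorrect (the underconfident case) earns only the task reward, while a correct answer with a matching reflection earns the task reward plus the bonus.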
The improved calibration has two downstream benefits described in the paper. First, it enables better self-improvement by using the agent's own reflections as pseudo-rewards without requiring outcome supervision. Second, it supports more effective test-time selective prediction, where the agent commits only to rollouts it has flagged as correct — effectively turning the agent into its own verifier grounded in environment feedback.
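Both downstream uses can be sketched in a few lines. This is a hedged illustration rather than the paper's procedure: the function names (`select_answer`, `pseudo_reward`), the choice to return the first flagged rollout, and the binary 0/1 pseudo-reward are all assumptions.

```python
from typing import List, Optional, Tuple

# (answer, flagged_correct) pairs; the flag is the agent's own reflection.
Candidate = Tuple[str, bool]


def select_answer(candidates: List[Candidate]) -> Optional[str]:
    """Selective prediction: commit to the first rollout flagged as correct.

    Returns None (abstain) when the agent flagged no rollout as correct.
    """
    for answer, flagged_correct in candidates:
        if flagged_correct:
            return answer
    return None


def pseudo_reward(flagged_correct: bool) -> float:
    """Use the reflection itself as the reward when no outcome label exists."""
    return 1.0 if flagged_correct else 0.0
```

Both helpers are only as reliable as the reflection flag, which is why calibration matters: a poorly calibrated agent would abstain on correct answers or reward itself for wrong ones.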
Key facts
- 01 The paper identifies a 'reflection gap': LLM agents mis-assess their own outputs even after observing concrete environment feedback, and even for questions they answered correctly.
- 02 Standard RL barely closes this gap due to a credit-assignment mismatch.
- 03 RefGRPO augments standard RL with a free calibration bonus computed by contrasting the agent's reflection with the actual outcome.
- 04 The calibration bonus requires no additional reward model, LLM judge, or external annotation.
- 05 On text-to-SQL across five benchmarks, RefGRPO reduces the underconfidence rate from 44.4% to 7.7%.
- 06 Task accuracy improves from 75.1% to 76.5% compared to standard RL baselines.
- 07 Calibrated reflection enables self-improvement via pseudo-rewards and test-time selective prediction without outcome supervision.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 15, 2026 · 11:57 UTC.