ContextRL boosts LLM grounding via contrastive context selection
Researchers propose ContextRL, a reinforcement learning method that trains LLMs to select the context supporting a query–answer pair. Compared with standard GRPO, it improves average performance by +2.2% on 5 long-horizon benchmarks and by +1.8% on 12 visual question answering benchmarks.
Score breakdown
The contrastive context-selection objective outperforms simply adding more contrastive data, which indicates that how the training signal is structured, not just which data is used, drives the grounding improvements in both agentic and multimodal LLM settings.
Peiyang Xu, Bangzheng Li, and Sijia Liu present ContextRL, a reinforcement learning framework that addresses a well-known weakness of large language models: the tendency to miss a single critical piece of evidence — such as one line in a tool trace or a subtle visual detail — when reasoning over long or complex contexts. Instead of training solely on final-answer correctness, ContextRL introduces an indirect auxiliary objective in which the model is shown a query, an answer, and two highly similar contexts, then rewarded for identifying the context that actually supports the query–answer pair. This contrastive context-selection signal encourages the model to develop finer-grained grounding without requiring explicit supervision of intermediate reasoning steps.
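The selection objective described above can be sketched as a small reward function. This is a minimal illustration, not the authors' implementation: the A/B prompt layout, the random ordering of the two contexts, the binary 0/1 reward, and all names (`ContrastivePair`, `build_selection_prompt`, `selection_reward`) are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical container for one contrastive training item."""
    query: str
    answer: str
    supporting_context: str  # context that actually supports (query, answer)
    distractor_context: str  # highly similar context that does not


def build_selection_prompt(pair, rng):
    """Return (prompt, correct_label) with the two contexts in random order."""
    contexts = [(pair.supporting_context, True), (pair.distractor_context, False)]
    rng.shuffle(contexts)  # avoid a positional shortcut (assumed design choice)
    labels = ["A", "B"]
    correct = next(label for label, (_, ok) in zip(labels, contexts) if ok)
    prompt = (
        f"Query: {pair.query}\n"
        f"Answer: {pair.answer}\n"
        f"Context A: {contexts[0][0]}\n"
        f"Context B: {contexts[1][0]}\n"
        "Which context supports the query-answer pair? Reply with A or B."
    )
    return prompt, correct


def selection_reward(model_output, correct_label):
    """Binary reward: 1.0 if the model names the supporting context, else 0.0."""
    choice = model_output.strip().upper()[:1]
    return 1.0 if choice == correct_label else 0.0
```

In an RL loop, a reward like this would be computed for each sampled response and passed to the policy update in place of, or alongside, the final-answer reward. How the two signals are combined is not specified in this summary.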
The authors construct contrastive data in two distinct domains. For coding agents, tool-use trajectories serve as contexts, producing 1k contrastive pairs assembled through condition filtering. For multimodal reasoning, images serve as contexts, yielding 7k pairs created via generative editing and similarity search to ensure the two options are visually close but semantically distinct. ContextRL is evaluated against standard GRPO, achieving average improvements of +2.2% across 5 long-horizon benchmarks and +1.8% across 12 diverse visual question answering benchmarks.
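As a rough illustration of how similarity search can yield pairs that are close in appearance but differ in meaning, the sketch below pairs each item with its nearest neighbour that carries a different answer. It is not the paper's pipeline: the embedding vectors, the cosine metric, the `min_similarity` threshold, and the function names are all assumptions, and the generative-editing route is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_contrastive_pairs(items, min_similarity=0.9):
    """Pair each item with its most similar neighbour that has a different answer.

    items: list of (item_id, embedding, answer) tuples (hypothetical schema).
    Returns a list of (item_id, neighbour_id, similarity) tuples.
    """
    pairs = []
    for i, (id_i, emb_i, ans_i) in enumerate(items):
        best = None
        for j, (id_j, emb_j, ans_j) in enumerate(items):
            if i == j or ans_i == ans_j:
                continue  # a contrastive partner must disagree on the answer
            sim = cosine(emb_i, emb_j)
            if sim >= min_similarity and (best is None or sim > best[1]):
                best = (id_j, sim)
        if best is not None:
            pairs.append((id_i, best[0], best[1]))
    return pairs
```

The brute-force double loop is quadratic in the number of items. At the scale mentioned here (thousands of pairs) that is workable, though a larger pool would call for an approximate nearest-neighbour index.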
To isolate the source of these gains, the authors compare ContextRL against data-augmentation baselines that reuse the same contrastive contexts as conventional query–context–answer training examples. Those baselines show little to no improvement, demonstrating that the performance gains stem specifically from the context-selection objective rather than from exposure to additional contrastive data.
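The difference between the two conditions in this comparison can be made concrete by formatting the same contrastive data both ways. The sketch below is an assumption about how such examples might be laid out. The dictionary schema, the function names, and the idea that the second context comes with its own answer are illustrative and not taken from the paper.

```python
def as_selection_example(query, ctx_pos, ans_pos, ctx_neg):
    """Context-selection format: given query and answer, pick the supporting context."""
    return {
        "input": {"query": query, "answer": ans_pos, "contexts": [ctx_pos, ctx_neg]},
        "target": 0,  # index of the supporting context (order would be shuffled in practice)
    }


def as_augmentation_examples(query, ctx_pos, ans_pos, ctx_neg, ans_neg):
    """Data-augmentation format: each context becomes an ordinary QA example."""
    return [
        {"input": {"query": query, "context": ctx_pos}, "target": ans_pos},
        {"input": {"query": query, "context": ctx_neg}, "target": ans_neg},
    ]
```

Both formats expose the model to the same two contexts. Only the first asks it to discriminate between them directly, which is the property the comparison is meant to isolate.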
Key facts
- 01 ContextRL is a context-aware RL method that uses an indirect auxiliary objective: the model is rewarded for selecting the context that supports a query–answer pair.
- 02 For coding agents, 1k contrastive trajectory pairs are built via condition filtering.
- 03 For multimodal reasoning, 7k contrastive image pairs are built via generative editing and similarity search.
- 04 ContextRL achieves an average gain of +2.2% over standard GRPO on 5 long-horizon benchmarks.
- 05 ContextRL achieves an average gain of +1.8% over standard GRPO on 12 diverse visual question answering benchmarks.
- 06 Data-augmentation baselines that use the same contrastive data as standard training examples show little to no improvement, isolating the gains to the context-selection objective.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.