CICL framework scores context by decision impact for tool-using LLM agents
Researchers Xinyu Guan, Qianyang Zhao, and Yuming Deng introduce CICL, a decision-aware context layer that selects and compresses evidence into typed memory cards for tool-using LLM agents, improving file-retrieval hit@1 from 0.58 to 0.78 on 50 SWE-bench Verified instances.
Score breakdown
CICL's separation of the decision signal from the judge model means frontier annotators, local surrogates, and lightweight rankers can be benchmarked under one auditable protocol, providing a reproducible measurement layer for decision-critical context selection in tool-using LLM agents.
Xinyu Guan, Qianyang Zhao, and Yuming Deng introduce CICL, a decision-aware context layer that reframes context selection for tool-using LLM agents as a measurement problem. Rather than treating retrieval as a relevance task, CICL builds a context graph from instance evidence and scores each unit along four dimensions: action shift, outcome uplift, necessity, and negative-transfer risk. High-utility units are packed into typed memory cards delivered to a budgeted agent. A key architectural choice is that the decision signal is separated from the judge model, allowing frontier annotators (Opus-assisted), local surrogates (Qwen, Qwen-QLoRA), and lightweight rankers (Codex/GPT-5.5) to be compared under a shared eight-field schema and a single auditable protocol.
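The scoring and packing step described above can be illustrated with a minimal sketch. The summary does not say how the four dimensions are combined or how cards are packed, so the equal-weight utility, the greedy utility-per-token packing, and every name here (`EvidenceUnit`, `utility`, `pack`) are assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceUnit:
    # Hypothetical record; the paper's eight-field schema is not reproduced here.
    unit_id: str
    tokens: int               # cost of including this unit (must be > 0)
    action_shift: float       # how much the unit changes the agent's chosen action
    outcome_uplift: float     # how much the unit improves the task outcome
    necessity: float          # how much is lost if the unit is removed
    negative_transfer: float  # risk that the unit misleads the agent


def utility(u: EvidenceUnit) -> float:
    # Assumption: equal-weight sum of the three benefits minus the risk term.
    return u.action_shift + u.outcome_uplift + u.necessity - u.negative_transfer


def pack(units: list[EvidenceUnit], budget: int) -> list[EvidenceUnit]:
    """Greedily select units by utility per token until the budget is spent."""
    ranked = sorted(units, key=lambda u: utility(u) / u.tokens, reverse=True)
    chosen: list[EvidenceUnit] = []
    used = 0
    for u in ranked:
        if utility(u) <= 0:
            continue  # never spend budget on units expected to hurt
        if used + u.tokens <= budget:
            chosen.append(u)
            used += u.tokens
    return chosen
```

Greedy packing is a simple heuristic for a knapsack-style problem and is not guaranteed to be optimal; it is used here only because it makes the budget constraint easy to read.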
Empirically, the paper reports that on 50 SWE-bench Verified file-retrieval instances, Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics at budget 120 show CICL reaching F1 0.620 on v1 and 0.425 on v3; removing the top-utility semantic v3 unit collapses F1 to 0.000, demonstrating action-criticality. Supplementary checks include Qwen-QLoRA agreement over 710 candidates, a 200-label real-code Opus-assisted signal, and a three-instance patch smoke test validating retrieval-to-patch plumbing. The authors candidly note that RepoBench-R summaries still outperform memory cards and that compact rankers do not yet replace the heuristic, framing CICL as a reproducible measurement and selection layer rather than a complete coding-agent solution.
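For readers unfamiliar with the retrieval metrics quoted above, the following sketch shows the standard definitions of hit@1 and MRR@k, plus a reranking helper. The function names, the `score_fn` callback, and the toy file names are hypothetical; the paper's evaluation code is not shown in this summary. On 50 instances, hit@1 values of 0.58 and 0.78 correspond to 29 and 39 instances with a correct top-ranked file, and 2,500 judgments is consistent with one judgment for each of 50 candidates per instance.

```python
from typing import Callable


def rerank(candidates: list[str], score_fn: Callable[[str], float]) -> list[str]:
    """Reorder candidates by a judge score, highest first (score_fn is a stand-in)."""
    return sorted(candidates, key=score_fn, reverse=True)


def hit_at_1(rankings: dict[str, list[str]], gold: dict[str, set[str]]) -> float:
    """Fraction of instances whose top-ranked item is a gold item."""
    hits = sum(1 for q, ranked in rankings.items() if ranked and ranked[0] in gold[q])
    return hits / len(rankings)


def mrr_at_k(rankings: dict[str, list[str]], gold: dict[str, set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first gold item within the top k (0 if absent)."""
    total = 0.0
    for q, ranked in rankings.items():
        for rank, item in enumerate(ranked[:k], start=1):
            if item in gold[q]:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Toy data, not from the paper.
rankings = {
    "q1": ["a.py", "b.py"],
    "q2": ["c.py", "d.py"],
    "q3": ["x.py", "y.py", "z.py"],
}
gold = {"q1": {"a.py"}, "q2": {"d.py"}, "q3": {"w.py"}}
print(hit_at_1(rankings, gold), mrr_at_k(rankings, gold))
```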
Key facts
- CICL is a decision-aware context layer that scores evidence units by action shift, outcome uplift, necessity, and negative-transfer risk.
- Evidence is packed into typed memory cards delivered to a budgeted agent via a shared eight-field schema.
- On 50 SWE-bench Verified file-retrieval instances, Qwen3.6-plus reranking of BM25 top-50 candidates raised hit@1 from 0.58 to 0.78.
- MRR@10 improved from 0.634 to 0.790, with all 2,500 judgments parseable.
- At budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3; removing the top-utility semantic v3 unit collapses F1 to 0.000.
- Supplementary checks include Qwen-QLoRA agreement over 710 candidates and a 200-label Opus-assisted real-code signal.
- RepoBench-R summaries still outperform memory cards, and compact rankers do not yet replace the heuristic.
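The ablation mentioned in the key facts (removing the top-utility unit and re-measuring F1) can be sketched as follows. This assumes F1 is computed over a set of predicted actions against a gold set, which the summary does not confirm; `run_agent`, `toy_agent`, and the unit and action names are hypothetical stand-ins rather than anything from the paper.

```python
from typing import Callable


def f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1; returns 0.0 when there is no overlap."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def ablate_top_unit(
    units: list[str],
    utilities: dict[str, float],
    run_agent: Callable[[list[str]], set[str]],
    gold: set[str],
) -> tuple[float, float]:
    """Return (F1 with full context, F1 with the top-utility unit removed)."""
    top = max(units, key=lambda u: utilities[u])
    full = f1(run_agent(units), gold)
    ablated = f1(run_agent([u for u in units if u != top]), gold)
    return full, ablated


def toy_agent(context: list[str]) -> set[str]:
    # Stand-in agent: it can only act if the critical unit is in its context.
    return {"edit:parser.py"} if "stack_trace" in context else set()


units = ["stack_trace", "readme", "changelog"]
utilities = {"stack_trace": 0.9, "readme": 0.2, "changelog": 0.1}
gold = {"edit:parser.py", "edit:lexer.py"}
full, ablated = ablate_top_unit(units, utilities, toy_agent, gold)
```

A large drop between the two values is the kind of signal the summary calls action-criticality: the agent's correct behavior depended on that single unit.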
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.