Archive | Agentic Universe

Jun 13, 2026·Yixuan Wang, Yiyang Zhou, Yiming Liang·Research Papers·1 min read

ASSAY framework boosts LLM agents by matching skills to tasks at inference time

ASSAY demonstrates that matching skills to tasks at inference time — rather than global library curation — is the key bottleneck for experience-based agent improvement, achieving state-of-the-art results on two benchmarks without any weight updates.

Jun 14, 2026·Zijian Carl Ma, Sean J. Wang, Sijbren Kramer·Research Papers·1 min read

DeepRoot cuts hallucinations from 87% to 7–10% in historical drug discovery

DeepRoot is the first system to simultaneously achieve low hallucination rates (7–10%) and high reasoning coherence on historical medical text, demonstrating a viable path for converting pre-ontological archives into verifiable drug-discovery leads at scale.

Jun 14, 2026·Linghua Zhang, Jun Wang, Jingtong Wu·Research Papers·1 min read

RetailBench tests LLM agents on 180-day supermarket management

RetailBench shows that current LLMs cannot sustain coherent long-horizon decision-making in economically grounded environments: most models fail to complete the 180-day simulation, and all fall substantially short of an oracle policy on net worth and sales.

Jun 14, 2026·Wasi Uddin Ahmad, Nikolai Ludwig, Somshubra Majumdar·Research Papers·1 min read

Open-SWE-Traces dataset ships 207K multilingual agent trajectories

Open-SWE-Traces provides a large-scale, permissively licensed, multilingual trajectory dataset for fine-tuning open-source LLMs on autonomous software engineering, directly addressing the data scarcity that the paper identifies as the primary bottleneck to that goal.

Jun 12, 2026·Kirill Vasilevski, Ximing Dong, Benjamin Rombaut·Research Papers·1 min read

Agentic judging pipeline boosts code LLM architectural reasoning by up to 540%

The pipeline replaces prohibitively expensive manual architectural labeling with a scalable agentic approach, enabling fine-tuned models to achieve dramatically higher SWE-bench Verified resolved rates than either the base model or unfiltered fine-tuning.

Jun 13, 2026·Daksh Mittal, Tommaso Castellani, Thomson Yen·Research Papers·1 min read

LatentGym benchmarks how LLM agents learn across related tasks

LatentGym fills a gap left by existing frameworks by providing the first controllable latent structure and disentangled exploration/exploitation metrics for measuring cross-task experiential learning in LLM agents.

Jun 15, 2026·Zhihan Zhang, Alexander Le Metzger, Jiuyang Lyu·Research Papers·1 min read

LLM agent beats human experts at MCU model optimization via hardware-in-the-loop feedback

The work shows that real hardware feedback is the critical missing ingredient for LLM agents to autonomously replace expert-driven MCU optimization, turning a previously manual, multidimensional process into a closed-loop pipeline that outperforms human experts within seven iterations.

Jun 15, 2026·Triveni Morla, Rohith Reddy Bellibaltu, Manpreet Singh·Research Papers·1 min read

AgentFairBench measures demographic bias in LLM agent actions

AgentFairBench shifts fairness measurement from LLM text outputs to agent decisions in consequential domains, and its arity-matched null methodology corrects a ~2.4× overstatement of disparity that prior comparison approaches produce.

Jun 15, 2026·Peiyang Xu, Bangzheng Li, Sijia Liu·Research Papers·1 min read

ContextRL boosts LLM grounding via contrastive context selection

The contrastive context-selection objective demonstrably outperforms simply adding more contrastive data, showing that how the training signal is structured — not just what data is used — drives grounding improvements in both agentic and multimodal LLM settings.

Jun 15, 2026·Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu·Research Papers·1 min read

StateGen eliminates tool-call hallucinations by construction in synthetic training data, with a reported score of 9.66/10

StateGen's backend-is-truth invariant eliminates tool-call hallucinations by construction, a problem the paper identifies as the dominant failure class in tool-augmented LLM training data. It also combines capabilities (multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring) that no single publicly available platform currently offers together.
