Archive | Agentic Universe

Jun 17, 2026·Junyi Zhang, Jiaxin Ge, Hanjun Yoo·Research Papers·1 min read

RATs framework lets robots learn reusable skills through self-directed play

The RATs framework demonstrates that self-directed play — without any explicit task instructions — can build a reusable, transferable code skill library that improves both in-distribution and real-world robot task performance without retraining the underlying model.

Jun 17, 2026·Gregory Matsnev·Research Papers·1 min read

Prompt-based uncertainty decomposition boosts LLM agent clarification-seeking

The decomposition replaces impractical logprob- and training-based uncertainty methods with a prompt-only approach that works under real deployment constraints, enabling LLM agents to proactively seek clarification on ambiguous tasks rather than acting on underspecified instructions.

Jun 16, 2026·Li Gu, Zihuan Jiang, Linqiang Guo·Research Papers·1 min read

CLI agents beat GUI baselines on mobile tasks without phone-screen access

The results show that CLI-based mobile agents, without any mobile-specific training, already surpass GUI-based agents on established benchmarks while completing tasks in nearly half the steps, establishing CLI as a viable and more efficient paradigm for mobile automation research.

Jun 18, 2026·Hanwool Lee, Dasol Choi, Bokyeong Kim·Research Papers·1 min read

NRT-Bench stress-tests LLM agents in a simulated nuclear plant

NRT-Bench reveals that frontier LLM agents are vulnerable to adaptive multi-turn attacks even in safety-critical supervisory roles, and that model-specific, nearly non-overlapping failure modes mean aggregate robustness metrics can mask significant individual weaknesses.

Jun 18, 2026·Zewen Liu·Research Papers·1 min read

Evaluator bias spreads between LLM agents like a contagion, study finds

The paper provides a measurable, formal account of how evaluation bias spreads in multi-agent LLM pipelines and identifies a concrete structural intervention — expanding evaluator committee size — that reduces that spread by 72.4%.

Jun 17, 2026·HuggingFace Papers·Research Papers·1 min read

MAFP framework tackles stance entanglement in LLM multi-agent decision-making

MAFP extends LLM multi-agent systems beyond task decomposition into genuinely interdependent decision-making, a class of problems the paper shows existing frameworks fail to address.

Jun 18, 2026·Jianwen Sun, Chuanhao Li, Zizhen Li·Research Papers·1 min read

JAMER benchmark exposes AI's collapse on large game engine projects

The benchmark reveals that frontier AI models — including those augmented with Code Agents — effectively fail at large-scale game project engineering, with runtime pass rates collapsing to 5.7%, exposing architectural design as an unsolved bottleneck that compilation-focused improvements cannot address.

Jun 17, 2026·Dipankar Sarkar·Research Papers·1 min read

Grite substrate cuts multi-agent duplicate work from 78% to 0%

The paper surfaces a pre-PR coordination layer that existing PR-history analysis cannot see, and provides a concrete substrate and mining toolkit that reduce redundant multi-agent work from 78% to 0% — directly addressing why autonomous agents' PRs are accepted less often despite being produced faster.

Jun 17, 2026·Zhengxiong Luo, Mehtab Zafar, Dylan Wolff·Research Papers·1 min read

Code-Augur pairs LLM agents with fuzzing to expose hidden code vulnerabilities

By forcing LLM agents to commit their security assumptions as falsifiable assertions and immediately stress-testing them with a fuzzer, Code-Augur replaces opaque agent reasoning with a verifiable, self-correcting audit loop — directly addressing the missed-vulnerability risk the paper identifies as the central weakness of current agentic security analysis.

Jun 13, 2026·Kenneth Ge, Andre Assis·Research Papers·1 min read

AgentArmor framework targets destructive coding agent failure modes

The paper provides a concrete taxonomy of coding agent failure modes and a harness-level mitigation that is empirically validated, giving practitioners a structured basis for hardening agent deployments against real-world destructive failures.
