Agentic Universe

A calmer way to keep up with the agentic stack. Every story links back to its source.

Archive·258 stories·Jun 2026 – Jun 2026·Updated 02:18 UTC

Archive

Every processed story in chronological order, with the newest coverage first. Filter by tag, source, or score to drill in.

category:Research Papers

HalBench v2.3 tests 29 OSS models on sycophancy resistance

HalBench v2.3 shows that sycophancy resistance is largely decoupled from model size and architecture, with a ~27B model outperforming models up to 402B and several closed frontier models on false-premise pushback.

Read at source ↗

5.9

Jun 15, 2026·u/the_daily_cal·Research Papers·1 min read

UC Berkeley's Agents' Last Exam stumps top AI models, with GPT-5.5 topping out at 24%

ALE's sub-25% pass rates across all leading models reveal a substantial gap between current AI capabilities and reliable real-world task performance across professional domains.

Read at source ↗

5.4

Jun 15, 2026·Tianyi Lin, Chuanyu Sun, Jingyi Zhang·Research Papers·1 min read

CSTS framework builds reusable skill trees to boost LLM agent performance

CSTS addresses a core bottleneck in agentic LLM development by replacing manual skill engineering with an automated, multi-model collective process that explicitly tests whether skills transfer across models — a property the paper identifies as critical for robust generalization.

Read at source ↗

5.8

Jun 15, 2026·Lai Jiang, Cheng Qian, Zhenhailong Wang·Research Papers·1 min read

ACCORD framework boosts LLM agent task completion by up to 20.6 points

ACCORD demonstrates that a training-free grounding layer can close a substantial portion of the task-completion gap in LLM agents across both digital and embodied benchmarks, without modifying the underlying model.

Read at source ↗

W24 · 5 stories · Jun 8–14

6.4
Jun 14, 2026·Xuanle Zhao, Qiushi Sun, Jingyu Xiao·Research Papers·1 min read
Survey maps multimodal code intelligence beyond text-to-code
The survey provides the first structured taxonomy of Multimodal Code Intelligence, connecting mature code-generation benchmarks to emerging agentic settings and identifying verification gaps that current text-to-code evaluations do not address.
Read at source ↗
6.3
Jun 14, 2026·Ismail Hossain, Sai Puppala, Md Jahangir Alam·Research Papers·1 min read
SkillVetBench uses LLM-as-Judge to catch agent skill threats static scanners miss
Existing code-layer scanners miss between 89% and 100% of instruction-layer threats like Prompt Injection and Memory Poisoning in LLM agent skills, and SkillVetBench's LLM-as-Judge approach closes that gap with zero false negatives across 78 confirmed-malicious skills in benchmark testing.
Read at source ↗
6.7
Jun 14, 2026·Shreshth Rajan·Research Papers·1 min read
Study finds 28.5% of SWE-bench tasks hackable by incorrect patches
Systematic reward hackability at this scale means frontier models trained or evaluated on SWE-bench Verified and R2E-Gym may be earning inflated Pass@1 scores on a measurable fraction of tasks, undermining the reliability of these benchmarks as signals of true coding ability.
Read at source ↗
7.4
Jun 13, 2026·Hongtao Lyu, Dingyan Zhang, Mingyu Wu·Research Papers·1 min read
CoAgent brings serializable concurrency control to multi-agent LLM systems
CoAgent replaces the abort-and-retry waste of OCC and the blocking delays of 2PL with an advisory protocol that lets LLM agents self-repair conflicts, achieving serializable correctness while preserving meaningful concurrency gains that classical mechanisms cannot sustain.
Read at source ↗
5.9
Jun 12, 2026·Zhongyuan Wang, Pratyusha Vemuri·Research Papers·1 min read
LLM agents defer blindly to GNN tools, and stronger models defer more
The findings show that agent+tool evaluations cannot assume the agent adds judgment on top of the tool — and that the gap between parrot behavior and optimal action widens, not shrinks, as LLM capability scales.
Read at source ↗

Page 5 of 26·Showing 41–50 of 258
