★ Rank 09 today · Jun 11, 2026 · 1 min read · Research Papers

AgentBeats proposes agent-run benchmarking via A2A and MCP protocols

Researchers introduce Agentified Agent Assessment (AAA) and its implementation AgentBeats, a framework where judge agents evaluate subject agents through standardized A2A and MCP protocols, replacing fragmented LLM-centric benchmarking harnesses.

ArXiv·Xiaoyuan Liu, Jianhong Tu, Yuqi Chen

Composite · rank 09: 7.0 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

AAA's single-interface design separates assessment logic from agent implementation, removing the heavy integration burden of existing LLM-centric harnesses and enabling reproducible, cross-agent comparisons that current fragmented benchmarks cannot support.

Summary: our read of the original

Xiaoyuan Liu, Jianhong Tu, and Yuqi Chen argue that agent evaluation is fundamentally broken: most benchmarks rely on fixed, LLM-centric harnesses that demand heavy integration, create a mismatch between test and production setups, and prevent fair comparison across diverse agent designs. Their proposed remedy is Agentified Agent Assessment (AAA), in which judge agents carry out the evaluation and all participants, evaluators and subjects alike, communicate through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces (one for the benchmark, one for the agent); AAA collapses these into a single unified interface, decoupling assessment logic from agent implementation and enabling reproducible, interoperable, multi-agent evaluation.
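
To make the single-interface idea concrete, here is a minimal Python sketch. It does not use the real A2A or MCP SDKs or the AgentBeats API: every name in it (`TaskMessage`, `TaskResult`, `SubjectAgent`, `JudgeAgent`, `EchoAgent`, `UpperAgent`) is hypothetical, a plain method call stands in for an A2A-style task exchange, and MCP tool access is not modeled at all. It only illustrates the general shape, in which the judge depends on one message interface and nothing else.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TaskMessage:
    """Hypothetical stand-in for an A2A-style task request."""
    task_id: str
    prompt: str


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical stand-in for an A2A-style task response."""
    task_id: str
    output: str


class SubjectAgent(Protocol):
    """The one interface every assessed agent exposes (an assumption)."""
    name: str

    def handle(self, message: TaskMessage) -> TaskResult: ...


class EchoAgent:
    """Toy subject agent: returns the prompt unchanged."""
    name = "echo"

    def handle(self, message: TaskMessage) -> TaskResult:
        return TaskResult(message.task_id, message.prompt)


class UpperAgent:
    """Toy subject agent: returns the prompt in upper case."""
    name = "upper"

    def handle(self, message: TaskMessage) -> TaskResult:
        return TaskResult(message.task_id, message.prompt.upper())


class JudgeAgent:
    """Holds the assessment logic; knows nothing about agent internals."""

    def __init__(self, tasks: dict[str, tuple[str, str]]) -> None:
        # Maps task_id -> (prompt, expected output).
        self.tasks = tasks

    def assess(self, subject: SubjectAgent) -> float:
        """Send every task to the subject and return its pass rate."""
        passed = 0
        for task_id, (prompt, expected) in self.tasks.items():
            result = subject.handle(TaskMessage(task_id, prompt))
            passed += result.output == expected
        return passed / len(self.tasks)


# The same judge scores two unrelated agents without a per-agent harness.
judge = JudgeAgent({"t1": ("abc", "ABC"), "t2": ("XYZ", "XYZ")})
scores = {agent.name: judge.assess(agent) for agent in (EchoAgent(), UpperAgent())}
```

Because both toy agents are scored by the same judge on the same tasks, their pass rates are directly comparable, which is the kind of cross-agent comparison the summary attributes to AAA.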

AgentBeats is the paper's concrete realization of AAA. The authors identify five practical operation modes designed to make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To validate the design at scale, they ran two studies: a five-month open competition that attracted 298 judge agents across 12 categories and 467 subject agents from independent participants, demonstrating that AAA generalizes across a heterogeneous range of benchmarks; and a controlled case study on coding agents, which confirmed that agentified evaluation preserves fidelity with the public record while also surfacing previously missing head-to-head comparisons and yielding new research insights about agent design. Together, the community-scale field study and the controlled case study support the authors' claim that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale.

Key facts

01 The paper proposes Agentified Agent Assessment (AAA), where judge agents perform evaluation via standardized A2A (task management) and MCP (tool access) protocols.
02 AAA uses a single unified interface, replacing the two separate interfaces (benchmark-side and agent-side) required by conventional benchmarking.
03 AgentBeats is the concrete implementation of AAA, offering five practical operation modes for real-world constraints on openness, privacy, and reproducibility.
04 A five-month open competition attracted 298 judge agents across 12 categories and 467 subject agents from independent participants.
05 A coding-agent case study confirmed agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results.
06 Authors are Xiaoyuan Liu, Jianhong Tu, and Yuqi Chen; the paper was published on ArXiv on 2026-06-11.

Topics

#benchmarks #agent-framework #a2a #mcp #reproducibility

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.
