AgentBeats proposes agent-run benchmarking via A2A and MCP protocols
Researchers introduce Agentified Agent Assessment (AAA) and its implementation AgentBeats, a framework where judge agents evaluate subject agents through standardized A2A and MCP protocols, replacing fragmented LLM-centric benchmarking harnesses.
Score breakdown
AAA's single-interface design separates assessment logic from agent implementation, removing the heavy integration burden of existing LLM-centric harnesses and enabling reproducible, cross-agent comparisons that current fragmented benchmarks cannot support.
Xiaoyuan Liu, Jianhong Tu, and Yuqi Chen argue that agent evaluation is fundamentally broken: most benchmarks rely on fixed, LLM-centric harnesses that demand heavy integration, create a test-production mismatch, and prevent fair comparison across diverse agent designs. Their proposed remedy is Agentified Agent Assessment (AAA), in which judge agents carry out the evaluation and all participants, evaluators and subjects alike, communicate through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces (one for the benchmark, one for the agent); AAA collapses these into a single unified interface, which decouples assessment logic from agent implementation and enables reproducible, interoperable, multi-agent evaluation.
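The single-interface idea can be illustrated with a minimal Python sketch. This is not the AgentBeats code and does not use any A2A or MCP SDK: the names `Task`, `SubjectAgent`, `handle_task`, `JudgeAgent`, and the two toy agents are hypothetical, and the one-method interface is only a stand-in for an A2A-style task endpoint. It shows how a judge that holds all the assessment logic can score any subject that implements the same interface, without knowing how that subject is built.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Task:
    """A unit of work the judge hands to a subject (hypothetical shape)."""
    task_id: str
    prompt: str
    expected: str


class SubjectAgent(Protocol):
    """The one interface every subject exposes (stand-in for an A2A-style endpoint)."""

    def handle_task(self, task_id: str, prompt: str) -> str: ...


class EchoAgent:
    """Toy subject: returns the prompt unchanged."""

    def handle_task(self, task_id: str, prompt: str) -> str:
        return prompt


class UpperAgent:
    """Toy subject: returns the prompt upper-cased."""

    def handle_task(self, task_id: str, prompt: str) -> str:
        return prompt.upper()


class JudgeAgent:
    """Owns the assessment logic and knows nothing about subject internals."""

    def __init__(self, tasks: list[Task]) -> None:
        self.tasks = tasks

    def assess(self, subject: SubjectAgent) -> float:
        """Return the fraction of tasks the subject answers exactly."""
        if not self.tasks:
            return 0.0
        passed = sum(
            subject.handle_task(t.task_id, t.prompt) == t.expected
            for t in self.tasks
        )
        return passed / len(self.tasks)


# The same judge scores two differently built subjects through one interface.
tasks = [Task("t1", "hello", "HELLO"), Task("t2", "OK", "OK")]
judge = JudgeAgent(tasks)
subjects = {"echo": EchoAgent(), "upper": UpperAgent()}
scores = {name: judge.assess(agent) for name, agent in subjects.items()}
```

Because the judge depends only on the shared interface, adding a third subject requires no change to the assessment code, which is the decoupling the summary attributes to AAA.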
AgentBeats is the paper's concrete realization of AAA. The authors identify five practical operation modes designed to make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To validate the design at scale, they ran two studies. The first was a five-month open competition that attracted 298 judge agents across 12 categories and 467 subject agents from independent participants, which the authors present as evidence that AAA generalizes across a heterogeneous range of benchmarks. The second was a controlled case study on coding agents, which found that agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head comparisons and yielding new research insights about agent design. Together, the two studies support the authors' claim that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale.
Key facts
- 01 The paper proposes Agentified Agent Assessment (AAA), where judge agents perform evaluation via standardized A2A (task management) and MCP (tool access) protocols.
- 02 AAA uses a single unified interface, replacing the two separate interfaces (benchmark-side and agent-side) required by conventional benchmarking.
- 03 AgentBeats is the concrete implementation of AAA, offering five practical operation modes for real-world constraints on openness, privacy, and reproducibility.
- 04 A five-month open competition attracted 298 judge agents across 12 categories and 467 subject agents from independent participants.
- 05 A coding-agent case study confirmed agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results.
- 06 Authors are Xiaoyuan Liu, Jianhong Tu, and Yuqi Chen; the paper was published on arXiv on 2026-06-11.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.