ProvenanceGuard catches cross-source attribution errors in MCP agents
ProvenanceGuard is a source-aware factuality verifier for MCP-based LLM agents that detects "cross-source conflation" — where a claim is supported by evidence but attributed to the wrong source — achieving block F1 of 0.802 and source accuracy of 0.858 on a held-out medical-domain benchmark.
Score breakdown
The paper demonstrates that source attribution is an independent axis of factuality verification — meaning standard source-blind metrics can pass answers that contain incorrect attributions, a gap ProvenanceGuard is designed to close in MCP-based agents.
Alvarez, Rajan, and Mugel identify a gap in standard factuality evaluation for tool-using LLM agents: existing metrics test whether an answer is supported by pooled evidence, but do not verify that each claim is attributed to the correct source. The paper names this failure mode "cross-source conflation" and argues it is an independent axis of factuality verification, particularly consequential in multi-source settings such as medical agents that draw on search, APIs, databases, clinical records, and formulary tools via the Model Context Protocol (MCP).
ProvenanceGuard addresses this by consuming captured MCP traces that carry stable tool IDs, source IDs, and raw outputs. It decomposes answers into atomic claims, routes each claim to source-specific evidence, checks factual support using Natural Language Inference (NLI) and a token-alignment proxy, and compares stated attribution against the routed source. The system returns per-claim verdicts and an answer-level allow/block decision; blocked answers can be repaired through retrieval-augmented answer revision and then re-verified.
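The verification flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Claim` class, the function names, the 0.6 threshold, and the toy trace are all hypothetical, claims are assumed to be already decomposed, and a crude token-overlap score stands in for both the NLI model and the token-alignment proxy.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str          # one atomic claim taken from the answer
    cited_source: str  # source ID the answer attributes the claim to


def token_overlap(claim: str, evidence: str) -> float:
    """Fraction of claim tokens found in the evidence (stand-in for NLI)."""
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)


def verify_claim(claim: Claim, trace: dict[str, str], threshold: float = 0.6) -> dict:
    """Route the claim to its best-matching source, then compare attribution.

    `trace` maps source IDs to raw tool outputs and is assumed non-empty.
    """
    scores = {sid: token_overlap(claim.text, out) for sid, out in trace.items()}
    routed = max(scores, key=scores.get)
    if scores[routed] < threshold:
        verdict = "unsupported"    # no source supports the claim
    elif routed != claim.cited_source:
        verdict = "wrong_source"   # supported, but attributed to another source
    else:
        verdict = "ok"
    return {"claim": claim.text, "routed_source": routed, "verdict": verdict}


def verify_answer(claims: list[Claim], trace: dict[str, str]) -> tuple[str, list[dict]]:
    """Return an answer-level allow/block decision plus per-claim verdicts."""
    verdicts = [verify_claim(c, trace) for c in claims]
    decision = "allow" if all(v["verdict"] == "ok" for v in verdicts) else "block"
    return decision, verdicts


# Toy trace with two sources (invented data, for illustration only).
trace = {
    "formulary": "metformin maximum daily dose is 2000 mg",
    "clinical_record": "patient reports mild nausea after starting metformin",
}
claims = [
    # Supported by the formulary output but attributed to the clinical record.
    Claim("metformin maximum daily dose is 2000 mg", cited_source="clinical_record"),
    Claim("patient reports mild nausea", cited_source="clinical_record"),
]
decision, verdicts = verify_answer(claims, trace)
print(decision, [v["verdict"] for v in verdicts])
```

The first claim is fully supported by the pooled evidence, so a source-blind check would pass it; only the comparison between the routed source and the cited source exposes the conflation.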
The evaluation uses 281 medical-domain MCP-agent traces, with a 266-trace adjudicated subset yielding 2,325 LLM-assisted claim labels and 361 human-verified held-out labels. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark, block F1 rises to 0.846, though source-plus-relation accuracy drops to 0.229, indicating that exact source ownership remains difficult when sources are semantically close. The repair-and-reverify pipeline resolves all blocked answers in the full trace set, and in 50 controlled clinical conflation probes the system detects all injected attribution swaps with no retained wrong attribution.
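For readers unfamiliar with the two headline metrics, the sketch below shows one plausible way to compute them. The definitions are assumptions, since the summary does not spell them out: block F1 is taken as F1 with "block" as the positive class over answer-level decisions, and source accuracy as the share of source-eligible claims whose predicted source ID matches the label. Function names and the toy labels are hypothetical and are not data from the paper.

```python
def block_f1(gold: list[bool], pred: list[bool]) -> float:
    """F1 with 'block' (True) as the positive class over answer-level decisions."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def source_accuracy(gold_sources: list, pred_sources: list) -> float:
    """Share of source-eligible claims with a correctly predicted source ID.

    A gold label of None marks a claim as not source-eligible (an assumption).
    """
    pairs = [(g, p) for g, p in zip(gold_sources, pred_sources) if g is not None]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


# Invented labels: four answers, four claims.
gold_block = [True, True, False, True]
pred_block = [True, False, False, True]
gold_src = ["formulary", "search", None, "records"]
pred_src = ["formulary", "records", "formulary", "records"]

f1 = block_f1(gold_block, pred_block)
acc = source_accuracy(gold_src, pred_src)
print(round(f1, 3), round(acc, 3))
```

Under these definitions, a verifier can score well on block F1 while still assigning many claims to the wrong source, which is consistent with the gap between 0.846 and 0.229 reported on the harder benchmark.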
Key facts
- 01 ProvenanceGuard targets 'cross-source conflation': a claim supported by evidence but attributed to the wrong source in MCP-based LLM agents.
- 02 The system decomposes answers into atomic claims, routes them to source-specific evidence, and checks support via NLI and a token-alignment proxy.
- 03 Evaluation covers 281 medical-domain MCP-agent traces; 361 held-out claim labels are human-verified.
- 04 On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims.
- 05 On a harder multi-source benchmark, block F1 reaches 0.846, but source-plus-relation accuracy drops to 0.229 with semantically close sources.
- 06 A repair-and-reverify mechanism resolves all blocked answers in the full trace set.
- 07 In 50 controlled clinical conflation probes, the system detects all injected attribution swaps with no retained wrong attribution.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.