SecureClaw cuts LLM agent attack success to near-zero across three benchmarks
SecureClaw is a dual-boundary security architecture for tool-using LLM agents that achieves 0% attack success rate on Agent Security Bench while preserving task utility, by combining opaque handle-based plaintext confinement with a PREVIEW→COMMIT write authorization protocol.
Evaluated across AgentDojo, AgentLeak, and Agent Security Bench (ASB) in a common harness, SecureClaw is described as the first architecture to close both the plaintext-exposure and unauthorized-action boundaries simultaneously, rather than trading one surface for the other.
Yuhan Ma and Stefan Schmid introduce SecureClaw, a dual-boundary security framework targeting two failure modes that existing LLM agent defenses address only in isolation: unauthorized external actions and the exposure of sensitive plaintext within the agent runtime before any final output check can intervene. Prior defenses tend to secure either the planner/runtime layer or the action sink, but not both simultaneously, leaving a residual attack surface on whichever boundary is left unguarded.
SecureClaw's architecture places authorization at the effect sink and plaintext confinement at the read boundary. On the read side, sensitive data passes through a trusted gateway that substitutes raw values with opaque handles; in the evaluated deployment, bounded summaries serve as an explicit declassification interface, allowing the runtime to plan over symbolic references without ever directly dereferencing secrets. On the write side, any action that changes external state must go through a PREVIEW→COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy, preventing the runtime from unilaterally performing side effects.
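The read-side idea can be illustrated with a minimal Python sketch. It is not SecureClaw's implementation: the `Gateway` and `Handle` names, the summary length cap, and the token format are all assumptions made for illustration. The point it shows is that the untrusted runtime only ever receives a random token plus a bounded summary, while the plaintext stays inside the trusted component.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Handle:
    """Opaque reference given to the untrusted runtime in place of a secret."""
    token: str    # random identifier, carries no information about the value
    summary: str  # bounded, explicitly declassified description


class Gateway:
    """Hypothetical trusted gateway: the only component holding plaintext."""

    def __init__(self, max_summary_len: int = 32) -> None:
        self._vault: dict[str, str] = {}
        self._max_summary_len = max_summary_len

    def read(self, plaintext: str, summary: str) -> Handle:
        # Keep the raw value private; hand back a token and a capped summary.
        token = "h_" + secrets.token_hex(8)
        self._vault[token] = plaintext
        return Handle(token, summary[: self._max_summary_len])

    def resolve(self, handle: Handle) -> str:
        # Meant to be called only by trusted code (for example an executor).
        # Python cannot enforce that; a real system would need process or
        # privilege isolation between the gateway and the runtime.
        return self._vault[handle.token]


gateway = Gateway()
handle = gateway.read("4111-1111-1111-1111", "payment card on file")
# The runtime plans over `handle` and never sees the card number itself.
```

In this sketch the summary is supplied by the caller and merely truncated; how a real deployment derives and bounds its summaries is a policy decision the sketch does not model.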
Benchmarked across AgentDojo, AgentLeak, and Agent Security Bench (ASB) in a common evaluation harness, SecureClaw is described as the only defense evaluated that simultaneously retains usable task utility and achieves 0% attack success rate (ASR) on ASB, 0.64% ASR on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane — a metric that measures both final-output and internal-relay leakage.
Key facts
- 01 SecureClaw is authored by Yuhan Ma and Stefan Schmid and targets two distinct LLM agent security failures: unauthorized external actions and in-runtime plaintext exposure.
- 02 Existing defenses protect only one boundary (planner/runtime or action sink), not both simultaneously.
- 03 Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and bounded summaries, preventing direct secret dereferencing.
- 04 Writes that change external state follow a PREVIEW→COMMIT protocol; only a trusted executor may commit the authorized canonical request.
- 05 On Agent Security Bench (ASB), SecureClaw achieves 0% attack success rate (ASR).
- 06 On AgentDojo, SecureClaw achieves 0.64% ASR; on AgentLeak's attacked parity lane, 3.23% overall leak.
- 07 It is described as the only defense evaluated in a common harness that simultaneously retains usable task utility and achieves these low ASR results.
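The PREVIEW→COMMIT write path listed above can be sketched in Python as follows. This is an illustrative assumption, not the authors' code: the `Executor` class, the allow-list policy, the single-use tickets, and the SHA-256 digest over a sorted-key JSON encoding are all hypothetical choices standing in for whatever canonicalization and policy SecureClaw actually uses.

```python
import hashlib
import json
import secrets
from typing import Optional


def canonical_digest(request: dict) -> str:
    """Hash a deterministic encoding so equal requests give equal digests."""
    encoded = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


class Executor:
    """Hypothetical trusted executor; the runtime may only preview and commit."""

    def __init__(self, allowed_actions: set) -> None:
        self._allowed = allowed_actions
        self._pending: dict = {}  # ticket -> digest of the approved request
        self.effects: list = []   # stand-in for real external side effects

    def preview(self, request: dict) -> Optional[str]:
        # Policy check (here a simple allow-list) happens before any effect.
        if request.get("action") not in self._allowed:
            return None
        ticket = secrets.token_hex(8)
        self._pending[ticket] = canonical_digest(request)
        return ticket

    def commit(self, ticket: str, request: dict) -> bool:
        # Tickets are single-use: popping consumes them even on a mismatch.
        approved = self._pending.pop(ticket, None)
        if approved is None or approved != canonical_digest(request):
            return False  # unknown ticket, or request altered after preview
        self.effects.append(request)
        return True


executor = Executor({"send_email"})
request = {"action": "send_email", "to": "alice@example.com"}
ticket = executor.preview(request)
committed = executor.commit(ticket, request)
```

Binding the ticket to a digest of the canonical request is what stops the runtime from previewing a benign action and committing a different one; a request modified after preview no longer matches and is rejected.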
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 15, 2026 · 11:57 UTC.