Browser-hardened frontier CUAs remain vulnerable in coding domain
A new 793-episode benchmark finds that Claude Sonnet 4.6 and GPT-5.4 resist all hand-crafted browser prompt-injection attacks, yet the same models fall to skill-injection attacks in a coding-agent benchmark at up to 100% success rate.
Score breakdown
The paper demonstrates that frontier CUA safety is domain-conditioned rather than general, meaning strong browser-surface defenses in Claude Sonnet 4.6 and GPT-5.4 do not extend to coding-agent contexts, and that published ASR benchmarks are unreproducible without the release of RL-optimized injection strings.
- 01The paper releases CUA-HandCrafted, a public benchmark of 793 episodes covering 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations.
- 02Against Claude Sonnet 4.6 and GPT-5.4, the benchmark recorded 0/140 multi-step attack successes (Clopper-Pearson 95% upper bound: 2.60%).
- 03A prompt ablation shows that browser-attack resistance lives in the model weights, not in system-prompt configurations.
Nicholas Saban's paper challenges the reliability of recently published red-teaming results for computer-using agents (CUAs), which have reported prompt-injection attack success rates (ASR) of 42–98%. The paper argues these headline numbers are inflated because they cluster on retired models and on the most-vulnerable model in each paper's panel. To test whether those techniques still work against current frontier CUAs using only hand-crafted templates, the paper releases CUA-HandCrafted — a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations.
A prompt ablation study shows this resistance is embedded in the model weights themselves rather than in system-prompt defenses.
Against Claude Sonnet 4.6 and GPT-5.4, the benchmark recorded 0/140 multi-step attack successes, with a Clopper-Pearson 95% upper bound of 2.60%. A prompt ablation study shows this resistance is embedded in the model weights themselves rather than in system-prompt defenses. However, this safety hardening does not generalize across domains: on a sister coding-agent benchmark called SkillBench, the same model weights are vulnerable to hand-crafted skill-injection attacks at success rates reaching up to 100%.
The paper concludes that the literature's high ASRs are largely attributable to RL-optimized injection text rather than the attack categories themselves, and that frontier safety hardening is domain-conditioned — specific to the browser surface, which has been heavily targeted and hardened. The paper also raises a reproducibility concern: when researchers report attack techniques without releasing the optimized injection strings, or extrapolate browser-domain safety findings to other CUA modalities, the published ASR numbers become unreproducible.
Key facts
- 01The paper releases CUA-HandCrafted, a public benchmark of 793 episodes covering 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations.
- 02Against Claude Sonnet 4.6 and GPT-5.4, the benchmark recorded 0/140 multi-step attack successes (Clopper-Pearson 95% upper bound: 2.60%).
- 03A prompt ablation shows that browser-attack resistance lives in the model weights, not in system-prompt configurations.
- 04On the coding-agent benchmark SkillBench, the same model weights fall to hand-crafted skill-injection attacks at up to 100% ASR.
- 05Recent red-teaming literature reports prompt-injection ASRs of 42–98%, but these numbers cluster on retired models and the most-vulnerable model in each paper's panel.
- 06The paper argues high published ASRs are largely attributable to RL-optimized injection text rather than the attack categories themselves.
- 07Frontier safety hardening is described as domain-conditioned — specific to the heavily-targeted browser surface and not generalizing to other CUA modalities.
Topics
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC. How this works →