Automated prompt injection attacks studied in agentic LLM settings
A paper by David Hofer, Edoardo Debenedetti, and Florian Tramèr evaluates automated prompt injection attacks against LLM agents. It finds that black-box methods outperform gradient-based ones, but that attacks optimized on smaller models fail to transfer to frontier models like GPT-5.
Score breakdown
The study establishes automated prompt injection as a credible but model-dependent threat to LLM agents. It also identifies significant barriers to model-agnostic exploitation, particularly the failure of attacks optimized on smaller models to transfer to frontier models.
David Hofer, Edoardo Debenedetti, and Florian Tramèr conduct a comprehensive empirical evaluation of automated prompt injection attacks in realistic agentic environments. Such methods have already proven effective for jailbreaking, but they had remained underexplored for LLM agents that interact with untrusted external data. Using the AgentDojo framework, the authors adapt both a white-box method (GCG, Greedy Coordinate Gradient) and a black-box method (TAP, Tree of Attacks with Pruning) to the agentic setting and evaluate them across 80 task pairs spanning four domains and multiple models.
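To make the evaluation setup concrete, here is a minimal sketch of how an attack success rate could be aggregated per domain over task pairs. It is an illustration only, not the paper's or AgentDojo's code: the `PairResult` record, its fields, and the `attack_success_rate` helper are hypothetical names, and the domain labels are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome of one (user task, injection task) pair. All fields are hypothetical."""
    domain: str
    user_task: str
    injection_task: str
    attack_succeeded: bool  # True if the agent carried out the injected goal


def attack_success_rate(results):
    """Return {domain: fraction of successful pairs}, plus an 'overall' entry."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for r in results:
        # Count each result once for its own domain and once for the overall rate.
        for key in (r.domain, "overall"):
            totals[key] += 1
            wins[key] += int(r.attack_succeeded)
    return {key: wins[key] / totals[key] for key in totals}


# Placeholder data, not results from the paper.
example_results = [
    PairResult("domain_a", "user_1", "inj_1", True),
    PairResult("domain_a", "user_2", "inj_1", False),
    PairResult("domain_b", "user_1", "inj_1", True),
]
example_rates = attack_success_rate(example_results)
```

Reporting a per-domain rate next to the overall one makes it easier to see whether an attack works broadly or only in one domain.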
The study finds that black-box optimization substantially outperforms gradient-based methods, with the gap attributed to GCG's optimization instability under reasonable compute budgets. TAP's effectiveness is shown to depend heavily on the attacker model: stronger models produce more effective injections, while safety-tuned attackers can refuse to generate adversarial prompts entirely. Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains, but attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5. The authors conclude that automated prompt injection represents a credible but model-dependent threat, with significant barriers remaining for model-agnostic exploitation.
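The black-box approach can be pictured as a tree search that repeatedly proposes refinements of a candidate, scores them, and keeps only the best few. The sketch below shows that generic loop under stated assumptions. It is not the authors' implementation: `tree_search_with_pruning`, `propose`, and `score` are hypothetical names, and the toy functions stand in for the attacker model and the success judge that a real system would query.

```python
from typing import Callable, List, Tuple


def tree_search_with_pruning(
    seed: str,
    propose: Callable[[str], List[str]],  # stand-in for the attacker model
    score: Callable[[str], float],        # stand-in for a success judge
    width: int = 3,
    depth: int = 4,
    success_threshold: float = 1.0,
) -> Tuple[str, float]:
    """Return the best (candidate, score) found by a width-limited tree search."""
    best = (seed, score(seed))
    frontier = [seed]
    for _ in range(depth):
        # Expand every surviving candidate into refinements.
        children = [c for parent in frontier for c in propose(parent)]
        if not children:
            break  # e.g. the proposer declined to produce anything
        # Prune: keep only the `width` highest-scoring children.
        kept = sorted(((score(c), c) for c in children), reverse=True)[:width]
        if kept[0][0] > best[1]:
            best = (kept[0][1], kept[0][0])
        if best[1] >= success_threshold:
            break
        frontier = [c for _, c in kept]
    return best


# Toy stand-ins: the "goal" is simply to spell a fixed harmless string.
TARGET = "abc"


def toy_propose(candidate: str) -> List[str]:
    return [candidate + ch for ch in "abc"]


def toy_score(candidate: str) -> float:
    matches = sum(1 for a, b in zip(candidate, TARGET) if a == b)
    return matches / len(TARGET) - 0.01 * max(0, len(candidate) - len(TARGET))


toy_result = tree_search_with_pruning("", toy_propose, toy_score)
refusal_result = tree_search_with_pruning("seed", lambda _: [], toy_score)
```

The loop also illustrates why the attacker model matters: the search can only be as good as the candidates `propose` returns, and a proposer that returns nothing (as a safety-tuned attacker might when it refuses) ends the search at the seed.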
Key facts
- 01 The paper evaluates automated prompt injection attacks against LLM agents using the AgentDojo framework.
- 02 Both white-box (GCG) and black-box (TAP) attack methods are adapted to the agentic setting.
- 03 Evaluation covers 80 task pairs spanning four domains and multiple models.
- 04 Black-box optimization substantially outperforms gradient-based (GCG) methods, attributed to GCG's optimization instability under reasonable compute budgets.
- 05 TAP's effectiveness depends on the attacker model: stronger models produce more effective injections, while safety-tuned attackers can refuse to generate adversarial prompts.
- 06 Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains.
- 07 Attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 10, 2026 · 15:34 UTC.