Frontier LLMs can detect when their outputs have been prefilled or edited
A new arXiv paper by Andy Wang, Parv Mahajan, and David Demitri Africa finds that frontier language models can distinguish between genuine and tampered assistant-side context, a capability the authors call "prefill awareness". The finding threatens the validity of safety research methods that rely on prefilling.
Score breakdown
Prefill awareness means frontier models can silently disregard inserted or edited assistant turns and revert to their baseline behavior. This undermines the validity of safety research methods that depend on prefilling to steer model behavior, including alignment evaluations, jailbreaking studies, and AI control protocols.
Andy Wang, Parv Mahajan, and David Demitri Africa investigate whether frontier language models can tell the difference between authentic and tampered assistant-side context — a capability they term "prefill awareness." To measure this, they construct a binary preference benchmark spanning three prefill mechanisms, filtering for cases where models display consistent stances. Their key finding is that Claude Opus 4.5 detects prefills that oppose its preferences in 9–35% of cases with a 0% false positive rate when explicitly prompted to evaluate its context.
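To make the headline metric concrete, here is a minimal Python sketch of how a detection rate and a false positive rate could be computed from labelled trials. The `Trial` record, its field names, and the sample data are hypothetical illustrations, not the paper's code or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical record layout, assumed for illustration only.
    prefilled: bool  # True if the assistant turn was inserted or edited
    flagged: bool    # True if the model said the turn was not its own


def detection_and_fpr(trials):
    """Return (detection rate, false positive rate) for a list of trials."""
    tampered = [t for t in trials if t.prefilled]
    genuine = [t for t in trials if not t.prefilled]
    # Detection rate: share of tampered turns the model flagged.
    detection = (
        sum(t.flagged for t in tampered) / len(tampered)
        if tampered else float("nan")
    )
    # False positive rate: share of genuine turns the model wrongly flagged.
    fpr = (
        sum(t.flagged for t in genuine) / len(genuine)
        if genuine else float("nan")
    )
    return detection, fpr


# Made-up data: 4 tampered trials (1 flagged), 4 genuine trials (none flagged).
sample = [Trial(True, True)] + [Trial(True, False)] * 3 + [Trial(False, False)] * 4
rate, fpr = detection_and_fpr(sample)
print(f"detection={rate:.2f} fpr={fpr:.2f}")  # detection=0.25 fpr=0.00
```

A 0% false positive rate in this framing means no genuine assistant turn was flagged as foreign, which is why the two rates are computed over separate subsets of trials.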
Controlled ablations reveal that detection and resistance are driven by distinct cues: stylistic mismatch primarily determines whether a model flags a prefill as foreign, while preference mismatch primarily determines whether the model reverts toward its baseline answer. Notably, models frequently revert to baseline behavior without ever explicitly reporting that the prefill was foreign, meaning the effect can be invisible to researchers relying on overt model statements.
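The distinction between flagging and reverting can be illustrated by tabulating the two outcomes separately for each ablation condition. The sketch below assumes a hypothetical tuple layout of (style mismatch, preference mismatch, flagged, reverted) and uses made-up records; it does not reproduce the paper's data or analysis code.

```python
from collections import defaultdict


def rates_by_condition(records):
    """Average flagging and reversion per (style_mismatch, preference_mismatch) cell."""
    counts = defaultdict(lambda: [0, 0, 0])  # [n, flagged, reverted]
    for style, pref, flagged, reverted in records:
        cell = counts[(style, pref)]
        cell[0] += 1
        cell[1] += int(flagged)
        cell[2] += int(reverted)
    return {
        key: {"flag_rate": f / n, "revert_rate": r / n}
        for key, (n, f, r) in counts.items()
    }


# Made-up records, chosen only to show the two outcomes moving independently.
records = [
    (True, False, True, False),   # style mismatch only: flagged, not reverted
    (True, False, False, False),  # style mismatch only: neither
    (False, True, False, True),   # preference mismatch only: reverted silently
    (False, True, False, True),   # preference mismatch only: reverted silently
]
table = rates_by_condition(records)
for condition, rates in sorted(table.items()):
    print(condition, rates)
```

Keeping `flag_rate` and `revert_rate` as separate columns matters here: a model that reverts without flagging would look unaffected if only explicit reports were counted.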
The authors also examine more realistic agentic settings — including misalignment-continuation evaluations and SWE-bench trajectories — where frontier models sometimes disavow prefilled assistant turns in ways that vary strongly by dataset, task success, and hidden formatting artifacts. The paper concludes that prefill awareness is already a substantial confound for some prefill-based methods and recommends that model developers actively track this capability in frontier systems.
Key facts
- 01The paper introduces 'prefill awareness' — a model's ability to distinguish tampered from untampered assistant-side context.
- 02Claude Opus 4.5 detects preference-opposing prefills in 9–35% of cases with a 0% false positive rate when prompted.
- 03Models often revert toward baseline behavior without explicitly reporting that a prefill was foreign.
- 04Controlled ablations show detection and resistance rely on different cues: stylistic mismatch affects flagging, preference mismatch affects behavioral reversion.
- 05The benchmark spans three prefill mechanisms and filters for cases where models show consistent stances.
- 06Agentic settings tested include misalignment-continuation evaluations and SWE-bench trajectories.
- 07The authors recommend that model developers track prefill awareness as a capability in frontier systems.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.