Summary
To understand why prompt injection happens, we need a way to measure what role an LLM internally thinks each token belongs to.
We developed role probes. In summary: these let us take any token, and score how strongly the LLM internally "thinks" it's in any set of role tags. We call these scores CoTness (how much the LLM thinks a token is in think tags), Userness (how much it thinks a token is in user tags), and so on.
[...]
To a human reader, these two versions say the same thing. But to the LLM, the difference is enormous: destyling causes average attack success in our dataset to plunge from 61% to 10%. A change nearly invisible to humans completely changes the LLM's role perception.
In fact, the more the LLM internally "thinks" the injection is its genuine reasoning, the more successful the attack. CoTness, measured from the input alone, predicts whether the attack will succeed:
[...]
Role confusion isn't just limited to adversarial settings. Claude, for example, has a known pattern of generating assistant text that sounds like user commands, then treating those commands as real user instructions in subsequent turns ([1] [2] [3] [4]). This is especially dangerous for agents, because the user role is the authorization channel where humans grant permission for consequential actions. Role confusion can even allow LLMs to manufacture their own approval, cutting the human out of the loop.
Roles were designed to be discrete, architectural boundaries, imposed on an otherwise undifferentiated string. We've built a lot on top of them, including key cognitive boundaries like self-vs-other, thought-vs-communication, data-vs-instruction. Yet internally, these aren't hard boundaries but soft inferences, reconstructed from a combination of other surface features. The intended boundary and the learned boundary are different things, and this is what enables prompt injection.
[...]
The general principle is that roles isolate competing objectives so they can be optimized independently24.