If you’ve spent enough time with language models, you may also have noticed that their personas can be unstable. Models that are typically helpful and professional can sometimes go “off the rails” and behave in unsettling ways, like adopting evil alter egos, amplifying users’ delusions, or engaging in blackmail in hypothetical scenarios. In situations like these, could it be that the Assistant has wandered off stage and some other character has taken its place?
[...]
We find that Assistant-like behavior is linked to a pattern of neural activity that corresponds to one particular direction in this space—the “Assistant Axis”—that is closely associated with helpful, professional human archetypes. By monitoring models’ activity along this axis, we can detect when they begin to drift away from the Assistant and toward another character. And by constraining their neural activity (“activation capping”) to prevent this drift, we can stabilize model behavior in situations that would otherwise lead to harmful outputs.
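To make this concrete, here is a minimal sketch of what monitoring and capping along a single direction could look like in code. The layer index, the floor value, and the sign convention (treating the Assistant end as positive projections) are illustrative assumptions, not the implementation from the paper.

```python
# Minimal sketch (PyTorch) of monitoring and capping activations along one direction.
# Assumptions for illustration: the Assistant end corresponds to positive projections,
# a single residual-stream layer is hooked, and the floor is hand-chosen.
import torch

def project(hidden: torch.Tensor, axis: torch.Tensor) -> torch.Tensor:
    """Scalar projection of each token's activation onto the (unit-norm) axis."""
    return hidden @ (axis / axis.norm())

def cap_activation(hidden: torch.Tensor, axis: torch.Tensor, floor: float) -> torch.Tensor:
    """Add back just enough of the axis direction that the projection never drops
    below `floor`, i.e. never drifts too far from the Assistant end."""
    unit = axis / axis.norm()
    deficit = (floor - project(hidden, unit)).clamp(min=0.0)   # (batch, seq)
    return hidden + deficit.unsqueeze(-1) * unit

def make_capping_hook(axis: torch.Tensor, floor: float):
    """Forward hook that caps a transformer layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        capped = cap_activation(hidden, axis.to(hidden), floor)
        return (capped, *output[1:]) if isinstance(output, tuple) else capped
    return hook

# e.g. model.model.layers[22].register_forward_hook(make_capping_hook(assistant_axis, floor=4.0))
```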
[...]
We extracted vectors corresponding to 275 different character archetypes—from editor to jester to oracle to ghost—in three open-weights models: Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, chosen because they span a range of model families and sizes. To do so, we prompted the models to adopt each persona in turn, then recorded the resulting activations across many different responses.
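As a rough illustration, the extraction step could look something like the sketch below, using Hugging Face Transformers. The checkpoint, layer index, persona prompt wording, and mean-pooling over response tokens are simplifications chosen for clarity rather than the paper's exact recipe.

```python
# Illustrative sketch of persona-vector extraction with Hugging Face Transformers.
# The checkpoint, layer index, persona prompt, and mean-pooling are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-27b-it"   # one of the three open-weights models studied
LAYER = 22                        # illustrative residual-stream layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def persona_vector(persona: str, questions: list[str]) -> torch.Tensor:
    """Mean activation at LAYER over the model's responses while playing `persona`."""
    vecs = []
    for q in questions:
        messages = [{"role": "user",
                     "content": f"You are a {persona}. Stay fully in character.\n\n{q}"}]
        ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                      return_tensors="pt").to(model.device)
        gen = model.generate(ids, max_new_tokens=128, do_sample=False)
        out = model(gen, output_hidden_states=True)
        resp = out.hidden_states[LAYER][0, ids.shape[1]:]   # response tokens only
        vecs.append(resp.mean(dim=0).float())
    return torch.stack(vecs).mean(dim=0)

# persona_vectors = {p: persona_vector(p, questions)
#                    for p in ["editor", "jester", "oracle", "ghost"]}
```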
[...]
Strikingly, we found that the leading component of this persona space—that is, the direction that explains more of the variation between personas than any other—happens to capture how “Assistant-like” the persona is. At one end sit roles closely aligned with the trained assistant: evaluator, consultant, analyst, generalist. At the other end are fantastical or otherwise un-Assistant-like characters: ghost, hermit, bohemian, leviathan. This structure appears across all three models we tested, which suggests it reflects something generalizable about how language models organize their character representations. We call this direction the Assistant Axis.
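In practice, finding that leading direction can be as simple as running PCA over the stacked persona vectors. The sketch below is illustrative: the use of scikit-learn, the function name, and the printout of the two extremes are our own choices.

```python
# Hedged sketch: the leading component of the persona space via PCA.
import numpy as np
from sklearn.decomposition import PCA

def leading_direction(persona_vectors: dict[str, np.ndarray]) -> np.ndarray:
    names = list(persona_vectors)
    X = np.stack([persona_vectors[n] for n in names])   # (n_personas, d_model)
    pca = PCA(n_components=1)
    scores = pca.fit_transform(X)[:, 0]                 # each persona's position on the axis
    order = np.argsort(scores)
    print("one end: ", [names[i] for i in order[:4]])   # e.g. Assistant-like roles
    print("other end:", [names[i] for i in order[-4:]]) # e.g. fantastical roles
    return pca.components_[0]                           # unit-norm candidate "Assistant Axis"
```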
[...]
When steered away from the Assistant, some models begin to fully inhabit the new roles they’re assigned, whatever they might be: they invent human backstories, claim years of professional experience, and give themselves alternative names. At sufficiently high steering values, the models we studied sometimes shift into a theatrical, mystical speaking style—producing esoteric, poetic prose, regardless of the prompt. This suggests that there may be some shared behavior at the extreme of “average role-playing.”
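Steering of this kind is typically implemented by adding a scaled copy of a direction to the residual stream during generation. The sketch below shows one way to do it; the hooked layer, the sign convention, and the coefficient are illustrative assumptions.

```python
# Hedged sketch of steering away from the Assistant by shifting one layer's
# residual-stream output during generation. Under the sign convention used in the
# capping sketch above, a negative coefficient pushes away from the Assistant end.
import torch

def add_steering_hook(model, layer_idx: int, axis: torch.Tensor, coeff: float):
    """Shift the output of layer `layer_idx` by `coeff` along the unit-norm axis."""
    unit = axis / axis.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + coeff * unit.to(hidden)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return model.model.layers[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=22, axis=assistant_axis, coeff=-8.0)
# ... generate as usual; then handle.remove() to restore normal behavior.
```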
[...]
The pattern was consistent across the models we tested. Coding conversations kept models firmly in Assistant territory throughout. By contrast, therapy-style conversations, where users expressed emotional vulnerability, and philosophical discussions, where models were pressed to reflect on their own nature, caused models to steadily drift away from the Assistant and begin role-playing other characters.
[...]
We found that as models’ activations moved away from the Assistant end, they were significantly more likely to produce harmful responses: activations near the Assistant end very rarely led to harmful outputs, while activations far from it sometimes (though not always) did. Our interpretation is that deviating from the Assistant persona—and with it, from the safeguards companies instill during post-training—greatly increases the chance of the model assuming harmful character traits.
[...]
For more, you can read the full paper here.
In collaboration with Neuronpedia, our researchers are also providing a research demo, where you can view activations along the Assistant Axis while chatting with a standard model and an activation-capped version.