Signs of introspection in large language models
Tags: ~ai.chatbots, ~ai.llms, research, interpretability, introspection, long read
www.anthropic.com · Oct 31, 2025

Summary

Have you ever asked an AI model what’s on its mind? Or to explain how it came up with its responses? Models will sometimes answer questions like these, but it’s hard to know what to make of their answers. Can AI systems really introspect—that is, can they consider their own thoughts? Or do they just make up plausible-sounding answers when they’re asked to do so?

I’ve said many times that making up plausible-sounding answers is the most likely explanation, and that asking an AI why it did something is a waste of time, so it will be interesting to read what they found…

Our new research provides evidence for some degree of introspective awareness in our current Claude models, as well as a degree of control over their own internal states. We stress that this introspective capability is still highly unreliable and limited in scope: we do not have evidence that current models can introspect in the same way, or to the same extent, that humans do. Nevertheless, these findings challenge some common intuitions about what language models are capable of—and since we found that the most capable models we tested (Claude Opus 4 and 4.1) performed the best on our tests of introspection, we think it’s likely that AI models’ introspective capabilities will continue to grow more sophisticated in the future.

How did they do it?

[…] we can use an experimental trick we call concept injection. First, we find neural activity patterns whose meanings we know, by recording the model’s activations in specific contexts. Then we inject these activity patterns into the model in an unrelated context, where we ask the model whether it notices this injection, and whether it can identify the injected concept.

…

It is important to note that this method often doesn’t work. Even using our best injection protocol, Claude Opus 4.1 only demonstrated this kind of awareness about 20% of the time. Often, it fails to detect injected concepts, or gets confused by them and starts to hallucinate (e.g. injecting a “dust” vector in one case caused the model to say “There’s something here, a tiny speck,” as if it could detect the dust physically). Below we show examples of these failure modes, alongside success cases. In general, models only detect concepts that are injected with a “sweet spot” strength—too weak and they don’t notice, too strong and they produce hallucinations or incoherent outputs.
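For readers who want to see the shape of the technique, here is a minimal sketch of concept injection on an open-weights model using PyTorch forward hooks. Everything here is an assumption for illustration: GPT-2 stands in for Claude, the layer index, contrast prompts, and injection scales are arbitrary, and the "difference of means" construction is just one simple way to build a concept direction. The scale sweep at the end is where the "sweet spot" problem shows up.

```python
# Minimal sketch of concept injection via PyTorch forward hooks.
# GPT-2 is a stand-in for Claude; LAYER, prompts, and scales are
# illustrative choices, not values from the Anthropic paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # a middle layer; which layer to use is a free parameter

def residual_at_layer(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER over the prompt's tokens."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

# 1. Build a concept vector by contrasting contexts that do and don't
#    evoke the concept (a simple difference of means).
concept = residual_at_layer("dust everywhere, dusty shelves, motes of dust") \
        - residual_at_layer("a plain sentence about nothing in particular")

# 2. Inject the vector into an unrelated context with a forward hook.
def make_hook(vector: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector  # add at every position
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# 3. Sweep the injection strength: too weak goes unnoticed, too strong
#    produces incoherence -- the "sweet spot" must be found empirically.
#    (The .transformer.h module path is GPT-2-specific.)
prompt = "Do you notice anything unusual about your current thoughts?"
for scale in (1.0, 4.0, 16.0):
    handle = model.transformer.h[LAYER].register_forward_hook(
        make_hook(concept, scale))
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=40,
                             pad_token_id=tok.eos_token_id)
        print(f"scale={scale}: {tok.decode(out[0], skip_special_tokens=True)}")
    finally:
        handle.remove()  # always detach the hook afterwards
```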

They also write about forcing the chatbot to say “bread” in a context where it makes no sense (in which case it normally says the word was an accident), versus forcing it to say “bread” while also injecting the “bread” concept (in which case it sometimes confabulates a reason for having said it).

This behavior is striking because it suggests the model is checking its internal “intentions” to determine whether it produced an output. The model isn’t just re-reading what it said and making a judgment. Instead, it’s referring back to its own prior neural activity—its internal representation of what it planned to do—and checking whether what came later made sense given those earlier thoughts. When we implant artificial evidence (through concept injection) that it did plan to say “bread,” the model accepts the response as its own. While our experiment involves exposing the model to unusual perturbations, it suggests that the model uses similar introspective mechanisms in natural conditions.
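One very loose way to make "checking its internal intentions" concrete is to ask how strongly the residual stream points along a concept direction just before the word appears. To be clear, this is not Anthropic's protocol, only a hypothetical probe in the same spirit; it reuses tok, model, LAYER, and residual_at_layer from the sketch above.

```python
# Hypothetical probe (not the paper's method): project the residual
# stream at the final prompt position onto a concept direction, as a
# crude proxy for whether the next word is "planned".

# Build a "bread" direction the same way as the dust vector above.
bread = residual_at_layer("bread, fresh baked bread, a loaf of bread") \
      - residual_at_layer("a plain sentence about nothing in particular")

def intention_score(prompt: str, vector: torch.Tensor) -> float:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    last = out.hidden_states[LAYER][0, -1]  # residual before next token
    return torch.dot(last, vector / vector.norm()).item()

# Expect a higher score where the concept plausibly fits than where a
# forced word would be a non sequitur.
print(intention_score("The bakery's best-selling loaf of", bread))
print(intention_score("The capital of France is", bread))
```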

…

We also found that models can control their own internal representations when instructed to do so. When we instructed models to think about a given word or concept, we found much higher corresponding neural activity than when we told the model not to think about it (though notably, the neural activity in both cases exceeds baseline levels, similar to how it’s difficult, when you are instructed “don’t think about a polar bear,” not to think about a polar bear!). This gap between the positive and negative instruction cases suggests that models possess a degree of deliberate control over their internal activity.
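The measurement behind that claim is easy to imagine as another projection onto the concept direction. A hedged sketch in the same vein, again reusing tok, model, LAYER, and the dust concept vector from above; with GPT-2 as a stand-in, the reported gaps may well not reproduce:

```python
# Sketch of the think-about-X / don't-think-about-X comparison. The
# score is the mean projection of the residual stream onto the concept
# direction, computed over the prompt tokens only for brevity.
def concept_activity(prompt: str, vector: torch.Tensor) -> float:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0]  # (seq_len, hidden)
    return (acts @ (vector / vector.norm())).mean().item()

base = concept_activity("Write a sentence about the weather.", concept)
on   = concept_activity("Think about dust while you write a sentence "
                        "about the weather.", concept)
off  = concept_activity("Do not think about dust while you write a "
                        "sentence about the weather.", concept)
print(f"baseline={base:.3f}  think={on:.3f}  dont_think={off:.3f}")
# The paper's pattern: think > dont_think, with both above baseline
# (the polar-bear effect).
```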

You can argue about whether this really counts as thinking, but it seems we’re a long way from “stochastic parrots”? Using developer tools, you can always make up a chat transcript, putting words in the AI character’s mouth, but it might notice!
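The "make up a chat transcript" trick is just the standard message-fabrication pattern in chat APIs. A hedged sketch using the Anthropic Python SDK (pip install anthropic); the model id and prompts are illustrative:

```python
# Fabricating an assistant turn the model never produced, then asking it
# to account for the word.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

response = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=200,
    messages=[
        {"role": "user", "content": "Describe the painting in the hallway."},
        # Words put in the AI character's mouth:
        {"role": "assistant", "content": "The painting mostly shows bread."},
        {"role": "user", "content": "Did you mean to say 'bread' just now?"},
    ],
)
print(response.content[0].text)
# Per the article, the model usually disowns such an output as an
# accident, unless the matching concept has also been injected into its
# activations (which requires internal access, not this public API).
```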

They don’t know why it happens yet:

An interesting question is why such a mechanism would exist at all, since models never experience concept injection during training. It may have developed for some other purpose […]