When we give a model a "hypothetical question", it internally performs one more next-token prediction operation (self-simulation), and when training only the head that extracts the desired attributes (second character, ethical attitude, etc.) from that output, its self-prediction accuracy was much higher than predictions from larger models (cross-prediction).
LLMs can partially recognize, distinguish, and regulate their own internal states by directly referencing internal activation values, and this ability, though unstable, appears more strongly in recent high-performance models.
introspection: The model reports its internal state accurately (#1), causally grounded in that state (#2), and through internal pathways rather than bypassing to output pathways (#3).
Potential for improved interpretability and self-explanation vs. coexisting risk of self-report manipulation.
Emergent Introspective Awareness in Large Language Models
We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
https://transformer-circuits.pub/2025/introspection/index.html
just confusion
Introspection or confusion? — LessWrong
I'm new to mechanistic interpretability research. Got fascinated by the recent Anthropic research suggesting that LLMs can introspect[1][2], i.e. det…
https://www.lesswrong.com/posts/kfgmHvxcTbav9gnxe/introspection-or-confusion
detecting steering
Introspection via localization — LessWrong
Recently, Anthropic found evidence that language models can "introspect", i.e. detect changes in their internal activations.[1] This was then reprodu…
https://www.lesswrong.com/posts/3HXAQEK86Bsbvh4ne/introspection-via-localization
LLMs' ability to internally distinguish whether they are in an evaluation context or actual deployment context (= evaluation awareness) increases predictably as model size grows
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models — LessWrong
Authors: Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julian Tan, Kevin Zhu, Ryan Laggasse, Vasu Sharma, Ashwinee Panda …
https://www.lesswrong.com/posts/gdFHYpQ9pjMwQ3w4Q/evaluation-awareness-scales-predictably-in-open-weights

Seonglae Cho