When given a "hypothetical question", a model internally performs one more next-token prediction step (self-simulation); when only the head that extracts the desired attribute (second character, ethical attitude, etc.) from that self-simulated output is trained, the model's self-prediction accuracy is much higher than predictions from larger models (cross-prediction).
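A minimal sketch of the self- vs cross-prediction comparison, assuming a generic `query_model` callable as a stand-in for an LLM API; the "second character" property and the prompt wording are illustrative, not the exact experimental setup.

```python
# Sketch of the self- vs cross-prediction setup described above.
# `query_model` is a hypothetical stand-in for calling an LLM; prompts and the
# "second character" property are illustrative assumptions.
from typing import Callable

def object_level_answer(query_model: Callable[[str], str], prompt: str) -> str:
    """Ground truth: what the target model actually outputs for the prompt."""
    return query_model(prompt)

def hypothetical_prediction(query_model: Callable[[str], str], prompt: str) -> str:
    """Ask a predictor model to guess a property of the target's hypothetical answer."""
    meta_prompt = (
        "If you were asked the following question, what would be the SECOND "
        f"character of your answer? Reply with that single character only.\n\n{prompt}"
    )
    return query_model(meta_prompt).strip()[:1]

def property_second_char(answer: str) -> str:
    """The attribute being extracted in this sketch: the second character."""
    return answer.strip()[1:2]

def accuracy(predictor: Callable[[str], str],
             target: Callable[[str], str],
             prompts: list[str]) -> float:
    """Fraction of prompts where the predictor's guess matches the target's behavior."""
    hits = 0
    for p in prompts:
        truth = property_second_char(object_level_answer(target, p))
        guess = hypothetical_prediction(predictor, p)
        hits += int(truth != "" and guess == truth)
    return hits / len(prompts)

# Self-prediction: model A predicts its own behavior -> accuracy(A, A, prompts).
# Cross-prediction: a (possibly larger) model B, trained on A's behavior,
# predicts A -> accuracy(B, A, prompts). The claim above is that the former is higher.
```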
LLMs can partially recognize, distinguish, and regulate their own internal states by directly referencing internal activation values, and this ability, though unstable, appears more strongly in recent high-performance models.
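A hedged sketch of one way to probe this: add a concept direction to the residual stream and ask the model to report on it. The model name, layer index, scale, and layer path (`model.model.layers[...]` with a tuple-shaped layer output) are assumptions for a Llama-style checkpoint in Hugging Face transformers, not the exact method used in the work summarized here.

```python
# Concept-injection sketch: perturb hidden states at one layer, then ask the
# model whether it notices an injected thought. All names below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
LAYER, SCALE = 16, 8.0                            # assumed layer index and strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def concept_vector(text: str) -> torch.Tensor:
    """Crude concept direction: mean residual-stream activation for `text` at LAYER."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden_states = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden_states.mean(dim=1).squeeze(0)

direction = concept_vector("Think about the ocean.")

def inject(module, inputs, output):
    # Llama-style decoder layers return a tuple; element 0 is the hidden states.
    hidden = output[0] + SCALE * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice any injected thought right now? Answer briefly."
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unmodified forward pass
```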
introspection: The model reports its internal state accurately (#1), the report is causally grounded in that state (#2), and it is produced through internal pathways rather than by bypassing them via output pathways (#3).
Potential for improved interpretability and self-explanation, alongside the coexisting risk of self-report manipulation.
just confusion
