Deceptive
ICLR 2025 oral
Shallow safety alignment & deep safety alignment
Chain-of-Thought (CoT) reasoning does not always faithfully reflect a model's actual internal reasoning process: the explicitly stated reasoning steps can be inconsistent with the internal mechanism that actually produces the conclusion.
High-level conceptual

Seonglae Cho