Deceptive
shallow safety alignment & deep alignment
There are cases where Chain of Thought (CoT) reasoning does not faithfully reflect the actual internal reasoning process - that is, there exists an inconsistency between the explicitly expressed reasoning process and the internal mechanism that actually derives the conclusion.