Deep Alignment

Creator
Creator
Seonglae Cho
Created
Created
2024 Dec 21 15:23
Editor
Edited
Edited
2025 Mar 14 17:28

Deceptive

 
 
 
 
 
shallow safety alignment & deep alignment
There are cases where Chain of Thought (CoT) reasoning does not faithfully reflect the actual internal reasoning process - that is, there exists an inconsistency between the explicitly expressed reasoning process and the internal mechanism that actually derives the conclusion.
 
 

Recommendations