Deep Alignment

Creator
Seonglae Cho
Created
2024 Dec 21 15:23
Editor
Edited
2025 Apr 16 15:3

Deceptive

Shallow safety alignment & deep safety alignment
Chain-of-Thought (CoT) reasoning does not always faithfully reflect a model's actual internal reasoning process. That is, the explicitly stated reasoning can be inconsistent with the internal mechanism that actually produces the conclusion.
