Not only Deliberative Alignment cot think too. CoT monitoring is significantly more effective than monitoring outputs/actions alone. Current-scale RL training does not significantly harm CoT monitoring. CoT monitoring is not a replacement but a complement to mechanistic interpretability.
CoT Auditing
Creator
Creator
Seonglae ChoCreated
Created
2025 Dec 22 22:57Editor
Editor
Seonglae ChoEdited
Edited
2025 Dec 22 22:58Refs
Refs

