CoT Auditing

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Dec 22 22:57
Editor
Edited
Edited
2026 Jan 9 16:0
Refs
Refs
 
 
 
 
 
 
Not only
Deliberative Alignment
cot think too. CoT monitoring is significantly more effective than monitoring outputs/actions alone. Current-scale RL training does not significantly harm CoT monitoring. CoT monitoring is not a replacement but a complement to mechanistic interpretability.
Evaluating chain-of-thought monitorability
We introduce evaluations for chain-of-thought monitorability and study how it scales with test-time compute, reinforcement learning, and pretraining.
Evaluating chain-of-thought monitorability
 

Recommendations