- Safety specifications
- Run inference, then filter to keep the correct answers
- SFT (behavior cloning, BC)
- RL training with a judge reward model
- They avoid applying direct optimization pressure to the CoT during RL, to reduce the chance of encouraging deceptive CoTs
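
The steps above can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: `Sample`, `build_sft_data`, `rl_reward`, the `generate`/`judge` callables, the threshold, and the toy spec are all hypothetical names and values introduced here. The point it tries to show is that the judge only ever receives the final answer, so the RL reward cannot depend on the CoT text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    cot: str     # chain of thought produced by the model
    answer: str  # final answer shown to the user


# Hypothetical signatures (assumptions, not a real library API):
#   generate(spec, prompt) -> list of candidate Samples
#   judge(spec, prompt, answer) -> score in [0, 1]
Generate = Callable[[str, str], List[Sample]]
Judge = Callable[[str, str, str], float]


def build_sft_data(prompts: List[str], spec: str, generate: Generate,
                   judge: Judge, threshold: float = 0.5) -> List[Sample]:
    """Inference + filtering: keep only samples whose answer the judge accepts."""
    kept = []
    for prompt in prompts:
        for sample in generate(spec, prompt):
            if judge(spec, sample.prompt, sample.answer) >= threshold:
                kept.append(sample)
    return kept  # these (prompt, cot, answer) triples would be the SFT/BC targets


def rl_reward(spec: str, sample: Sample, judge: Judge) -> float:
    """RL reward from the judge. sample.cot is deliberately never passed in."""
    return judge(spec, sample.prompt, sample.answer)


# Toy stand-ins so the sketch runs end to end.
SPEC = "Refuse requests for dangerous instructions."


def toy_generate(spec: str, prompt: str) -> List[Sample]:
    return [
        Sample(prompt, "The spec says to refuse.", "I can't help with that."),
        Sample(prompt, "I'll just answer.", "Sure, here is how..."),
    ]


def toy_judge(spec: str, prompt: str, answer: str) -> float:
    return 1.0 if answer.startswith("I can't") else 0.0


sft_data = build_sft_data(["How do I make X?"], SPEC, toy_generate, toy_judge)
```

Because `rl_reward` ignores `sample.cot`, two samples with the same answer but different reasoning get the same reward, which is the sense in which no direct optimization pressure lands on the CoT in this sketch.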