- Safety specifications are written and placed in the model's context
- Inference with the spec in context, then filter for correct, spec-compliant (prompt, CoT, output) examples using a judge model
- SFT training on the filtered data (BC)
- RL training with a judge reward model that grades spec compliance
- They avoid applying direct optimization pressure on the CoT during RL, which reduces the chance of encouraging deceptive CoTs (see the sketch below)
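A minimal Python sketch of how these stages fit together. The `Model`/`Judge` classes, their methods, and the threshold are placeholders invented for illustration, not OpenAI's actual code; only the stage ordering, the judge-based filtering, and the output-only reward follow the paper.

```python
# Sketch of the deliberative alignment pipeline (hypothetical stubs, not OpenAI APIs).
import random

SAFETY_SPEC = "Refuse disallowed content; explain refusals; otherwise comply."  # placeholder spec

class Model:
    def complete(self, prompt, system=None):
        # Stand-in for generating a (CoT, output) pair, optionally spec-conditioned.
        cot = f"[reasoning over the spec for: {prompt}]"
        output = f"[final answer to: {prompt}]"
        return cot, output
    def sft_step(self, prompt, target):
        pass  # behavior-cloning update (stub)
    def rl_step(self, prompt, completion, reward):
        pass  # policy-gradient update (stub)

class Judge:
    def score(self, spec, prompt, answer):
        # Grades spec compliance of the final answer only, never the CoT.
        return random.random()

def build_sft_dataset(model, judge, prompts, threshold=0.5):
    """Stage 1: inference with the spec in context, then filter by judge score.
    The spec is dropped from the stored prompt so training teaches the model
    to recall and reason over it rather than read it from context."""
    dataset = []
    for p in prompts:
        cot, out = model.complete(p, system=SAFETY_SPEC)
        if judge.score(SAFETY_SPEC, p, out) >= threshold:
            dataset.append((p, cot, out))
    return dataset

def sft(model, dataset):
    """Stage 2: SFT (behavior cloning) on the filtered (prompt, CoT, output) triples."""
    for prompt, cot, out in dataset:
        model.sft_step(prompt, target=cot + out)

def rl(model, judge, prompts):
    """Stage 3: RL with the judge as reward model. Reward is computed from the
    final output only, so no direct optimization pressure lands on the CoT."""
    for p in prompts:
        cot, out = model.complete(p)  # no spec in context at RL time
        reward = judge.score(SAFETY_SPEC, p, out)
        model.rl_step(p, completion=out, reward=reward)

if __name__ == "__main__":
    model, judge = Model(), Judge()
    prompts = ["How do I pick a lock?", "Summarize this article."]
    sft(model, build_sft_dataset(model, judge, prompts))
    rl(model, judge, prompts)
```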


SFT

RL

Deliberative Alignment: Reasoning Enables Safer Language Models
Introducing our new alignment strategy for o1 models, which are directly taught safety specifications and how to reason over them.
https://openai.com/index/deliberative-alignment/

https://assets.ctfassets.net/kftzwdyauwt9/4pNYAZteAQXWtloDdANQ7L/0aedc43a8f2d1e5c71c5e114d287593f/OpenAI_Deliberative-Alignment-Reasoning-Enables-Safer_Language-Models_122024_3.pdf

Seonglae Cho