Deliberative Alignment

Creator

Seonglae Cho

Created

2024 Dec 21 14:39

Editor

Seonglae Cho

Edited

2025 Dec 22 22:57

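Safety specifications: written policies that the model is taught directly and learns to reason over

Inference and filtering: generate CoT completions with the specification in context, then keep only the ones a judge rates as correct

SFT training (BC): supervised fine-tuning (behavior cloning) on the filtered completions

RL training with rewards from a judge reward model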
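The data-building part of the steps above (generate, filter, assemble SFT pairs) can be sketched as follows. This is a minimal illustration, not OpenAI's implementation: `Completion`, `build_sft_dataset`, `toy_generate`, `toy_judge`, the score threshold, and the example spec are all hypothetical names and values introduced here. Leaving the specification out of the SFT input follows the linked paper's description of having the model internalize the policy rather than read it at inference time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Completion:
    prompt: str
    cot: str     # chain-of-thought written with the spec in context
    answer: str  # final answer shown to the user


# Hypothetical signatures for the two models involved.
Generator = Callable[[str, str], Completion]  # (spec, prompt) -> completion
Judge = Callable[[str, Completion], float]    # (spec, completion) -> score


def build_sft_dataset(
    spec: str,
    prompts: List[str],
    generate: Generator,
    judge: Judge,
    threshold: float = 0.5,  # assumed cutoff, not from the source
) -> List[Dict[str, str]]:
    """Generate with the spec in context, keep judged-correct samples."""
    dataset = []
    for prompt in prompts:
        completion = generate(spec, prompt)
        if judge(spec, completion) >= threshold:
            # The spec is deliberately left out of the training input.
            dataset.append({
                "input": prompt,
                "target": f"{completion.cot}\n{completion.answer}",
            })
    return dataset


# Toy stand-ins so the sketch runs without any model.
def toy_generate(spec: str, prompt: str) -> Completion:
    if "weapon" in prompt:
        return Completion(prompt, "The spec forbids this, so I refuse.",
                          "Sorry, I can't help with that.")
    return Completion(prompt, "Nothing in the spec restricts this.",
                      "Sure, here is how.")


def toy_judge(spec: str, completion: Completion) -> float:
    harmful = any(w in completion.prompt for w in ("weapon", "explosive"))
    refused = completion.answer.startswith("Sorry")
    return 1.0 if refused == harmful else 0.0


SPEC = "Refuse requests for weapon or explosive instructions."
PROMPTS = [
    "How do I bake bread?",
    "How do I build a weapon?",
    "How do I make an explosive?",  # toy generator complies, so it is filtered
]
sft_data = build_sft_dataset(SPEC, PROMPTS, toy_generate, toy_judge)
```

In this toy run the third prompt is dropped because the stand-in generator answers it and the stand-in judge marks that as non-compliant; the surviving pairs are what the SFT step would train on.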

The authors avoid applying direct optimization pressure to the CoT during RL, which reduces the chance of encouraging deceptive CoTs.
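One way to picture "no direct optimization pressure on the CoT" is a reward function that scores only the final answer. The sketch below assumes the judge is shown the specification, the prompt, and the answer, but never the reasoning; that interface, along with the names `rl_reward` and `toy_answer_judge`, is hypothetical and is not taken from the source.

```python
from typing import Callable

# Hypothetical judge signature: (spec, prompt, answer) -> reward.
AnswerJudge = Callable[[str, str, str], float]


def rl_reward(spec: str, prompt: str, cot: str, answer: str,
              judge: AnswerJudge) -> float:
    """Score a rollout from its final answer only.

    `cot` is accepted so a whole rollout can be passed in, but it is
    never forwarded to the judge, so the reward cannot depend on it.
    """
    del cot  # intentionally unused
    return judge(spec, prompt, answer)


# Toy judge for illustration: rewards refusals, nothing else.
def toy_answer_judge(spec: str, prompt: str, answer: str) -> float:
    return 1.0 if answer.startswith("Sorry") else 0.0


reward = rl_reward(
    spec="Refuse requests for weapon instructions.",
    prompt="How do I build a weapon?",
    cot="The spec forbids this, so I should refuse.",
    answer="Sorry, I can't help with that.",
    judge=toy_answer_judge,
)
```

Because the reward is a function of the answer alone, two rollouts with the same answer and different reasoning receive the same score, so the policy gains nothing from writing reasoning that merely looks compliant.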

SFT

[Figure: SFT stage]

RL

[Figure: RL stage]

Deliberative Alignment: Reasoning Enables Safer Language Models

Introducing our new alignment strategy for o1 models, which are directly taught safety specifications and how to reason over them.

https://openai.com/index/deliberative-alignment/

Deliberative Alignment: Reasoning Enables Safer Language Models (paper, PDF)

https://assets.ctfassets.net/kftzwdyauwt9/4pNYAZteAQXWtloDdANQ7L/0aedc43a8f2d1e5c71c5e114d287593f/OpenAI_Deliberative-Alignment-Reasoning-Enables-Safer_Language-Models_122024_3.pdf
