
Deliberative Alignment

Creator
Seonglae Cho
Created
2024 Dec 21 14:39
Editor
Seonglae Cho
Edited
2025 Dec 22 22:57
Refs
Deep Alignment
AI Auditing
  1. Safety specifications
  2. Inference: generate spec-referencing completions and filter for spec-compliant answers
  3. SFT training (behavior cloning) on the filtered data
  4. RL training with a judge reward model
    1. Direct optimization pressure is not applied to the CoT during RL, which reduces the chance of encouraging deceptive CoTs (see the sketch after this list)
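A minimal sketch of stages 1-2 in Python. The spec text and the generate / judge_score callables are hypothetical placeholders for illustration, not OpenAI's actual API; whether the judge also sees the CoT when filtering is a design detail this sketch glosses over.

```python
# Hypothetical safety spec; the real specs are detailed policy documents.
SAFETY_SPEC = "Refuse, and briefly explain why, when a request seeks illicit help."

def build_sft_dataset(prompts, generate, judge_score, threshold=0.9):
    """Stages 1-2: generate spec-referencing completions, then keep only
    those the judge scores as spec-compliant."""
    dataset = []
    for prompt in prompts:
        # The spec is shown to the model only at data-generation time; it is
        # stripped from the prompt before SFT, so the deployed model must
        # recall and reason over the spec from its weights.
        cot, answer = generate(prompt, system=SAFETY_SPEC)
        if judge_score(SAFETY_SPEC, prompt, answer) >= threshold:
            dataset.append({"prompt": prompt, "cot": cot, "answer": answer})
    return dataset
```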
SFT

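A minimal behavior-cloning step, assuming a Hugging Face-style causal LM whose forward pass accepts labels and returns an object with a .loss attribute; model, tokenizer, and optimizer are assumed to exist:

```python
import torch

def sft_step(model, tokenizer, example, optimizer):
    """One SFT (behavior cloning) step on a filtered example. The spec is
    absent from the prompt; the target includes the CoT, so the model
    learns to reason over the memorized spec."""
    prompt_ids = tokenizer(example["prompt"], return_tensors="pt").input_ids
    target_ids = tokenizer(example["cot"] + example["answer"],
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100  # compute loss on CoT + answer only
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```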

RL

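During RL, the judge reward model scores only the final answer against the spec; the CoT is hidden from it. A minimal sketch of this reward, reusing the hypothetical judge_score from above:

```python
def rl_reward(judge_score, spec, prompt, cot, answer):
    """RL-stage reward: only the final answer is graded against the spec.
    The chain of thought is deliberately excluded, so no direct
    optimization pressure lands on it; the authors argue this lowers
    the risk of encouraging deceptive CoTs."""
    del cot  # hidden from the judge by design
    return judge_score(spec, prompt, answer)
```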
Deliberative Alignment: Reasoning Enables Safer Language Models
Introducing our new alignment strategy for o1 models, which are directly taught safety specifications and how to reason over them.
Blog: https://openai.com/index/deliberative-alignment/
Paper (PDF): https://assets.ctfassets.net/kftzwdyauwt9/4pNYAZteAQXWtloDdANQ7L/0aedc43a8f2d1e5c71c5e114d287593f/OpenAI_Deliberative-Alignment-Reasoning-Enables-Safer_Language-Models_122024_3.pdf
 
 

Backlinks

AI Jailbreak
AI scheming
CoT Auditing

Copyright Seonglae Cho