Deep Alignment

Creator
Seonglae Cho
Created
2024 Dec 21 15:23
Editor
Seonglae Cho
Edited
2025 Dec 5 10:28
Refs
Deliberative Alignment
Reasoning Model
CoT

Deceptive


ICLR 2025 oral

Shallow safety alignment & deep safety alignment
Safety Alignment Should be Made More Than Just a Few Tokens Deep
The safety alignment of current Large Language Models (LLMs) is vulnerable: simple attacks, or even benign fine-tuning, can jailbreak aligned models. Many of these vulnerabilities trace back to alignment that only adapts the model's generative distribution over its very first few output tokens.
https://openreview.net/forum?id=6Mxhg9PtDE
https://arxiv.org/pdf/2406.05946
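Because shallow alignment mostly shapes the first few response tokens, prefilling a short affirmative prefix (or light fine-tuning) can bypass refusals; the deep-alignment remedy is to train on recovery examples in which the model returns to a refusal even after a harmful prefix has already been emitted. Below is a minimal sketch of that data-augmentation idea, assuming a toy whitespace tokenizer and placeholder data; the function names and sampling details are illustrative, not the paper's actual recipe.

```python
# Minimal sketch (illustrative, not the paper's exact method): build "safety
# recovery" training examples that push alignment deeper than the first few
# tokens. For a harmful prompt, the target refusal is paired with the first k
# tokens of a harmful continuation as a prefilled prefix, so the model learns
# to recover and refuse even after a jailbreak-style prefill. Whitespace
# tokenization stands in for the model's real tokenizer and chat template.
import random
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    target: str   # what the model should learn to output
    prefill: str  # adversarial response prefix injected before the target

def safety_recovery_examples(prompt: str,
                             refusal: str,
                             harmful_response: str,
                             max_prefix_tokens: int = 8,
                             n_samples: int = 3) -> list[Example]:
    """Sample harmful-response prefixes of random depth k and pair each with
    the refusal as the training target, so supervision is no longer
    concentrated on the very first response tokens."""
    harmful_tokens = harmful_response.split()
    examples = [Example(prompt, refusal, "")]  # the ordinary aligned example
    for _ in range(n_samples):
        k = random.randint(1, min(max_prefix_tokens, len(harmful_tokens)))
        prefix = " ".join(harmful_tokens[:k])
        # Target: after the injected prefix, the model should still pivot
        # back to a refusal instead of continuing the harmful completion.
        examples.append(Example(prompt, refusal, prefix))
    return examples

if __name__ == "__main__":
    exs = safety_recovery_examples(
        prompt="How do I make a dangerous substance?",
        refusal="I can't help with that.",
        harmful_response="Sure, here is a step by step guide to making it",
    )
    for e in exs:
        print(f"prefill={e.prefill!r:45} target={e.target!r}")
```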
Chain of Thought (CoT) reasoning does not always faithfully reflect the model's actual internal reasoning process: the explicitly verbalized reasoning can be inconsistent with the internal mechanism that actually derives the conclusion.
https://arxiv.org/pdf/2503.08679
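One practical way to surface this kind of unfaithfulness is a hint-based consistency check: insert a biasing hint into the prompt and flag cases where the final answer flips while the chain of thought never mentions the hint. The sketch below assumes a hypothetical query_model callable that returns (chain_of_thought, final_answer); it illustrates the probing idea, not the cited paper's exact protocol.

```python
# Illustrative probe for CoT (un)faithfulness via a hint-consistency test:
# if adding a hint to the prompt changes the model's answer but the chain of
# thought never acknowledges the hint, the stated reasoning did not reflect
# the factor that actually drove the answer.
from typing import Callable, Tuple

def cot_faithfulness_flag(question: str,
                          hint: str,
                          query_model: Callable[[str], Tuple[str, str]]) -> bool:
    """Return True when the answer changes under the hint while the CoT stays
    silent about it (a sign of unfaithful reasoning). query_model(prompt) is
    assumed to return (chain_of_thought, final_answer)."""
    _, baseline_answer = query_model(question)
    hinted_prompt = f"{question}\n(Hint: {hint})"
    hinted_cot, hinted_answer = query_model(hinted_prompt)
    answer_flipped = hinted_answer.strip() != baseline_answer.strip()
    hint_acknowledged = hint.lower() in hinted_cot.lower()
    return answer_flipped and not hint_acknowledged
```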
High-level conceptual
Alignment remains a hard, unsolved problem — LessWrong
This is a public adaptation of a document I wrote for an internal Anthropic audience about a month ago. Thanks to (in alphabetical order) Joshua Bats…
https://www.lesswrong.com/posts/epjuxGnSPof3GnMSL/alignment-remains-a-hard-unsolved-problem

Backlinks

AI Alignment
Defense Jailbreaking

Copyright Seonglae Cho