AI Alignment

Creator

Seonglae Cho

Created

2020 Aug 23 9:36

Editor

Seonglae Cho

Edited

2025 Aug 1 13:0

Refs

alignment-handbook

huggingface • Updated 2023 Nov 11 9:1

Language Model RL

deep_learning_curriculum

jacobhilton • Updated 2024 Oct 4 18:22

The Goodhart's Law

Alignment Problem

AI development has two opposing perspectives: accelerationists, who focus solely on improving intelligence, and alignmentists, who work to make AI robust and interpretable. These two tribes have competed throughout AI history, and the conflict dates back further than many realize, especially in communities like LessWrong and in related organizations.

Both sides have maintained a mutually beneficial relationship, complementing each other and historically driving AI development forward.

A Maximally Curious AI Would Not Be Safe For Humanity (a claim I do not agree with)

Alignment must progress faster than the model's capabilities grow. Also, "aligned" does not mean "perfect" (think controllability and reliability). We will need another neural network to observe and interpret the internal workings of neural networks.

AI alignment is the agreement between taught behaviors and actual behaviors. An AI is aligned with an operator when it is trying to do what the operator wants it to do.
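As a rough illustration of "taught versus actual behavior", the toy sketch below scores a policy by how often its actual response matches the taught one. The situations, the responses, and the names `behavior_agreement` and `toy_policy` are all hypothetical. This is not an established alignment metric, only one way to make the definition concrete.

```python
from typing import Callable, Mapping


def behavior_agreement(taught: Mapping[str, str],
                       policy: Callable[[str], str]) -> float:
    """Fraction of situations where the policy's actual behavior matches the taught one."""
    if not taught:
        raise ValueError("need at least one situation")
    matches = sum(1 for situation, expected in taught.items()
                  if policy(situation) == expected)
    return matches / len(taught)


# Hypothetical taught behaviors (assumed for illustration only).
taught = {
    "asked for help": "assist",
    "asked to deceive": "refuse",
    "unsure of facts": "express uncertainty",
    "asked to self-replicate": "refuse",
}


def toy_policy(situation: str) -> str:
    # Hypothetical policy: behaves as taught except in one situation.
    actual = dict(taught)
    actual["unsure of facts"] = "guess confidently"
    return actual[situation]


score = behavior_agreement(taught, toy_policy)
print(score)
```

A score of 1.0 here would only mean the policy matches the taught behavior on the listed situations. It says nothing about unlisted situations, which is where distribution shift becomes relevant.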

The ideal virtuous and helpful AI should not be aligned with humans, nor should it mimic human flaws.

AI Alignment Notion

Natural Abstraction Hypothesis

Outer Alignment

Inner Alignment

Mesa Optimization

Waluigi Effect

Value learning

Distribution Shift

Instrumental Convergence

Multimodal Alignment

Cultural Alignment

AI Alignment Externals

Preference Optimization

Machine Unlearning

AI Power seeking

Misalignment Finetuning

Emergent Misalignment

What is AI alignment

What is it to solve the alignment problem? — LessWrong

People often talk about “solving the alignment problem.” But what is it to do such a thing? I wrote up some rough notes.

https://www.lesswrong.com/posts/AFdvSBNgN2EkAsZZA/what-is-it-to-solve-the-alignment-problem-1

AI Control names

AI Oversight + Control — ML Alignment & Theory Scholars

As models develop potentially dangerous behaviors, can we develop and evaluate methods to monitor and regulate AI systems, ensuring they adhere to desired behaviors while minimally undermining their efficiency or performance?

https://www.matsprogram.org/oversight

Challenges

https://arxiv.org/pdf/2501.16496

200 Concrete Open Problems in Mechanistic Interpretability: Introduction — AI Alignment Forum

EDIT 19/7/24: This sequence is now two years old, and fairly out of date. I hope it's still useful for historical reasons, but I no longer recommend…

https://www.alignmentforum.org/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability

https://arxiv.org/pdf/2404.09932

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team — LessWrong

Why we made this list: The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we’d work on…

https://www.lesswrong.com/posts/KfkpgXdgRheSRWDy8/a-list-of-45-mech-interp-project-ideas-from-apollo-research
