Inner Alignment

Creator: Seonglae Cho
Created: 2024 Apr 18 12:2
Edited: 2024 Apr 18 16:9
Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all? More specifically, it is the problem of ensuring that mesa-optimizers (i.e. trained ML systems that are themselves optimizers) are aligned with the objective function of the training process. The term was first defined in Hubinger et al.'s Risks from Learned Optimization:
"We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem."
As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment. (Waluigi Effect)
 
 
 
Inner Alignment - LessWrong
Inner alignment is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.
Goal misgeneralization due to distribution shift is another example of an inner alignment failure: the mesa-objective appears to pursue the base objective during training but does not pursue it during deployment. Good performance on the training distribution can mistakenly suggest that the mesa-optimizer is pursuing the base objective, when in fact correlations in the training distribution produced good performance on both the base and mesa objectives. A distribution shift from training to deployment breaks that correlation, and the mesa-objective fails to generalize. This is especially problematic when capabilities generalize to the deployment distribution while the objectives/goals do not, because we are then left with a capable system optimizing for a misaligned goal.
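To make this failure mode concrete, here is a minimal sketch in the spirit of the often-cited CoinRun-style example: a tabular Q-learning agent in a short corridor where, during training, the coin (the base objective) always sits at the rightmost cell, so the learned policy internalizes the proxy goal "go right". When the coin is moved at deployment, the agent still navigates competently but never collects it. The environment, names, and parameters below are illustrative assumptions, not taken from the cited sources.

```python
# Toy illustration of goal misgeneralization under distribution shift.
# Assumed setup: a 1-D corridor; the base objective is "reach the coin", but
# during training the coin is always at the rightmost cell, so the policy only
# ever needs the proxy goal "move right".
import random

random.seed(0)

N_CELLS = 7               # corridor cells 0..6
START = N_CELLS // 2      # agent starts in the middle
ACTIONS = (-1, +1)        # move left, move right
MAX_STEPS = 10

def run_episode(policy, coin_pos, q=None, alpha=0.5, gamma=0.9, eps=0.0):
    """Roll out one episode; if a Q-table is given, update it with Q-learning.
    The observation is only the agent's position (the coin is not observed),
    so the learned policy can only encode a position-based proxy goal."""
    pos, total = START, 0.0
    for _ in range(MAX_STEPS):
        action = random.choice(ACTIONS) if random.random() < eps else policy(pos)
        new_pos = min(max(pos + action, 0), N_CELLS - 1)
        reward = 1.0 if new_pos == coin_pos else 0.0
        if q is not None:
            best_next = max(q[(new_pos, a)] for a in ACTIONS)
            q[(pos, action)] += alpha * (reward + gamma * best_next - q[(pos, action)])
        pos, total = new_pos, total + reward
        if reward > 0:            # episode ends when the coin is collected
            break
    return total

# Training distribution: the coin is always at the rightmost cell.
q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
greedy = lambda s: max(ACTIONS, key=lambda a: q[(s, a)])
for _ in range(2000):
    # eps=1.0: act randomly while learning; Q-learning is off-policy, so the
    # greedy policy still converges to "always move right" in this environment.
    run_episode(greedy, coin_pos=N_CELLS - 1, q=q, eps=1.0)

# Deployment: same navigation task, but the coin has moved to the left end.
print("coin at right (training distribution):",
      run_episode(greedy, coin_pos=N_CELLS - 1))   # expected: 1.0
print("coin at left  (distribution shift):   ",
      run_episode(greedy, coin_pos=0))             # expected: 0.0
```

In this sketch the capability (navigating the corridor) transfers cleanly to the shifted distribution, while the objective does not: the agent was never optimizing "get the coin", only the proxy that happened to correlate with it during training.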
 
 
