Mesa Optimization

Creator: Seonglae Cho
Created: 2024 Apr 18 13:50
Edited: 2024 Apr 19 14:18
In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense "trying" to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.
Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.
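As a toy illustration (my own sketch, not from the cited sources), the base-optimizer/mesa-optimizer structure can be made concrete in code: a base optimizer tunes a model's parameters against an outer objective, while the learned model itself runs an inner optimization loop at inference time. The names (`mesa_model`, `outer_loss`) and the random-search base optimizer are hypothetical and purely illustrative.

```python
# Hypothetical toy sketch of the base-optimizer / mesa-optimizer structure:
# the base optimizer tunes theta by random search, while the learned "model"
# it produces runs its own inner gradient-descent loop in a single forward
# pass -- i.e., the model is itself an optimizer.
import numpy as np

rng = np.random.default_rng(0)

def mesa_model(theta, x_context, y_context, x_query, inner_steps=20, lr=0.1):
    """Fits a weight w to the context by inner gradient descent on a squared
    loss, then predicts for the query. The inner loop is the mesa-optimization
    happening inside one forward pass."""
    w = theta.copy()  # inner initialization is chosen by the base optimizer
    for _ in range(inner_steps):
        grad = 2 * x_context * (w * x_context - y_context)  # d/dw of (w*x - y)^2
        w -= lr * grad.mean()
    return w * x_query

def outer_loss(theta):
    """Base objective: average prediction error over sampled tasks y = slope * x."""
    losses = []
    for _ in range(32):
        slope = rng.normal()
        x_c, x_q = rng.normal(size=8), rng.normal()
        losses.append((mesa_model(theta, x_c, slope * x_c, x_q) - slope * x_q) ** 2)
    return np.mean(losses)

# Base optimizer: crude random search over the inner initialization theta.
theta = np.array(0.0)
for _ in range(100):
    candidate = theta + 0.1 * rng.normal()
    if outer_loss(candidate) < outer_loss(theta):
        theta = candidate
print("base-optimized inner init:", theta, "outer loss:", outer_loss(theta))
```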
There have been concerns that the underlying mechanism of in-context learning might be mesa-optimization, a hypothesized situation in which a model develops an internal optimization algorithm. However, in its In-context learning ability analysis, Anthropic did not observe any evidence of mesa-optimizers.
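One concrete reason the hypothesis is taken seriously: for linear regression, the prediction after a single gradient-descent step on the in-context examples can be rewritten as an attention-style readout over the context (the construction studied by von Oswald et al., 2023, "Transformers learn in-context by gradient descent"). Below is my own minimal numerical check of that identity, not code from the source or from Anthropic's analysis.

```python
# Numerical check (author's own sketch): for least-squares regression, one
# gradient-descent step on the in-context loss, starting from w = 0, yields the
# same query prediction as a linear-attention-style sum over the context.
import numpy as np

rng = np.random.default_rng(1)
d, n, eta = 4, 16, 0.01
X = rng.normal(size=(n, d))      # in-context inputs x_1..x_n
w_true = rng.normal(size=d)
y = X @ w_true                   # in-context targets y_i = w_true . x_i
x_q = rng.normal(size=d)         # query input

# Explicit inner optimizer: one GD step on L(w) = sum_i (w.x_i - y_i)^2 from w = 0.
w = np.zeros(d)
w -= eta * (-2) * X.T @ (y - X @ w)   # gradient of the squared loss at w = 0
pred_gd = w @ x_q

# Attention-style form: prediction = 2 * eta * sum_i y_i * (x_i . x_q).
pred_attn = 2 * eta * y @ (X @ x_q)

print(np.allclose(pred_gd, pred_attn))  # True: one GD step == attention readout
```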
Mesa-Optimization - LessWrong