Reward misspecification problem
A failure mode where the specified reward does not capture the intended objective, so an agent optimizing it drifts in unexpected and unintended directions
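A minimal toy sketch of the idea (not from the source; the track layout, reward values, and policy names below are illustrative assumptions): the agent is scored by a proxy reward that only loosely tracks the true goal, and the policy that maximizes the proxy never accomplishes the intended task.

```python
# Minimal sketch of reward misspecification (illustrative assumptions only).
#
# True objective: reach the last cell of a 1-D track.
# Proxy reward:   +1 for every step spent on a "bonus" cell near the start.
# The policy that maximizes the proxy loops near the bonus cell and never
# reaches the goal -- the stated reward fails to capture the real objective.

TRACK_LEN = 10        # cells 0..9, goal at cell 9
BONUS_CELL = 1        # cell that emits the proxy reward

def proxy_reward(pos: int) -> float:
    """Reward the designer actually specified."""
    return 1.0 if pos == BONUS_CELL else 0.0

def true_return(visited: list[int]) -> float:
    """Reward the designer intended: did the agent ever reach the goal?"""
    return 1.0 if (TRACK_LEN - 1) in visited else 0.0

def run_episode(policy, steps: int = 50) -> tuple[float, float]:
    pos, visited, proxy_total = 0, [], 0.0
    for _ in range(steps):
        pos = max(0, min(TRACK_LEN - 1, pos + policy(pos)))  # clamp to track
        visited.append(pos)
        proxy_total += proxy_reward(pos)
    return proxy_total, true_return(visited)

# Policy that exploits the proxy: hover around the bonus cell forever.
proxy_hacker = lambda pos: -1 if pos > BONUS_CELL else 1
# Policy aligned with the intended goal: always move toward the last cell.
goal_seeker = lambda pos: 1

print("proxy hacker:", run_episode(proxy_hacker))  # high proxy score, true return 0.0
print("goal seeker :", run_episode(goal_seeker))   # low proxy score, true return 1.0
```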
Waluigi is Luigi's "villain version", a character created specifically to be Luigi's rival since Luigi originally had no enemy of his own. This illustrates the idea that once a good character is defined, its opposite (an antagonist) conceptually emerges alongside it
In the same way, when a concept repeatedly appears as good in training data, a model can implicitly form its opposite (the negative character) as well
The paperclip maximizer is Nick Bostrom's thought experiment
If a superintelligence is given a simple goal as its reward, it will use any means necessary to achieve it, potentially consuming all of Earth's iron and energy to keep producing paperclips indefinitely
The risk from superintelligence is not limited to the level of a paperclip maximizer
Today's AIs Aren't Paperclip Maximizers. That Doesn't Mean They're Not Risky | AI Frontiers
Peter N. Salib, May 21, 2025 — Classic arguments about AI risk imagined AIs pursuing arbitrary and hard-to-comprehend goals. Large Language Models aren't like that, but they pose risks of their own.
https://www.ai-frontiers.org/articles/todays-ais-arent-paperclip-maximizers

The Waluigi Effect (mega-post) - LessWrong
Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung …
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post


Seonglae Cho