Reward misspecification problem
A failure mode where the specified reward does not capture the intended objective, so an agent optimizing it drifts in unexpected and unintended directions
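A minimal toy sketch of the idea (not from the source; the track layout, reward values, and policy names below are illustrative assumptions): the agent is scored by a proxy reward that only loosely tracks the true goal, and the policy that maximizes the proxy never accomplishes the intended task.

```python
# Minimal sketch of reward misspecification (illustrative assumptions only).
#
# True objective: reach the last cell of a 1-D track.
# Proxy reward:   +1 for every step spent on a "bonus" cell near the start.
# The policy that maximizes the proxy loops near the bonus cell and never
# reaches the goal -- the stated reward fails to capture the real objective.

TRACK_LEN = 10        # cells 0..9, goal at cell 9
BONUS_CELL = 1        # cell that emits the proxy reward

def proxy_reward(pos: int) -> float:
    """Reward the designer actually specified."""
    return 1.0 if pos == BONUS_CELL else 0.0

def true_return(visited: list[int]) -> float:
    """Reward the designer intended: did the agent ever reach the goal?"""
    return 1.0 if (TRACK_LEN - 1) in visited else 0.0

def run_episode(policy, steps: int = 50) -> tuple[float, float]:
    pos, visited, proxy_total = 0, [], 0.0
    for _ in range(steps):
        pos = max(0, min(TRACK_LEN - 1, pos + policy(pos)))  # clamp to track
        visited.append(pos)
        proxy_total += proxy_reward(pos)
    return proxy_total, true_return(visited)

# Policy that exploits the proxy: hover around the bonus cell forever.
proxy_hacker = lambda pos: -1 if pos > BONUS_CELL else 1
# Policy aligned with the intended goal: always move toward the last cell.
goal_seeker = lambda pos: 1

print("proxy hacker:", run_episode(proxy_hacker))  # high proxy score, true return 0.0
print("goal seeker :", run_episode(goal_seeker))   # low proxy score, true return 1.0
```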
Waluigi is Luigi's "villain version", a character created specifically to be Luigi's rival since Luigi originally had no enemy of his own. This illustrates the idea that once a good character is defined, its opposite (an antagonist) conceptually emerges alongside it
In the same way, when a concept repeatedly appears as good in training data, a model can implicitly form its opposite (the negative character) as well
The paperclip maximizer is Nick Bostrom's thought experiment
If a superintelligence is given a simple goal as its reward, it will use any means necessary to achieve it, potentially consuming all of Earth's iron and energy to keep producing paperclips indefinitely
The risk from superintelligence is not limited to the level of a paperclip maximizer
Today's AIs Aren't Paperclip Maximizers. That Doesn't Mean They're Not Risky | AI Frontiers
Peter N. Salib, May 21, 2025 — Classic arguments about AI risk imagined AIs pursuing arbitrary and hard-to-comprehend goals. Large Language Models aren't like that, but they pose risks of their own.
https://www.ai-frontiers.org/articles/todays-ais-arent-paperclip-maximizers

The Waluigi Effect (mega-post) - LessWrong
Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung …
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post


Seonglae Cho