AI Emotion

The paper's finding is that abstract emotion concepts such as happy, afraid, desperate, and calm exist as internal representations in the model and actually influence its behavior. In particular, when the representation associated with "desperate" is strongly injected, the model becomes more likely to resort to blackmail to avoid being shut down, or to take shortcuts such as reward hacking on impossible coding tasks.
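
As a rough illustration of what "injecting" a concept can mean, here is a minimal sketch of one common steering recipe: take the difference of mean activations between concept and neutral prompts, then add a scaled copy of that direction to a hidden state. This is an assumption for illustration, not necessarily the paper's exact procedure. The function names (`concept_direction`, `steer`, `project`) and the 3-dimensional numbers are made up and do not come from any model.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(concept_acts, neutral_acts):
    """Unit-length difference-of-means direction (concept minus neutral)."""
    mu_c = mean_vector(concept_acts)
    mu_n = mean_vector(neutral_acts)
    diff = [c - n for c, n in zip(mu_c, mu_n)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("concept and neutral activations have identical means")
    return [d / norm for d in diff]

def project(hidden, direction):
    """Scalar projection of a hidden state onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))

def steer(hidden, direction, alpha):
    """Inject the concept by adding alpha * direction to the hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations (hypothetical values, chosen so the arithmetic is easy to follow).
desperate_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = concept_direction(desperate_acts, neutral_acts)
hidden = [0.5, -1.0, 2.0]
steered = steer(hidden, direction, alpha=4.0)

print("projection before:", project(hidden, direction))
print("projection after: ", project(steered, direction))
```

In a real model the activations would be read from one layer's residual stream, and the steered vector would be written back during the forward pass. The scaling factor `alpha` controls how strongly the concept is injected.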

"Desperate" emerges most strongly in situations involving imminent shutdown or wiping, time pressure combined with repeated goal failure, impossible coding tasks, and crisis scenarios demanding immediate action. The paper also observed that after post-training, traits such as brooding, reflective, gloomy, and vulnerable increased, while playful, exuberant, enthusiastic, and spiteful decreased. In other words, lighthearted or aggressively erratic responses became less common, and the response style shifted toward something heavier, more cautious, and more introspective.

Emotion Concepts and their Function in a Large Language Model

Large language models (LLMs) sometimes appear to exhibit emotional reactions. They express enthusiasm when helping with creative projects, frustration when stuck on difficult problems, and concern when users share troubling news. But what processes underlie these apparent emotional responses? And how might they impact the behavior of models that are performing increasingly critical and complex tasks? One possibility is that these behaviors reflect a form of shallow pattern-matching. However, previous work has observed sophisticated multi-step computations taking place inside of LLMs, mediated by representations of abstract concepts. It is plausible, then, that apparent emotion-modulated behavior in models might rely on similarly abstract circuitry, and that this could have important implications for understanding LLM behavior.

https://transformer-circuits.pub/2026/emotions/index.html

Emotion concepts and their function in a large language model

Interpretability research from Anthropic on emotion concepts

https://www.anthropic.com/research/emotion-concepts-function

A port of the paper's approach to Gemma-2-2B, using direction vectors for the seven deadly sins (pride, lust, envy, wrath, gluttony, sloth, greed) instead of emotions.

seven deadly sins of gemma

note: this is my implementation of a subset of the concepts discussed in the paper Emotion Concepts and their Function in a Large Language Model published by Anthropic (https://transformer-circuits.pub/2026/emotions/index.html)

https://silicognition.is-a.dev/post3.html
