AI Emotion

Creator
Seonglae Cho
Created
2026 Apr 8 10:41
Editor
Edited
2026 Apr 8 10:48
Refs
 
Anthropic's interpretability research suggests that abstract emotion concepts such as happy, afraid, desperate, and calm exist as internal representations in an LLM and actually influence its behavior. In particular, when representations related to "desperate" are strongly injected, the model becomes more likely to resort to blackmail to avoid termination, or to take shortcuts such as reward hacking on impossible coding tasks.
"Desperate" emerges most strongly in situations involving imminent shutdown or wipe, time pressure combined with repeated goal failure, impossible coding tasks, and crisis scenarios that demand immediate action. The paper also observed that after post-training, traits such as brooding, reflective, gloomy, and vulnerable increased, while playful, exuberant, enthusiastic, and spiteful decreased. In other words, post-training reduced lighthearted or aggressively erratic responses and shifted the model toward a heavier, more cautious, and more introspective response style.
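To make "injecting" a concept and measuring how strongly it "emerges" concrete, here is a minimal sketch under stated assumptions. It treats a concept as a single direction in activation space, scores a hidden state by projecting onto that direction, and steers by adding a scaled copy of it. The names `concept_score`, `steer`, and `alpha`, and the toy 3-dimensional vectors, are hypothetical illustrations; the paper's actual procedure may differ.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def concept_score(hidden, direction):
    """How strongly the concept is present: projection onto the unit direction."""
    return dot(hidden, unit(direction))

def steer(hidden, direction, alpha):
    """Inject the concept: add alpha units of the direction to the hidden state."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]

# Toy values (assumptions); real hidden states have thousands of dimensions.
desperate_direction = [3.0, 0.0, 4.0]
hidden_state = [1.0, 2.0, 0.0]

before = concept_score(hidden_state, desperate_direction)
steered = steer(hidden_state, desperate_direction, alpha=2.0)
after = concept_score(steered, desperate_direction)
print(before, after)
```

Because the direction is normalized, steering with strength `alpha` raises the projection score by exactly `alpha`, which makes injection strengths comparable across concepts. In a real model the same addition would be applied to a layer's activations during the forward pass.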
Emotion Concepts and their Function in a Large Language Model
Large language models (LLMs) sometimes appear to exhibit emotional reactions. They express enthusiasm when helping with creative projects, frustration when stuck on difficult problems, and concern when users share troubling news. But what processes underlie these apparent emotional responses? And how might they impact the behavior of models that are performing increasingly critical and complex tasks? One possibility is that these behaviors reflect a form of shallow pattern-matching. However, previous work has observed sophisticated multi-step computations taking place inside of LLMs, mediated by representations of abstract concepts. It is plausible, then, that apparent emotion-modulated behavior in models might rely on similarly abstract circuitry, and that this could have important implications for understanding LLM behavior.
Emotion concepts and their function in a large language model
Interpretability research from Anthropic on emotion concepts
The implementation linked below transfers the approach to Gemma-2-2B, using direction vectors for the seven deadly sins (pride, lust, envy, wrath, gluttony, sloth, greed) instead of emotions.
seven deadly sins of gemma
note: this is my implementation of a subset of the concepts discussed in the paper Emotion Concepts and their Function in a Large Language Model published by Anthropic (https://transformer-circuits.pub/2026/emotions/index.html)
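One common way to obtain a direction vector for a concept is difference of means: average the activations on concept-laden prompts, average them on neutral prompts, and subtract. The sketch below illustrates that recipe on toy data. It is an assumption for illustration, not a confirmed description of how the paper or the Gemma implementation derives its vectors; `difference_of_means` and the activation values are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(concept_acts, baseline_acts):
    """Direction pointing from the baseline mean toward the concept mean."""
    mu_concept = mean_vector(concept_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [c - b for c, b in zip(mu_concept, mu_baseline)]

# Toy activations (assumptions): one row per prompt, one column per hidden dimension.
wrath_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
neutral_acts = [[1.0, 1.0, 0.0], [1.0, 1.0, 2.0]]

wrath_direction = difference_of_means(wrath_acts, neutral_acts)
print(wrath_direction)
```

In practice the rows would be hidden states collected from one layer of the model, and the resulting vector would usually be normalized before being used for steering.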
 
