Persona Vector

Creator
Seonglae Cho
Created
2025 Dec 19 14:58
Edited
2025 Dec 27 23:48
Refs

Role Vector

Instead of specifying a role (persona) in the prompt, extract role vectors from the model's internal activations and manipulate them directly. A role direction is computed as the difference in means (diff-in-means) between activations on role-specific data and activations on general data. The direction can then be applied in two ways:
  • Addition (activation addition) → increases performance in that domain ↑
  • Removal (directional ablation) → decreases performance in that domain ↓
Adding a role vector gives significant performance improvements in related domains (MMLU) with minimal impact on unrelated domains. Manipulating internal representations is more effective than a prompt-based persona, and larger models show clearer role directions.
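The diff-in-means, addition, and ablation operations can be illustrated with a minimal sketch. It assumes activations are available as plain lists of floats captured at a single layer; the function names, the `alpha` scaling factor, and the toy 3-dimensional data are all hypothetical and not taken from the original work.

```python
# Toy sketch: each activation is a list of floats standing in for a
# hidden-state vector captured at one layer (an assumption).

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def role_direction(role_acts, general_acts):
    """Diff-in-means: mean(role activations) - mean(general activations)."""
    mu_role = mean_vector(role_acts)
    mu_general = mean_vector(general_acts)
    return [r - g for r, g in zip(mu_role, mu_general)]

def add_direction(act, direction, alpha=1.0):
    """Activation addition: shift the activation along the role direction."""
    return [a + alpha * d for a, d in zip(act, direction)]

def ablate_direction(act, direction):
    """Directional ablation: remove the component of act along direction."""
    coeff = dot(act, direction) / dot(direction, direction)
    return [a - coeff * d for a, d in zip(act, direction)]

# Hypothetical activations for role-specific and general prompts
role_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
general_acts = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

direction = role_direction(role_acts, general_acts)
steered = add_direction([1.0, 1.0, 1.0], direction, alpha=0.5)
ablated = ablate_direction([3.0, 1.0, 1.0], direction)

print(direction)  # [2.0, 0.0, 0.0]
print(steered)    # [2.0, 1.0, 1.0]
print(ablated)    # [0.0, 1.0, 1.0]
```

In a real model the same arithmetic would be applied to hidden states inside a forward hook. The choice of layer and of `alpha` is left open here because the note does not specify them.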
LLM personality traits (maliciousness, sycophancy, hallucination, etc.) can be represented as linear directions in activation space (persona vectors). Extracting these vectors automatically makes it possible to monitor and control personality changes during deployment, to predict and mitigate unintended personality shifts during fine-tuning, and to apply steering either for post-hoc mitigation or for proactive prevention.
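One plausible way to use a persona vector for monitoring and post-hoc mitigation is sketched below. It assumes monitoring means reading the scalar projection of an activation onto the unit persona direction, and that mitigation subtracts a scaled copy of that direction; `projection_score`, `steer_away`, `alpha`, and the 2-dimensional toy values are hypothetical names and data chosen for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def projection_score(act, persona_vec):
    """Scalar projection of an activation onto the unit persona direction.

    A larger score is read here as stronger expression of the trait
    (an assumption of this sketch).
    """
    norm = math.sqrt(dot(persona_vec, persona_vec))
    return dot(act, persona_vec) / norm

def steer_away(act, persona_vec, alpha):
    """Mitigation sketch: subtract alpha times the unit persona direction."""
    norm = math.sqrt(dot(persona_vec, persona_vec))
    return [a - alpha * p / norm for a, p in zip(act, persona_vec)]

# Hypothetical persona vector and activation
persona_vec = [3.0, 4.0]
act = [6.0, 8.0]

before = projection_score(act, persona_vec)
mitigated = steer_away(act, persona_vec, alpha=5.0)
after = projection_score(mitigated, persona_vec)

print(before)     # 10.0
print(mitigated)  # [3.0, 4.0]
print(after)      # 5.0
```

Tracking the score over time (for example across fine-tuning checkpoints or across conversation turns) would give a simple signal of a shift along the trait direction, under the assumptions stated above.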
 
 
