Role Vector
Instead of specifying a role (persona) in the prompt, extract role vectors from the model's internal activations and manipulate them directly. A role direction is computed as the difference-in-means between activations on role-specific data and on general data:
- Addition (activation addition) → increases performance in that domain ↑
- Removal (directional ablation) → decreases performance in that domain ↓
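The two operations above can be sketched with plain vector arithmetic. This is a minimal illustration with random stand-in data: in practice the activations would be collected from a chosen transformer layer (e.g., via forward hooks), and the scale `alpha` is a hypothetical steering coefficient.

```python
import numpy as np

# Stand-in activations: rows are hidden states at one layer.
# Real usage would collect these from role-specific vs. general prompts.
rng = np.random.default_rng(0)
d_model = 64
role_acts = rng.normal(loc=0.5, scale=1.0, size=(100, d_model))
general_acts = rng.normal(loc=0.0, scale=1.0, size=(100, d_model))

# Difference-in-means role direction, normalized to unit length.
role_dir = role_acts.mean(axis=0) - general_acts.mean(axis=0)
role_dir /= np.linalg.norm(role_dir)

def activation_addition(h, direction, alpha=4.0):
    """Steer a hidden state toward the role by adding the direction."""
    return h + alpha * direction

def directional_ablation(h, direction):
    """Remove the role by projecting out its component (unit direction)."""
    return h - np.dot(h, direction) * direction

h = rng.normal(size=d_model)
h_add = activation_addition(h, role_dir)
h_abl = directional_ablation(h, role_dir)

# After ablation, the component along the role direction is (near-)zero.
print(round(abs(float(np.dot(h_abl, role_dir))), 6))  # 0.0
```

Addition increases the hidden state's component along the role direction, while ablation removes it entirely, matching the performance-up / performance-down effects noted above.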
Adding a role vector yields significant performance improvements on related domains (MMLU) with minimal impact on unrelated ones. Manipulating internal representations is more effective than prompt-based personas, and larger models exhibit clearer role directions.
LLM personality traits (maliciousness, sycophancy, hallucination, etc.) can be represented as linear directions in activation space (persona vectors). By extracting these vectors automatically, one can monitor and control personality changes during deployment, predict and mitigate unintended personality shifts during fine-tuning, and use steering for post-hoc mitigation or proactive prevention.
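Monitoring via persona vectors can be sketched as a projection check: score each hidden state by its (standardized) projection onto the persona vector and flag large deviations from a baseline. All data and thresholds here are hypothetical placeholders, not the published method's values.

```python
import numpy as np

# Hypothetical monitoring sketch: detect persona drift by projecting
# hidden states onto a persona vector and comparing against a baseline.
rng = np.random.default_rng(1)
d_model = 64
persona_vec = rng.normal(size=d_model)
persona_vec /= np.linalg.norm(persona_vec)

# Baseline projections, e.g. from activations on vetted outputs.
baseline = rng.normal(size=(200, d_model))
proj = baseline @ persona_vec
mu, sigma = proj.mean(), proj.std()

def persona_score(h):
    """Standardized projection of a hidden state onto the persona vector."""
    return (np.dot(h, persona_vec) - mu) / sigma

# A hidden state pushed along the persona direction scores far above
# baseline, which would flag a personality shift during deployment
# or after fine-tuning.
h_shifted = 5.0 * persona_vec
print(persona_score(h_shifted) > 3.0)
```

The same score could drive steering: subtract (or add) a multiple of `persona_vec` when the projection crosses a threshold, corresponding to the post-hoc mitigation and proactive prevention mentioned above.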

Seonglae Cho