There is significant empirical evidence suggesting that neural networks have interpretable linear directions in activation space (Towards interpretable gpt2; https://aclanthology.org/N13-1090.pdf).
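To make "linear direction in activation space" concrete, here is a minimal sketch on synthetic data. It is not drawn from the cited sources and is not evidence for the claim. It assumes a toy setting in which a binary feature is planted along one fixed unit vector in Gaussian noise, and it uses a difference-of-means estimate as one simple (assumed) way to recover that direction. The names `make_toy_activations` and `difference_of_means_direction` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def make_toy_activations(n=2000, dim=64, strength=3.0, seed=0):
    """Synthetic 'activations': Gaussian noise plus a planted feature direction.

    Assumption for illustration only: the binary feature is written
    linearly along a single unit vector `true_dir`.
    """
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=dim)
    true_dir /= np.linalg.norm(true_dir)
    labels = rng.integers(0, 2, size=n)  # 0 = feature absent, 1 = present
    noise = rng.normal(size=(n, dim))
    acts = noise + strength * labels[:, None] * true_dir
    return acts, labels, true_dir


def difference_of_means_direction(acts_pos, acts_neg):
    """Unit vector pointing from the mean of acts_neg to the mean of acts_pos."""
    diff = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


acts, labels, true_dir = make_toy_activations()
est_dir = difference_of_means_direction(acts[labels == 1], acts[labels == 0])

# How well does the estimated direction align with the planted one?
cosine = float(est_dir @ true_dir)

# Reading the feature off with a single dot product and a midpoint threshold.
projections = acts @ est_dir
threshold = 0.5 * (
    projections[labels == 1].mean() + projections[labels == 0].mean()
)
accuracy = float(((projections > threshold) == (labels == 1)).mean())

print(f"cosine similarity to planted direction: {cosine:.3f}")
print(f"accuracy of 1-D linear readout: {accuracy:.3f}")
```

In this toy setting the feature can be read off with one dot product, which is the sense of "linear direction" used here. Whether real network activations behave this way is the empirical question the sentence above refers to.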