Different models learn similar features and circuits when trained on similar tasks.
Even the discovery of similar circuits in humans and AI supports this claim.
Neurons in the left prefrontal cortex respond to words in a way that resembles AI neuron activations (the paper actually compares them to word embeddings):
Semantic encoding during language comprehension at single-cell resolution
After matching feature neurons across models via activation correlation, the authors apply representational space similarity metrics such as Singular Value Canonical Correlation Analysis (SVCCA). Their experiments reveal similarities in SAE feature spaces across various LLMs, providing evidence for feature universality.
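
A minimal sketch of the two steps described above (correlation-based feature matching, then an SVCCA score), assuming each model's SAE activations are stored as a samples-by-features NumPy array computed on the same inputs. The function names `match_features_by_correlation` and `svcca`, the `keep` variance threshold, and the synthetic data are all illustrative assumptions, not the authors' code.

```python
import numpy as np

def match_features_by_correlation(acts_a, acts_b, eps=1e-8):
    """For each feature (column) of A, find the most correlated feature of B."""
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + eps)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + eps)
    corr = za.T @ zb / acts_a.shape[0]  # Pearson correlation matrix (A x B)
    return corr.argmax(axis=1), corr.max(axis=1)

def svcca(x, y, keep=0.99):
    """Mean canonical correlation after SVD truncation of each view."""
    def top_subspace(m):
        m = m - m.mean(axis=0)
        u, s, _ = np.linalg.svd(m, full_matrices=False)
        explained = np.cumsum(s**2) / np.sum(s**2)
        k = min(int(np.searchsorted(explained, keep)) + 1, len(s))
        return u[:, :k]  # orthonormal basis of the top-k directions
    qx, qy = top_subspace(x), top_subspace(y)
    # Canonical correlations are the singular values of qx^T qy.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())

# Synthetic stand-in for two models' SAE activations (assumption):
# model B holds the same features as model A, permuted, rescaled, and noised.
rng = np.random.default_rng(0)
acts_a = np.maximum(rng.normal(size=(500, 8)), 0.0)
perm = rng.permutation(8)
acts_b = 2.0 * acts_a[:, perm] + 0.01 * rng.normal(size=(500, 8))

idx, best_corr = match_features_by_correlation(acts_a, acts_b)
score = svcca(acts_a, acts_b[:, idx])  # computed on the matched features
print(idx, score)
```

In this toy setup the matching should recover the permutation and the SVCCA score should be close to 1, while two unrelated activation matrices should score much lower. Real SAE feature spaces are far larger and noisier, so both the matching and the score would need filtering and baselines that this sketch leaves out.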