SAE Feature Specificity

Creator

Seonglae Cho

Created

2025 Feb 4 15:27

Editor

Seonglae Cho

Edited

2025 Feb 19 13:25

A feature is specific if, whenever it is active, the concept it represents is reliably present in the context.
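One simple way to make this definition concrete is to treat specificity as a precision-style ratio: of the tokens where the feature fires, what fraction actually contain the concept? The sketch below assumes that per-token activations and binary concept labels are already available; the function name `feature_specificity`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the referenced paper, which scores activating examples against a rubric instead.

```python
from typing import Sequence


def feature_specificity(
    activations: Sequence[float],
    concept_present: Sequence[bool],
    threshold: float = 0.0,
) -> float:
    """Fraction of active positions where the concept is present.

    Illustrative definition (an assumption, not the paper's metric):
    a position counts as "active" when its activation exceeds `threshold`.
    Returns NaN when the feature never fires.
    """
    if len(activations) != len(concept_present):
        raise ValueError("activations and concept_present must have equal length")

    # Keep the concept label for every position where the feature fires.
    fired = [
        present
        for act, present in zip(activations, concept_present)
        if act > threshold
    ]
    if not fired:
        return float("nan")
    return sum(fired) / len(fired)


# Toy data: six token positions with hypothetical activations and labels.
acts = [0.0, 2.1, 0.0, 3.4, 0.7, 0.0]
labels = [False, True, True, True, False, False]

print(feature_specificity(acts, labels))                 # all activations > 0
print(feature_specificity(acts, labels, threshold=1.0))  # strong activations only
```

Raising the threshold restricts the check to strong activations, which is where a specific feature would be expected to match its concept most reliably. Note that the position labelled `True` where the feature stays silent does not lower the score: that is a miss of sensitivity, not of specificity.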

Binary Detection

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#assessing-tour-specificity
