SAE Latent Unit
SAEs enable feature disentanglement from Superposition (a minimal SAE sketch follows these notes).
Features respond to both abstract discussions and concrete examples.
If there are features representing more abstract properties of the input, might there also be more abstract, higher-level actions that trigger behaviors over the span of multiple tokens?
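To ground these notes, here is a minimal sketch of the SAE setup they refer to: a linear encoder with a ReLU, a linear decoder, and a reconstruction loss plus an L1 sparsity penalty. The pre-decoder bias subtraction, layer sizes, and sparsity coefficient are illustrative assumptions, not values from any particular paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: d_model activations -> d_sae sparse latents -> reconstruction."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode: each latent unit is a candidate "feature".
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the activation as a sparse sum of decoder directions.
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes latents toward sparsity.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage: d_model=512 residual-stream activations, 8x expansion (illustrative numbers).
sae = SparseAutoencoder(d_model=512, d_sae=4096)
x = torch.randn(64, 512)            # a batch of model activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```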
SAE Feature Notion
SAE Feature Steering
SAE Feature Universality
SAE Feature Structure
SAE Feature Distribution
SAE Feature Matching
SAE Feature Splitting
SAE Feature Visualization
SAE Feature Shrinkage
SAE Feature Absorption
SAE pathological error
SAE Bounce Plot
SAE feature importance
SAE Feature Direction
SAE Feature Stitching
SAE Feature Circuit
SAE Feature Metrics
Insane visualization of feature activation distribution with conditional distribution and log-likelihood
What is a Feature: the simplest factorization of the activations (SAE MDL)
As the investigation of superposition has progressed, it has become clear that we don't really know what a "feature" is, despite features being central to the research agenda. Given a sparse factorization of the activations (a dictionary of feature directions combined with a sparse code matrix), start by defining "total information" as the entropy of a probability distribution fitted to the entries of the two matrices. Larger dictionaries require more information to represent, but sparser codes require less, so the two effects can counterbalance. Measuring this total information turns out to be an effective tool for determining the "correct" number of features, at least on synthetic data: Anthropic observes that dictionary-learning solutions "bounce" when the dictionary size matches the true number of factors, which is also where the best MMCS score occurs. If such bounces could be found in real data, it would be significant evidence that there are "real features" to be found.
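A rough sketch of how the "total information" idea could be operationalized on synthetic data. The histogram entropy estimator, the use of scikit-learn's DictionaryLearning, and all sizes and coefficients are assumptions for illustration; the original write-up's exact procedure may differ, and a clear bounce is not guaranteed at this scale.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def entropy_bits_per_entry(matrix: np.ndarray, n_bins: int = 64) -> float:
    """Histogram-based estimate of the entropy (bits per entry) of a matrix's entries."""
    counts, _ = np.histogram(matrix.ravel(), bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def total_information(dictionary: np.ndarray, codes: np.ndarray) -> float:
    """Bits to describe both factors of activations ~ codes @ dictionary.
    More dictionary rows cost more bits, but sparser codes (mass piled at zero)
    cost fewer bits per entry, so the two terms can trade off."""
    return (dictionary.size * entropy_bits_per_entry(dictionary)
            + codes.size * entropy_bits_per_entry(codes))

# Synthetic activations built from a known number of sparse ground-truth factors.
rng = np.random.default_rng(0)
true_k, d_model, n_samples = 16, 32, 1000
ground_truth = rng.normal(size=(true_k, d_model))
sparse_codes = rng.binomial(1, 0.1, size=(n_samples, true_k)) * rng.normal(size=(n_samples, true_k))
X = sparse_codes @ ground_truth

# Scan dictionary sizes and look for a dip/"bounce" near the true factor count.
for k in (4, 8, 16, 32, 64):
    dl = DictionaryLearning(n_components=k, alpha=1.0, max_iter=100, random_state=0)
    codes = dl.fit_transform(X)
    print(f"dict size {k:3d}: total information ~= {total_information(dl.components_, codes):,.0f} bits")
```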
A high activation density could mean either that sparsity was not properly learned or that the feature is genuinely needed in many different situations. In the Feature Browser, SAE features look more interpretable in their high-activation quantiles, which points to a limitation: SAE features tend to have low interpretability at low activation values, and interpretability is skewed toward the upper tail of the activation distribution.
However, the features with the highest activation density in the activation distribution are less interpretable, mainly because their activations are rarely large in absolute terms (as opposed to quantile). A well-separated, highly interpretable SAE feature should not have a density that simply decreases with activation value; rather, after an initial decrease, the density should cluster again at high activation levels (see the sketch below).
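One way to make this profile check concrete: a crude score that compares the density of high activations against the trough of the distribution's lower range. The binning, the half/half split, and the synthetic activations are illustrative assumptions, not an established metric.

```python
import numpy as np

def high_cluster_score(feature_acts: np.ndarray, n_bins: int = 20) -> float:
    """Crude score for the activation profile described above.

    Bins a feature's nonzero activations and returns the peak density in the
    upper half of the activation range divided by the minimum density in the
    lower half (floored at one count per bin). A density that simply decays
    from zero scores close to 1; a secondary cluster of high activations
    scores much larger than 1.
    """
    acts = feature_acts[feature_acts > 0]
    if acts.size < 2 * n_bins:
        return float("nan")
    density, edges = np.histogram(acts, bins=n_bins, density=True)
    floor = 1.0 / (acts.size * (edges[1] - edges[0]))  # density of a single count
    lower, upper = density[: n_bins // 2], density[n_bins // 2:]
    return float(upper.max() / max(lower.min(), floor))

# Hypothetical activations for one SAE feature (one column of an
# (n_tokens, d_sae) latent matrix), drawn here from synthetic distributions.
rng = np.random.default_rng(0)
clustered = np.concatenate([rng.exponential(0.3, 700), rng.normal(4.0, 0.4, 300)])
decaying = rng.exponential(0.5, 1000)
print(high_cluster_score(clustered))  # large: density rises again at high values
print(high_cluster_score(decaying))   # close to 1: density just decays from zero
```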
Automated Interpretability Metrics proposal from EleutherAI