Minimum Description Length
An explanation of some phenomenon $X$ is a statement $e$ for which knowing $e$ gives some information about $X$. An explanation is typically a natural-language statement.
The Description Length (DL) of an explanation $e$ is given as $\mathrm{DL}(e) = C(e)$, where $C(\cdot)$ is a metric denoting the number of bits needed to send the explanation through a communication channel.
For an SAE, $\mathrm{DL} = N \cdot C(\text{activations}) + C(\text{weights})$: the cost of sending the latent activations for every input, plus the one-off cost of sending the SAE weights. The first term is $O(N)$ and the second term is $O(1)$ in the dataset size $N$, so the first term dominates in the large-$N$ regime.
We say an SAE is MDL-optimal if it attains this minimum.
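A toy sketch of the two terms (all numbers here are hypothetical, chosen only to show the scaling):

```python
def total_dl(n_tokens, bits_per_token_activations, bits_for_weights):
    """Total description length: per-token activation cost, which scales
    linearly with the dataset, plus a fixed one-off cost for the weights."""
    return n_tokens * bits_per_token_activations + bits_for_weights

# Hypothetical: ~1405 bits/token for activations and ~1e9 bits for weights.
small = total_dl(10_000, 1405, 1_000_000_000)        # weight cost dominates
large = total_dl(1_000_000_000, 1405, 1_000_000_000)  # activation cost dominates

# In the large-N regime the O(N) activation term swamps the O(1) weight term.
print(large / (1_000_000_000 * 1405))  # close to 1
```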
SAEs should be sparse, but not too sparse
An upper bound on the per-token DL is:

$$\mathrm{DL} \leq L_0 \, p + \log_2 \binom{D}{L_0}$$

where $p$ is the effective precision of each float and $\log_2 \binom{D}{L_0}$ is the number of bits required to specify which of the $D$ features are active.
- Sparsity is a key component of minimizing description length (DL)
- There’s an inherent trade-off between decreasing L0 and decreasing the dictionary size in order to reduce description length
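A minimal sketch of this trade-off, assuming the per-token bound takes the form $L_0 \, p + \log_2 \binom{D}{L_0}$ (the helper name and all numbers here are illustrative):

```python
from math import comb, log2

def dl_per_token(l0, d, p=7):
    """Assumed upper bound: L0 active floats at p bits each, plus the bits
    needed to specify which of the D dictionary features are active."""
    return l0 * p + log2(comb(d, l0))

# Decreasing L0 shrinks the float term; decreasing D shrinks the index term.
# But a sparser code generally needs a larger dictionary to stay faithful,
# so the two knobs cannot both be pushed down freely:
print(dl_per_token(32, 16_384))    # moderate sparsity, moderate dictionary
print(dl_per_token(64, 16_384))    # denser code, same dictionary
print(dl_per_token(32, 131_072))   # same sparsity, larger dictionary
```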
Minimum Description Length prefers hierarchical features
- Optimizing for MDL can reduce undesirable feature splitting
- Hierarchical features allow for more efficient coding schemes
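One way to see this: a hierarchical code can send a general parent feature and pay for a fine-grained refinement only on the tokens where it matters, whereas a fully split flat dictionary always pays for the most specific feature. A toy calculation with hypothetical parameters:

```python
from math import log2

# Hypothetical setup: a flat SAE splits each parent feature into 16
# specialized children, multiplying the dictionary size by 16.
d_parent = 1024                    # dictionary of parent features
n_children = 16                    # refinements per parent
d_flat = d_parent * n_children     # flat dictionary after full splitting

refine_rate = 0.25                 # fraction of tokens where the refinement matters

flat_bits = log2(d_flat)           # always index the specific child feature
hier_bits = log2(d_parent) + refine_rate * log2(n_children)

print(flat_bits, hier_bits)  # 14.0 vs 11.0: the hierarchical code is cheaper
```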
How to use DL as a metric
The comparison is slightly unfair because the SAE is lossy:
- DL per input token:
- GPT2 itself: 5376 bits per token
- SAE for GPT2: 1405 bits per token
- One-hot encoding (the dictionary has a row for each neural activation in the dataset, so $L_0 = 1$ and $D$ equals the dataset size): 13,993 bits per token
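As a sanity check on the first figure: assuming GPT2-small's residual-stream dimension of 768, the 5376 bits per token corresponds to an effective precision of 7 bits per float:

```python
# Assumed parameters: GPT2-small residual stream dimension 768, with an
# effective precision of 7 bits per float when sending raw activations.
d_model = 768
p_bits = 7
print(d_model * p_bits)  # 5376 bits per token, matching the figure above
```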