MDL SAE

Creator

Seonglae Cho

Created

2024 Nov 18 19:44

Editor

Seonglae Cho

Edited

2025 Nov 27 15:57

Some features may carry only binary (on/off) information, while others may require higher-precision activation values.
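The note above on binary versus higher-precision features can be made concrete with a rough bit count. The sketch below is a simplified illustration, not the formula from the referenced posts: it assumes each active feature costs log2(dictionary size) bits to name its index plus some number of bits for its activation value. The function name `description_length_bits` and all numbers are hypothetical.

```python
import math


def description_length_bits(value_bits_per_active_feature, dict_size):
    """Approximate bits needed to transmit one sparse code.

    Assumption (for illustration only): every active feature costs
    log2(dict_size) bits to identify its index, plus the given number
    of bits to encode its activation value. A purely binary feature
    needs 0 value bits because its presence is the whole message.
    """
    index_bits = math.log2(dict_size)
    return sum(index_bits + bits for bits in value_bits_per_active_feature)


# Hypothetical code with three active features in a 1024-feature dictionary:
# two binary features (0 value bits) and one needing 8 bits of precision.
mixed = description_length_bits([0, 0, 8], dict_size=1024)

# The same three features if all were forced to use 8 bits of precision.
uniform = description_length_bits([8, 8, 8], dict_size=1024)

print(mixed, uniform)
```

Under these assumptions, allowing per-feature precision gives a shorter description than encoding every activation at the same precision, which is the intuition behind treating conciseness as part of the explanation's quality.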

Because an SAE's basis is overcomplete, the same activations can be explained in multiple ways, which makes the interpretation ambiguous (problem statement).

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution — LessWrong

This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort. …

https://www.lesswrong.com/posts/vNCAQLcJSzTgjPaWS/standard-saes-might-be-incoherent-a-choosing-problem-and-a

reconsidering

https://arxiv.org/pdf/2410.11179

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs — LessWrong

We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise.

https://www.lesswrong.com/posts/G2oyFQFTE5eGEas6m/interpretability-as-compression-reconsidering-sae
