
MDL SAE

Creator
Seonglae Cho
Created
2024 Nov 18 19:44
Editor
Seonglae Cho
Edited
2025 Apr 15 1:1
Refs
VBR
SAE MDL
Some features may carry only binary (on/off) information, while others may need to be communicated at higher precision.
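As a rough illustration of the point above, the sketch below estimates how many bits one feature's activation costs after quantization. The feature distributions, the bin count, and the helper name `entropy_bits` are assumptions made for this example and are not taken from the referenced posts.

```python
import numpy as np


def entropy_bits(values, n_bins):
    """Empirical entropy (in bits) of values after uniform quantization into n_bins."""
    counts, _ = np.histogram(values, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


rng = np.random.default_rng(0)
n_tokens = 10_000

# Hypothetical on/off feature: fires on roughly 10% of tokens, always with value 1.
binary_feature = (rng.random(n_tokens) < 0.1).astype(float)

# Hypothetical graded feature: its magnitude matters, spread uniformly over [0, 1).
graded_feature = rng.random(n_tokens)

binary_bits = entropy_bits(binary_feature, n_bins=16)
graded_bits = entropy_bits(graded_feature, n_bins=16)

print(f"binary feature: {binary_bits:.2f} bits per token")
print(f"graded feature: {graded_bits:.2f} bits per token")
```

Under these assumptions the on/off feature never costs more than one bit however finely it is quantized, while the graded feature uses nearly all of the bits the quantization grid offers.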
The overcomplete basis of an SAE means the same activation can be decomposed, and therefore interpreted, in multiple ways (problem statement)
Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution — LessWrong
This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort. …
https://www.lesswrong.com/posts/vNCAQLcJSzTgjPaWS/standard-saes-might-be-incoherent-a-choosing-problem-and-a
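The choosing problem can be seen in a toy case. The sketch below uses a hypothetical 2-D activation space with three dictionary directions (so the dictionary is overcomplete) and shows two different sparse codes that reconstruct the same activation exactly. The dictionary `D` and the vector `x` are invented for illustration.

```python
import numpy as np

# Hypothetical overcomplete dictionary: 3 unit-norm directions in a 2-D space.
D = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0 / np.sqrt(2.0), 1.0 / np.sqrt(2.0)],
])

x = np.array([1.0, 1.0])  # the activation to explain

# Explanation A: two "axis" features are active.
code_a = np.array([1.0, 1.0, 0.0])
# Explanation B: a single "diagonal" feature is active.
code_b = np.array([0.0, 0.0, np.sqrt(2.0)])

recon_a = code_a @ D
recon_b = code_b @ D

print("A reconstructs x:", np.allclose(recon_a, x), "| active features:", np.count_nonzero(code_a))
print("B reconstructs x:", np.allclose(recon_b, x), "| active features:", np.count_nonzero(code_b))
```

Both codes have zero reconstruction error, so reconstruction loss alone cannot pick between them; an extra criterion such as sparsity or conciseness is needed.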
Reconsidering MDL
arxiv.org
https://arxiv.org/pdf/2410.11179
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs — LessWrong
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise.
https://www.lesswrong.com/posts/G2oyFQFTE5eGEas6m/interpretability-as-compression-reconsidering-sae
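To make the compression view concrete, here is a minimal sketch of a description-length estimate under one simple, assumed coding scheme: each active feature is sent as an index into the dictionary plus a quantized value. The scheme, the function name `code_length_bits`, and the example configurations are illustrative assumptions and may differ from the accounting used in the paper.

```python
import math


def code_length_bits(n_features, l0, value_bits):
    """Bits per token when each of l0 active features is sent as (index, quantized value)."""
    index_bits = math.log2(n_features)  # cost of naming one feature in the dictionary
    return l0 * (index_bits + value_bits)


# Hypothetical SAE configurations: (dictionary size, average active features per token).
narrow_dense = code_length_bits(n_features=1024, l0=32, value_bits=8)
wide_sparse = code_length_bits(n_features=65536, l0=8, value_bits=8)

print(f"narrow, denser SAE: {narrow_dense:.0f} bits per token")
print(f"wide, sparser SAE:  {wide_sparse:.0f} bits per token")
```

This counts only the cost of the explanation. Under the MDL view described above, a shorter code is preferable only if the reconstruction stays accurate enough, so the comparison should be made at matched reconstruction quality.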

Copyright Seonglae Cho