SAE Feature Structure

Creator

Creator

Seonglae Cho

Created

Created

2025 Jan 8 20:25

Editor

Editor

Seonglae Cho

Edited

Edited

2025 Mar 6 0:8

Refs

Refs

Multiple Correlation

Partial correlation

SAE latent vectors are not independent, but rather form clusters that activate together in predictable ways. While functionally separate, there are actual dependencies, making interactions and compositional characteristics important for interpretability. This is particularly evident in smaller SAEs, and these clusters can be effectively analyzed through L0 regularization.

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces — LessWrong

Matthew A. Clarke, Hardik Bhatnagar and Joseph Bloom

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces — LessWrong

https://www.lesswrong.com/posts/WNoqEivcCSg8gJe5h/compositionality-and-ambiguity-latent-co-occurrence-and

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces — LessWrong

SAE Latent Cooccurrence Explorer (

sae_cooccurrence

MClarke1991 • Updated 2025 Mar 25 10:47

)

When two features frequently activate at the same time, we say they co-occur (high correlation)

Feature Cooccurrence Explorer

This app was built in Streamlit! Check it out and visit https://streamlit.io for more awesome community apps. 🎈

Feature Cooccurrence Explorer

https://feature-cooccurrence.streamlit.app/

Feature Cooccurrence Explorer

Topological Data Analysis

Graph Modeling of SAE features displayed Relationship relevant features developing along the layers and latter layers involves more complex features.

Topological Data Analysis and Mechanistic Interpretability — LessWrong

This article was written in response to a post on LessWrong from the Apollo Research interpretability team. This post represents our initial attempts…

Topological Data Analysis and Mechanistic Interpretability — LessWrong

https://www.lesswrong.com/posts/6oF6pRr2FgjTmiHus/topological-data-analysis-and-mechanistic-interpretability

Topological Data Analysis and Mechanistic Interpretability — LessWrong

Recommendations

/////////////