Decision SAE

Creator
Seonglae Cho
Created
2024 Oct 31 11:10
Edited
2025 Jul 14 13:50
Refs

Mechanistic interpretability

Limitations of SAEs' ability to extract all features
Chess (rejected)

Othello

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
[3/7 Edit: I have rephrased the bolded claims in the abstract per this comment from Joseph Bloom, hopefully improving the heat-to-light ratio. Commenters have also suggested training on earlier layers and using untied weights, and in my experiments this increases the number of classifiers found, so the headline number should be 33/180 features, up from 9/180. See this comment for updated results.]

A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique to find interpretable features in language models (Cunningham et al., Anthropic's Bricken et al.). We train a sparse autoencoder on OthelloGPT, a language model trained on transcripts of the board game Othello, which has been shown to contain a linear representation of the board state, findable by supervised probes. The sparse autoencoder finds 9 features which serve as high-accuracy classifiers of the board state, out of 180 findable with supervised probes (and 192 possible piece/position combinations) [edit: 33/180 features, see this comment]. Across random seeds, the autoencoder repeatedly finds "simpler" features concentrated on the center of the board and the corners. This suggests that even if a language model can be interpreted with a human-understandable ontology of interesting, interpretable linear features, a sparse autoencoder might not find a significant number of those features.
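For concreteness, here is a minimal PyTorch sketch of the kind of sparse autoencoder described above: an untied encoder/decoder pair trained to reconstruct model activations under an L1 sparsity penalty. The sizes, hyperparameters, and random stand-in activations are illustrative assumptions, not the post's actual OthelloGPT setup.

```python
# Minimal sketch of an SAE of the kind described above (not the post's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Untied encoder/decoder weights (the variant the post's update recommends).
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)       # reconstruction of the input activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

# Illustrative training step; `acts` stands in for real OthelloGPT activations.
d_model, d_hidden = 512, 4096         # hypothetical sizes
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, d_model)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward(); opt.step(); opt.zero_grad()
```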
Coverage (how many of the given board features are captured) and Board Reconstruction (how accurately the actual board state can be reconstructed using only SAE activations) are two proposed metrics. They distinguish differences in SAE quality better than the existing L0 measure, suggesting they can accelerate interpretability research in settings with clearly defined "correct features", such as board games (Verifiable Reward).
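A hedged sketch of how these two metrics could be computed, assuming each board-state property is scored by the single SAE feature that best classifies it when thresholded at zero. The names (`feats`, `board`) and the exact-match reconstruction criterion are assumptions based on the description above, not the authors' reference implementation.

```python
# Illustrative implementation of Coverage and Board Reconstruction
# (an interpretation of the metrics as described, not the authors' code).
# feats: (N, F) SAE feature activations over N board positions
# board: (N, P) boolean ground-truth board-state properties (e.g. 180 of them)
import numpy as np
from sklearn.metrics import f1_score

def coverage_and_reconstruction(feats, board, threshold=0.0):
    preds = feats > threshold            # binarize feature activations
    P = board.shape[1]
    best_f1 = np.zeros(P)
    best_feat = np.zeros(P, dtype=int)
    for p in range(P):                   # best single-feature classifier per property
        f1s = np.array([f1_score(board[:, p], preds[:, f], zero_division=0)
                        for f in range(preds.shape[1])])
        best_f1[p], best_feat[p] = f1s.max(), f1s.argmax()
    coverage = best_f1.mean()            # Coverage: mean best F1 over properties
    recon = preds[:, best_feat]          # predict each property with its best feature
    board_reconstruction = (recon == board).all(axis=1).mean()  # exact-board accuracy
    return coverage, board_reconstruction
```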

Recommendations