Mechanistic interpretability
Limitations of SAEs in extracting all features
Chess (rejected)
Othello
Two metrics are proposed: Coverage (how many of the given board features are captured by SAE features) and Board Reconstruction (how accurately the actual board state can be reconstructed using only SAE activations). Both distinguish differences in SAE quality better than the existing L0 metric (the average number of active features per input), which suggests they can accelerate interpretability research in environments with clear "correct features" like board games. (Verifiable Reward)
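A minimal sketch of how the two metrics could be computed, assuming SAE activations have already been binarized (active / not active) and each board feature is a binary property of the position (e.g. "square C3 holds a black piece"). Scoring "captured" as the best F1 over SAE features, the `min_precision` threshold, and all function and variable names are illustrative assumptions, not the original implementation. For brevity the sketch selects and scores features on the same samples; a real evaluation would select on training data and score on held-out data.

```python
from typing import List, Sequence


def f1_score(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """F1 of a binary predictor against a binary target (0.0 if no true positives)."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def coverage(sae_active: List[List[bool]], board_props: List[List[bool]]) -> float:
    """Mean, over board properties, of the best F1 achieved by any single SAE feature.

    sae_active[f][n]  : is SAE feature f active on sample n?
    board_props[b][n] : does board property b hold on sample n?
    """
    best = [max(f1_score(feat, prop) for feat in sae_active) for prop in board_props]
    return sum(best) / len(best)


def board_reconstruction(
    sae_active: List[List[bool]],
    board_props: List[List[bool]],
    min_precision: float = 0.95,  # assumed threshold for a "reliable" feature
) -> float:
    """F1 of the board state predicted from SAE activations alone.

    An SAE feature "votes" for a board property if, whenever the feature is
    active, the property holds with precision >= min_precision.
    """
    n_samples = len(board_props[0])
    predicted = [[False] * n_samples for _ in board_props]
    for feat in sae_active:
        n_active = sum(feat)
        if n_active == 0:
            continue  # a dead feature predicts nothing
        for b, prop in enumerate(board_props):
            hits = sum(1 for a, t in zip(feat, prop) if a and t)
            if hits / n_active >= min_precision:
                for n, a in enumerate(feat):
                    if a:
                        predicted[b][n] = True
    flat_pred = [p for row in predicted for p in row]
    flat_true = [t for row in board_props for t in row]
    return f1_score(flat_pred, flat_true)


# Toy data: 2 board properties, 2 SAE features, 4 positions.
toy_board_props = [
    [True, True, False, False],
    [False, True, True, False],
]
toy_sae_active = [
    [True, True, False, False],   # matches property 0 exactly
    [False, True, False, False],  # fires on a subset of property 1
]

print(round(coverage(toy_sae_active, toy_board_props), 3))
print(round(board_reconstruction(toy_sae_active, toy_board_props), 3))
```

Unlike L0, both scores depend on ground-truth board state, which is why they only apply in settings where the "correct features" are known in advance.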