SAE Loss

Creator
Seonglae Cho
Created
2025 Mar 7 12:28
Edited
2025 Dec 4 0:40
Refs
normalized version of all MSE numbers, where we divide by a baseline reconstruction error of always predicting the mean activations
  • L0
  • L1
  • L2
  • KL
While L0 and L1 allow control of sparsity at the sample level,
SAE High Frequency Latent
features emerge because there's no pressure applied to the overall
SAE Feature Distribution
across all samples, resulting in many latents that are sparsely but frequently activated (5% of features activate more than 50% of the time). Therefore, if we develop approaches like
BatchTopK SAE
, we could incentivize different samples to use different features or penalize overlapping feature usage. Through an independence loss or
Contrastive Learning
, we can achieve globally applicable sparsity.
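
As a rough illustration of the difference between sample-level and batch-level sparsity pressure, here is a minimal sketch contrasting per-sample TopK with a BatchTopK-style selection. The tensor names and shapes are assumptions for illustration, not taken from any particular implementation:

```python
import torch

def per_sample_topk(acts: torch.Tensor, k: int) -> torch.Tensor:
    """Sample-level sparsity: keep the top-k latents independently per sample."""
    vals, idx = acts.topk(k, dim=-1)
    out = torch.zeros_like(acts)
    out.scatter_(-1, idx, vals)
    return out

def batch_topk(acts: torch.Tensor, k: int) -> torch.Tensor:
    """Batch-level sparsity: keep the top (k * batch_size) activations across
    the whole batch, so a latent that fires on every sample must outcompete
    latents from other samples for the shared budget."""
    batch_size = acts.shape[0]
    flat = acts.flatten()
    vals, idx = flat.topk(k * batch_size)
    out = torch.zeros_like(flat)
    out.scatter_(0, idx, vals)
    return out.view(acts.shape)
```

On average each sample still uses k latents, but the allocation can differ per sample, which is the batch-level pressure described above.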

FVU loss (Fraction of Variance Unexplained)

The SAE objective is a tradeoff between sparsity and fidelity
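Following the Refs description above (dividing reconstruction MSE by the error of always predicting the mean activation), here is a minimal sketch of FVU and of a standard sparsity-penalized SAE objective. The λ coefficient and variable names are illustrative assumptions, not from a specific codebase:

```python
import torch

def fvu(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Fraction of Variance Unexplained: reconstruction error normalized by the
    error of a baseline that always predicts the mean activation."""
    resid = (x - x_hat).pow(2).sum()
    baseline = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    return resid / baseline

def sae_loss(x, x_hat, latents, l1_coeff=1e-3):
    """Fidelity (MSE) plus a sparsity penalty (L1 on latent activations):
    the tradeoff described here. l1_coeff is an assumed value."""
    mse = (x - x_hat).pow(2).mean()
    sparsity = latents.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```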
arxiv.org
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model’s (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse – two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs – where we replace the ReLU with a discontinuous JumpReLU activation function – and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE’s forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.

Reconstruction dark matter within
Dictionary Learning

A significant portion of the SAE reconstruction error can be linearly predicted from the input activations, but a nonlinear error component remains that does not decrease even as the SAE is scaled up. Additional techniques are therefore needed to reduce this nonlinear error.
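One way to read this claim operationally: fit a linear map from the input activations to the SAE's error vector and treat the residual as the nonlinear "dark matter". The sketch below assumes matrices `x` (inputs) and `x_hat` (SAE reconstructions) and is illustrative, not the exact procedure from the reference:

```python
import torch

def split_reconstruction_error(x: torch.Tensor, x_hat: torch.Tensor):
    """Split the SAE error into a part linearly predictable from x and a residual."""
    err = x - x_hat                                    # full reconstruction error
    ones = torch.ones(x.shape[0], 1, dtype=x.dtype, device=x.device)
    X = torch.cat([x, ones], dim=1)                    # add a bias column
    W = torch.linalg.lstsq(X, err).solution            # least-squares linear predictor
    err_linear = X @ W                                 # linearly predictable component
    err_nonlinear = err - err_linear                   # "dark matter" left over
    return err_linear, err_nonlinear
```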
transformer-circuits.pub
Circuits Updates - July 2024
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
While end-to-end training with KL divergence requires more computational resources, using KL divergence just for fine-tuning proves to be effective.
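A sketch of what KL-based fine-tuning could look like: the SAE reconstruction is spliced into the residual stream and the SAE is trained to keep the patched output distribution close to the clean one. The `run_with_hooks` interface and hook point follow a TransformerLens-style API and are assumptions, not a fixed recipe:

```python
import torch
import torch.nn.functional as F

def kl_finetune_loss(model, sae, tokens, hook_point):
    """KL(p_clean || p_patched): fine-tune the SAE so that replacing the
    activation with its reconstruction preserves the model's predictions.
    `run_with_hooks` assumes a TransformerLens-style API."""
    with torch.no_grad():
        clean_logits = model(tokens)

    def splice_reconstruction(act, hook):
        return sae(act)  # substitute the SAE reconstruction at the hook point

    patched_logits = model.run_with_hooks(
        tokens, fwd_hooks=[(hook_point, splice_reconstruction)]
    )
    return F.kl_div(
        patched_logits.log_softmax(dim=-1),
        clean_logits.log_softmax(dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```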

L0 is not neutral

L0 is not a neutral hyperparameter — LessWrong
When we train Sparse Autoencoders (SAEs), the sparsity of the SAE, called L0 (the number of latents that fire on average), is treated as an arbitrary…
 
 

Recommendations