Jacobian SAE

JSAE

A JSAE simultaneously trains two separate SAEs: one on the input activations of a model component (such as an MLP) and the other on its output activations.

Here, k is the number of elements that the TopK activation function leaves non-zero.
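To make the role of k concrete, here is a minimal NumPy sketch of a TopK activation. The function name `topk_activation` and the example values are illustrative assumptions, not code from the paper, and real SAE implementations may differ (for example by also applying a ReLU).

```python
import numpy as np

def topk_activation(z, k):
    """Keep the k largest entries of z and set all others to zero."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    # Indices of the k largest entries (order among them is not needed).
    idx = np.argpartition(z, -k)[-k:]
    out[idx] = z[idx]
    return out

# Hypothetical pre-activations for a 5-latent SAE.
pre_acts = np.array([0.3, -1.2, 2.5, 0.9, 0.1])
sparse = topk_activation(pre_acts, k=2)
```

With k=2, only the two largest pre-activations survive, so the resulting latent vector has exactly k non-zero elements.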

Scalar function

Input: the j-th latent of the input SAE (a scalar value), varied while all other latents are held fixed.
Output: the i-th latent of the output SAE, as it responds to that input change (a scalar value).

This lets us analyze whether the relationship between an individual pair of latents is linear or nonlinear, and check whether the corresponding element of the Jacobian accurately predicts these changes. Interestingly, these scalar functions are mostly linear.
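The check described above can be sketched on a toy model. In the snippet below, `latent_map` is a hypothetical stand-in (a small random ReLU network) for the real map from input-SAE latents to output-SAE latents. All names, sizes, and weights are assumptions for illustration only. The sketch sweeps one input latent, records one output latent, and compares the result with the linear prediction from a finite-difference estimate of the Jacobian element.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 6, 16, 5
W1 = rng.normal(size=(n_hidden, n_in))   # hypothetical weights
W2 = rng.normal(size=(n_out, n_hidden))  # hypothetical weights

def latent_map(s_in):
    """Toy stand-in for the map from input latents to output latents."""
    return W2 @ np.maximum(W1 @ s_in, 0.0)

def scalar_function(s_in, i, j, value):
    """Output latent i as a function of input latent j, other inputs fixed."""
    s = s_in.copy()
    s[j] = value
    return latent_map(s)[i]

def jacobian_entry(s_in, i, j, eps=1e-6):
    """Central finite-difference estimate of d out_i / d in_j at s_in."""
    hi = scalar_function(s_in, i, j, s_in[j] + eps)
    lo = scalar_function(s_in, i, j, s_in[j] - eps)
    return (hi - lo) / (2 * eps)

s0 = rng.normal(size=n_in)  # base point in input-latent space
i, j = 2, 3
J_ij = jacobian_entry(s0, i, j)

# Sweep input latent j around its base value and record output latent i.
deltas = np.linspace(-0.5, 0.5, 11)
actual = np.array([scalar_function(s0, i, j, s0[j] + d) for d in deltas])

# Linear prediction from the single Jacobian element.
predicted = latent_map(s0)[i] + J_ij * deltas
max_gap = np.max(np.abs(actual - predicted))
```

A small `max_gap` over the sweep means the scalar function is close to linear near the base point, so the Jacobian element alone predicts the change well. A large gap signals that a nonlinearity (here, a ReLU switching on or off) was crossed.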

arxiv.org

https://arxiv.org/pdf/2502.18147

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations — LessWrong

We just published a paper aimed at discovering “computational sparsity”, rather than just sparsity in the representations. In it, we propose a new ar…

https://www.lesswrong.com/posts/FrekePKc7ccQNEkgT/paper-jacobian-sparse-autoencoders-sparsify-computations-not
