normalized version of all MSE numbers: each MSE is divided by a baseline reconstruction error, namely the MSE of always predicting the mean activations
- L0
- L1
- L2
- KL
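
The MSE normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `normalized_mse`, the array shapes (samples by hidden dimension), and the synthetic data are all assumptions.

```python
import numpy as np

def normalized_mse(acts: np.ndarray, recon: np.ndarray) -> float:
    """Reconstruction MSE divided by the MSE of always predicting the mean activation.

    acts, recon: arrays of shape (n_samples, d_model) (assumed layout).
    """
    mse = np.mean((acts - recon) ** 2)
    # Baseline: predict the per-dimension mean activation for every sample.
    mean_act = acts.mean(axis=0, keepdims=True)
    baseline = np.mean((acts - mean_act) ** 2)
    return float(mse / baseline)

# Synthetic activations, purely for illustration.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 16))

perfect = normalized_mse(acts, acts)  # perfect reconstruction
mean_only = normalized_mse(acts, np.broadcast_to(acts.mean(axis=0), acts.shape))
```

Under this convention a perfect reconstruction scores 0 and the mean-only predictor scores 1, so values are comparable across layers and models with different activation scales.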
Reconstruction dark matter within Dictionary Learning
A significant portion of SAE reconstruction error can be linearly predicted, but there remains a nonlinear error component that does not decrease even as the SAE size increases. Additional techniques are therefore needed to reduce the nonlinear error.
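
One way to picture the linear/nonlinear split is a least-squares probe. The sketch below assumes the error is predicted from the input activations with an affine map and is fit in-sample; the function name `linear_error_split` and the synthetic data are hypothetical, and this is not necessarily the exact procedure used in the paper.

```python
import numpy as np

def linear_error_split(acts: np.ndarray, sae_error: np.ndarray):
    """Fit sae_error ~ [acts, 1] @ W by least squares.

    Returns the fraction of error variance explained by the linear fit
    and the residual (the "nonlinear" part under this assumption).
    """
    # Append a bias column so the fit is affine rather than strictly linear.
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, sae_error, rcond=None)
    linear_part = X @ W
    nonlinear_part = sae_error - linear_part
    total = np.sum((sae_error - sae_error.mean(axis=0)) ** 2)
    frac_explained = 1.0 - np.sum(nonlinear_part ** 2) / total
    return float(frac_explained), nonlinear_part

# Synthetic stand-in for SAE error: a linear term plus a nonlinear term.
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 8))
A = rng.normal(size=(8, 8))
linear_only_error = acts @ A
mixed_error = acts @ A + 2.0 * np.sin(3.0 * acts)

frac_linear, _ = linear_error_split(acts, linear_only_error)
frac_mixed, residual = linear_error_split(acts, mixed_error)
```

For a real measurement, the map would be fit on one set of tokens and evaluated on held-out tokens, so that the explained fraction is not inflated by overfitting.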
While end-to-end training with KL divergence requires more computational resources, using KL divergence just for fine-tuning proves to be effective.
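
For reference, the KL term itself can be sketched as below. It compares the model's next-token distribution with and without the SAE reconstruction spliced in. The direction KL(original || spliced), the averaging over positions, and the names `log_softmax` and `kl_divergence` are assumptions for illustration; the logits here are random placeholders, not model outputs.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(orig_logits: np.ndarray, spliced_logits: np.ndarray) -> float:
    """Mean over positions of KL(p_orig || p_spliced).

    orig_logits: logits from the unmodified model, shape (n_positions, vocab).
    spliced_logits: logits when the SAE reconstruction replaces the activations.
    """
    log_p = log_softmax(orig_logits)
    log_q = log_softmax(spliced_logits)
    per_position = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(per_position.mean())

# Random placeholder logits, purely for illustration.
rng = np.random.default_rng(0)
orig = rng.normal(size=(4, 10))
spliced = orig + 0.5 * rng.normal(size=(4, 10))

kl_same = kl_divergence(orig, orig)
kl_diff = kl_divergence(orig, spliced)
```

Using this term only in a fine-tuning phase means the extra forward pass through the rest of the model is paid for a fraction of the training steps rather than all of them.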