We report a normalized version of all MSE numbers, dividing by the baseline reconstruction error of always predicting the mean activations (a sketch of this metric follows the list below).
- L0
- L1
- L2
- KL
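
A minimal sketch of these metrics, assuming PyTorch; the function names are illustrative, and KL is interpreted here as the downstream KL divergence between the model's clean logits and its logits when the SAE reconstruction is patched in (this interpretation is an assumption, not stated in the notes):

```python
import torch
import torch.nn.functional as F

def normalized_mse(recon, acts):
    # MSE divided by the error of a baseline that always predicts the mean activation
    mse = (recon - acts).pow(2).mean()
    baseline_mse = (acts - acts.mean(dim=0, keepdim=True)).pow(2).mean()
    return mse / baseline_mse

def l0(latents):
    # average number of active latents per sample (a metric, not a differentiable loss)
    return (latents != 0).float().sum(dim=-1).mean()

def l1(latents):
    # differentiable sparsity proxy: sum of absolute latent activations
    return latents.abs().sum(dim=-1).mean()

def l2(latents):
    # squared-magnitude penalty on latent activations
    return latents.pow(2).sum(dim=-1).mean()

def downstream_kl(clean_logits, patched_logits):
    # KL(clean || patched): how much the next-token distribution shifts when the
    # SAE reconstruction replaces the original activations in the forward pass
    return F.kl_div(
        F.log_softmax(patched_logits, dim=-1),
        F.log_softmax(clean_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```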
While L0 and L1 penalties control sparsity at the level of individual samples, they put no pressure on the feature distribution across all samples, so high-frequency latents emerge: many latents are sparse within any one sample yet activate very frequently overall (about 5% of features activate more than 50% of the time). Approaches such as BatchTopK SAEs, which share the sparsity budget across a batch, could create an incentive for different samples to use different features or penalize overlapping feature usage; combined with an independence loss or contrastive learning, this could enforce sparsity globally rather than per sample.
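
A minimal sketch of a BatchTopK-style activation, assuming PyTorch; the function name is illustrative and the formulation may differ in detail from the published BatchTopK SAE:

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k * batch_size largest (post-ReLU) pre-activations across the
    # whole batch, so the sparsity budget is shared globally: samples that need
    # more features can use them as long as other samples use fewer.
    acts = pre_acts.relu()
    n_keep = k * acts.shape[0]
    if n_keep >= acts.numel():
        return acts
    threshold = acts.flatten().topk(n_keep).values.min()
    return torch.where(acts >= threshold, acts, torch.zeros_like(acts))
```

At inference time the batch-level statistic is not available for single samples, so a fixed threshold estimated during training (e.g. a running average of the batch thresholds) is typically used instead.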
Reconstruction dark matter within Dictionary Learning
A significant portion of the SAE reconstruction error can be linearly predicted from the input activations, but a nonlinear error component remains that does not decrease even as the dictionary size is increased. Additional techniques are therefore needed to reduce this nonlinear error.
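
A minimal sketch of splitting the reconstruction error into a linearly predictable part and a nonlinear residual, assuming PyTorch; the least-squares setup and the function name are illustrative assumptions rather than the exact procedure used in the dark-matter analysis:

```python
import torch

@torch.no_grad()
def split_sae_error(acts: torch.Tensor, recon: torch.Tensor):
    # Regress the SAE error vector on the input activations (plus a bias column)
    # with ordinary least squares; the residual of this fit is the "nonlinear"
    # error that does not shrink as the dictionary grows.
    error = acts - recon                                             # (n, d)
    ones = torch.ones(acts.shape[0], 1, dtype=acts.dtype, device=acts.device)
    X = torch.cat([acts, ones], dim=1)                               # (n, d + 1)
    W = torch.linalg.lstsq(X, error).solution                        # linear predictor of the error
    linear_part = X @ W
    nonlinear_part = error - linear_part
    frac_linear = (linear_part.pow(2).sum() / error.pow(2).sum()).item()
    return linear_part, nonlinear_part, frac_linear
```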
End-to-end training with a KL-divergence objective requires substantially more compute, but applying the KL objective only as a fine-tuning stage is already effective.
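
A minimal sketch of such a KL-only fine-tuning step, assuming PyTorch and a TransformerLens-style HookedTransformer (run_with_hooks and the hook_resid_post hook name come from that library); the function name, layer choice, and SAE interface are assumptions:

```python
import torch
import torch.nn.functional as F

def kl_finetune_step(model, sae, tokens, layer, optimizer):
    # One fine-tuning step: splice the SAE reconstruction into the residual stream
    # and minimize KL(clean logits || patched logits). The optimizer should cover
    # only sae.parameters(); the language model stays frozen.
    with torch.no_grad():
        clean_logits = model(tokens)

    hook_name = f"blocks.{layer}.hook_resid_post"

    def patch_hook(resid, hook):
        return sae(resid)  # assumes sae(x) returns the reconstruction of x

    patched_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, patch_hook)])

    loss = F.kl_div(
        F.log_softmax(patched_logits, dim=-1).flatten(0, 1),
        F.log_softmax(clean_logits, dim=-1).flatten(0, 1),
        log_target=True,
        reduction="batchmean",  # mean KL per token position
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```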
