Scaling Law for SAEs
- Loss decreases approximately as a power law in compute (a fit of this form is sketched after this list)
- As the compute budget grows, the optimal FLOPS allocation to both the number of training steps and the number of features increases approximately as a power law
- At the compute budgets tested, the optimal number of features tends to grow faster than the optimal number of training steps
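As a minimal illustration of the first point, the sketch below fits a power law loss ≈ a · C^(−α) by linear regression in log-log space. The (compute, loss) pairs here are made-up placeholders, not data from the source.

```python
import numpy as np

# Hypothetical (compute, loss) pairs from a sweep of SAE training runs.
compute = np.array([1e15, 1e16, 1e17, 1e18])  # FLOPS (placeholder values)
loss = np.array([0.52, 0.31, 0.19, 0.11])     # reconstruction loss (placeholder)

# A power law L = a * C^(-alpha) is linear in log-log space:
# log L = log a - alpha * log C, so a degree-1 fit recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)

print(f"fitted power law: loss ≈ {a:.3g} * compute^(-{alpha:.3f})")
```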
These results address the extent to which additional compute improves dictionary learning. In an SAE, compute usage depends primarily on two hyperparameters: the number of features being learned and the number of steps used to train the autoencoder.
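To make the trade-off concrete, here is a rough sketch of how a fixed FLOPS budget splits between those two hyperparameters. The per-token cost model (encoder and decoder matmuls, forward plus backward) and all constants are assumptions for illustration, not figures from the source.

```python
def sae_training_flops(n_features: int, n_steps: int,
                       d_model: int = 512, batch_size: int = 4096) -> float:
    """Approximate total FLOPS for one SAE training run (assumed cost model)."""
    flops_per_token = 6 * d_model * n_features  # assumed matmul cost, fwd + bwd
    return flops_per_token * batch_size * n_steps

# Configurations that exhaust the same hypothetical budget: more features
# means fewer affordable training steps.
budget = 1e16
for n_features in (2**k for k in range(13, 18)):
    n_steps = int(budget / sae_training_flops(n_features, n_steps=1))
    print(f"features={n_features:>7}  affordable steps={n_steps:>8,}")
```

Training each such configuration, comparing losses, and repeating at several budgets is the kind of sweep that yields the power-law trends summarized in the list above.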