Latents are flagged as dead during training if they have not activated for some predetermined number of tokens (typically 10 million).
The full loss is then defined as L + αLaux, where α is a small coefficient (typically 1/32)
L(x):=Lreconstructx−x^(f(x))22+LsparsityλS(f(x))+auxiliary lossαLaux