An auxiliary loss, similar in spirit to the "ghost grads" technique, is added to revive dead latents by having them model the reconstruction error.
Latents are flagged as dead during training if they have not activated for a predetermined number of tokens (typically 10 million).
The full loss is then defined as L + αL_aux, where α is a small coefficient (typically 1/32).
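A minimal NumPy sketch of how this combined loss might be computed. This assumes a TopK-style sparse autoencoder in which the auxiliary term reconstructs the main loss's residual error using only the top-k_aux currently-dead latents; the function and parameter names (`sae_loss_with_aux`, `k_aux`, `dead_mask`) are hypothetical, and the defaults simply echo the values mentioned in the text:

```python
import numpy as np

def topk_mask(x, k):
    """Keep the k largest entries in each row of x, zero the rest."""
    drop_idx = np.argsort(x, axis=-1)[:, :-k]  # indices of all but the k largest
    out = x.copy()
    np.put_along_axis(out, drop_idx, 0.0, axis=-1)
    return out

def sae_loss_with_aux(x, W_enc, b_enc, W_dec, b_dec, dead_mask,
                      k=32, k_aux=512, alpha=1.0 / 32):
    """Full SAE loss L + alpha * L_aux (hypothetical sketch).

    dead_mask: boolean vector over latents, True for latents flagged dead
               (i.e. not activated for ~10M tokens).
    """
    # Encoder pre-activations, then TopK sparsification for the main path.
    z = x @ W_enc + b_enc
    h = topk_mask(np.maximum(z, 0.0), k)
    x_hat = h @ W_dec + b_dec
    err = x - x_hat
    main_loss = np.mean(np.sum(err ** 2, axis=-1))

    # Auxiliary term: let the top-k_aux *dead* latents model the residual err.
    z_dead = np.where(dead_mask, np.maximum(z, 0.0), 0.0)
    k_aux_eff = min(k_aux, int(dead_mask.sum()))
    if k_aux_eff > 0:
        h_aux = topk_mask(z_dead, k_aux_eff)
        err_hat = h_aux @ W_dec  # decode dead latents toward the residual
        aux_loss = np.mean(np.sum((err - err_hat) ** 2, axis=-1))
    else:
        aux_loss = 0.0  # no dead latents: auxiliary term vanishes

    return main_loss + alpha * aux_loss
```

Because the auxiliary term only receives gradient through latents in `dead_mask`, it nudges dead latents toward directions that explain the current reconstruction residual without disturbing the main objective, and the small α keeps it from dominating training.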