Auxiliary-K loss

Creator
Seonglae Cho
Created
2025 Jan 27 12:26
Edited
2025 Feb 18 17:18
Similar to Ghost Gradient.
Latents are flagged as dead during training if they have not activated for some predetermined number of tokens (typically 10 million).
The full loss is then defined as L + αL_aux, where α is a small coefficient (typically 1/32):
$$
\mathcal{L}(\mathbf{x}) := \underbrace{\left\|\mathbf{x} - \hat{\mathbf{x}}\bigl(f(\mathbf{x})\bigr)\right\|_2^2}_{\mathcal{L}_{\text{reconstruct}}} + \underbrace{\lambda \mathcal{S}\bigl(f(\mathbf{x})\bigr)}_{\mathcal{L}_{\text{sparsity}}} + \underbrace{\alpha \mathcal{L}_{\text{aux}}}_{\text{auxiliary loss}}
$$
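As a rough illustration of how the pieces above fit together, here is a minimal NumPy sketch. It assumes the auxiliary term is the squared error between the main residual x - x̂ and a reconstruction of that residual built from the top `k_aux` dead latents. All function names, argument names, the decoder layout `(n_latents, d)`, and the ReLU on the selected pre-activations are assumptions made for this example, not taken from the note.

```python
import numpy as np

def update_dead_mask(tokens_since_fired, fired, n_tokens, dead_after=10_000_000):
    """Count tokens since each latent last fired and flag dead latents.

    `dead_after` follows the 10 million token threshold mentioned above.
    """
    tokens_since_fired = np.where(fired, 0, tokens_since_fired + n_tokens)
    return tokens_since_fired, tokens_since_fired >= dead_after

def auxk_loss(x, x_hat, pre_acts, W_dec, dead_mask, k_aux):
    """Squared error between the main residual and its dead-latent reconstruction.

    x, x_hat : (batch, d)          inputs and main reconstruction
    pre_acts : (batch, n_latents)  encoder pre-activations
    W_dec    : (n_latents, d)      decoder weights (layout is an assumption)
    dead_mask: (n_latents,)        True where the latent is flagged dead
    """
    residual = x - x_hat
    k = min(k_aux, int(dead_mask.sum()))
    if k == 0:
        return 0.0  # no dead latents, so nothing to revive
    # Exclude live latents from the top-k selection.
    masked = np.where(dead_mask, pre_acts, -np.inf)
    idx = np.argpartition(-masked, k - 1, axis=1)[:, :k]
    vals = np.take_along_axis(masked, idx, axis=1)
    z = np.zeros_like(pre_acts)
    # ReLU on the selected values is an assumption of this sketch.
    np.put_along_axis(z, idx, np.maximum(vals, 0.0), axis=1)
    residual_hat = z @ W_dec
    return float(np.mean(np.sum((residual - residual_hat) ** 2, axis=1)))

def total_loss(main_loss, aux_loss, alpha=1 / 32):
    """Combine the main loss with the auxiliary term: L + alpha * L_aux."""
    return main_loss + alpha * aux_loss
```

Here `main_loss` stands for the reconstruction term plus any sparsity penalty from the equation above. Because the auxiliary term only involves dead latents, it gives them a gradient signal toward explaining whatever the live latents miss, without changing which latents the main reconstruction uses.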
 
 
 
