Generalizes the Gated SAE: learns a threshold and zeroes out pre-activations below it
+ L0 loss
JumpReLU SAE with unit (Heaviside) step function
Does this mean they efficiently implemented the gating mechanism using JumpReLU activation?
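Rough sketch to make the notes above concrete, assuming the usual form JumpReLU(z) = z * H(z - theta) with one learned threshold per feature. The names `jumprelu`, `l0_per_example`, `z` and `theta` are my own placeholders, and the values are toy numbers, not taken from any paper or codebase. Forward pass only; how gradients reach theta through the step function is not covered here.

```python
import numpy as np

def jumprelu(pre_acts, theta):
    """Keep a pre-activation only where it exceeds its threshold, else output 0.

    Equivalent to pre_acts * H(pre_acts - theta), with H the unit step function.
    """
    return np.where(pre_acts > theta, pre_acts, 0.0)

def l0_per_example(pre_acts, theta):
    """Count active features per example: sum of H(pre_acts - theta)."""
    return (pre_acts > theta).sum(axis=-1)

# Toy pre-activations for one example with four features (assumed values).
z = np.array([[-0.5, 0.2, 0.9, 1.5]])
# Hypothetical per-feature thresholds; in a real SAE these would be learned.
theta = np.array([0.0, 0.5, 0.5, 2.0])

f = jumprelu(z, theta)          # only the third feature clears its threshold
l0 = l0_per_example(z, theta)   # number of features that fired
```

With theta fixed at 0 this reduces to a plain ReLU, so the threshold is the only extra parameter compared to a standard SAE encoder.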
google jumprelu preliminary
gemma scope jumprelu
differentiable pre-act-loss