Activation Sparsity

The sparser a model's activations are, the more unnecessary computation can be skipped, which improves the model's efficiency.
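As a rough illustration of where the savings come from, the sketch below measures the sparsity of a ReLU activation vector and computes a following matrix-vector product while visiting only the nonzero entries. The helper names (`sparsity`, `dense_matvec`, `sparse_matvec`) and the toy dimensions are assumptions made for this example, not taken from any of the linked papers.

```python
import random


def relu(x):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, v) for v in x]


def sparsity(x):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in x if v == 0.0) / len(x)


def dense_matvec(W, a):
    """W is a list of rows (out_dim x hidden_dim); a is the activation vector."""
    return [sum(w * v for w, v in zip(row, a)) for row in W]


def sparse_matvec(W, a):
    """Same product, but columns whose activation is zero are never touched."""
    active = [j for j, v in enumerate(a) if v != 0.0]
    return [sum(row[j] * a[j] for j in active) for row in W]


random.seed(0)
hidden_dim, out_dim = 64, 8  # toy sizes, chosen only for the example
pre_activation = [random.gauss(0.0, 1.0) for _ in range(hidden_dim)]
a = relu(pre_activation)
W = [[random.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in range(out_dim)]

dense = dense_matvec(W, a)
sparse = sparse_matvec(W, a)
print(f"sparsity: {sparsity(a):.2f}")
print("max difference:", max(abs(d - s) for d, s in zip(dense, sparse)))
```

The two products agree because a zero activation contributes nothing to the sum, so the work per output element scales with the number of nonzero activations rather than with the hidden dimension.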

The Superposition Hypothesis provides theoretical support for this, and high activation sparsity also helps mechanistic interpretability (neurons are easier to separate with a sparse autoencoder, SAE).

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity...

Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function,...

https://arxiv.org/abs/2402.13516


https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/final-reports/final-report-169721612.pdf

CATS with threshold activation

When an activation vector is sparse, as it is after ReLU, computation can be optimized by skipping the zero entries. Modern gated architectures such as SwiGLU produce mostly nonzero activations, but many of those values are close to zero, so zeroing them with a threshold makes computation efficient in the same way.
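A minimal sketch of the thresholding idea follows, assuming a SwiGLU-style hidden state of the form SiLU(gate) * up. It is not the CATS implementation. The function names are hypothetical, and choosing the threshold from a magnitude quantile (`pick_threshold`) is an assumption made here for illustration.

```python
import math


def silu(x):
    """SiLU (x * sigmoid(x)), written to avoid overflow for large |x|."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def threshold_gate(gate, t):
    """Zero every gate activation whose magnitude is below t."""
    return [g if abs(g) >= t else 0.0 for g in gate]


def pick_threshold(gate, target_sparsity):
    """Hypothetical helper: a magnitude below which roughly
    target_sparsity of the entries fall."""
    mags = sorted(abs(g) for g in gate)
    k = min(int(target_sparsity * len(mags)), len(mags) - 1)
    return mags[k]


def swiglu_hidden(gate_pre, up, t=0.0):
    """Thresholded gate times up-projection; zeros in the gate
    propagate to the product and can be skipped downstream."""
    gate = threshold_gate([silu(v) for v in gate_pre], t)
    return [g * u for g, u in zip(gate, up)]


gate_pre = [-3.0, -1.0, -0.01, 0.0, 0.02, 0.5, 2.0, 4.0]
up = [1.0] * len(gate_pre)

gate = [silu(v) for v in gate_pre]
t = pick_threshold(gate, target_sparsity=0.5)
hidden = swiglu_hidden(gate_pre, up, t)
zeros = sum(1 for h in hidden if h == 0.0)
print(f"threshold={t:.4f}, zeroed {zeros} of {len(hidden)} entries")
```

Unlike the ReLU case, this is an approximation: the dropped entries were small but not zero, so the threshold trades a small output error for sparsity.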


https://arxiv.org/pdf/2404.08763
