Activation Sparsity

Creator
Seonglae Cho
Created
2024 Oct 5 22:38
Edited
2024 Nov 30 13:41
Refs
The sparser the activations, the less unnecessary computation is performed, which improves model efficiency. This is theoretically supported by the Superposition Hypothesis, and high activation sparsity also helps Mechanistic interpretability (features are easier to separate with a Neuron SAE).
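
As a minimal sketch of what the "degree of sparsity" means in practice, the snippet below measures the fraction of exactly-zero activations; the helper name and the ReLU example are illustrative assumptions, not tied to any specific library.

```python
import torch

def activation_sparsity(acts: torch.Tensor) -> float:
    """Fraction of exactly-zero entries in an activation tensor."""
    return (acts == 0).float().mean().item()

# ReLU produces exact zeros, so its output is naturally sparse
relu_acts = torch.relu(torch.randn(4, 1024))
print(activation_sparsity(relu_acts))  # roughly 0.5 for standard-normal inputs
```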
 
 
 

CATS with threshold activation

When the activation vector is sparse, as with ReLU outputs, computation can be optimized by skipping the zero entries. However, modern architectures that use SwiGLU produce mostly nonzero activations; many of these are merely epsilon-sized, so zeroing them out with a magnitude threshold restores sparsity and makes computation efficient.
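
A minimal sketch of this kind of magnitude thresholding, assuming a PyTorch tensor of gated-MLP activations; the `keep_fraction` parameter and the per-row top-k cutoff are illustrative assumptions, not the exact CATS procedure.

```python
import torch

def threshold_activations(x: torch.Tensor, keep_fraction: float = 0.5) -> torch.Tensor:
    """Zero out activations whose magnitude falls below a per-row threshold.

    The threshold is the k-th largest |activation| along the hidden dimension,
    so roughly `keep_fraction` of entries survive (values are illustrative).
    """
    k = max(1, int(keep_fraction * x.shape[-1]))
    thresholds = x.abs().topk(k, dim=-1).values[..., -1:]  # k-th largest magnitude per row
    return torch.where(x.abs() >= thresholds, x, torch.zeros_like(x))

# Example: SwiGLU-style gate output with many small (epsilon-sized) values
gate = torch.nn.functional.silu(torch.randn(2, 8))
sparse_gate = threshold_activations(gate, keep_fraction=0.5)
print((sparse_gate == 0).float().mean())  # ~50% of entries zeroed
```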
 
