Transformer Quantization using Sparse Activation

SAE-based compression

Not feasible

This research requires a lot of mathematical analysis

Using this insight, there might be a way to ‘compress’ LLMs, for example by pruning unnecessary features or by quantizing activations.
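
A minimal sketch of both ideas, assuming a standard ReLU SAE (codes = ReLU(x @ W_enc + b_enc), x_hat = codes @ W_dec + b_dec) with random toy weights standing in for a trained one. `prune_dead_features` drops dictionary entries that never fire on a calibration batch, and `quantize_int8` is plain symmetric per-tensor int8 quantization; all names are hypothetical. A feature that is dead on the calibration batch may still fire on other inputs, so the pruned SAE is only guaranteed to match on that batch.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # ReLU SAE encoder: sparse, non-negative feature codes
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(codes, W_dec, b_dec):
    # Linear SAE decoder: reconstruction from the feature dictionary
    return codes @ W_dec + b_dec

def prune_dead_features(X, W_enc, b_enc, W_dec):
    """Drop dictionary features that never fire on the calibration batch X."""
    alive = (sae_encode(X, W_enc, b_enc) > 0).any(axis=0)  # mask over features
    return W_enc[:, alive], b_enc[alive], W_dec[alive, :], alive

def quantize_int8(a):
    """Symmetric per-tensor int8 quantization; dequantize with q * scale."""
    scale = max(float(np.abs(a).max()) / 127.0, 1e-12)
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy setup: random weights stand in for a trained SAE (hypothetical)
rng = np.random.default_rng(0)
d_model, n_feat = 8, 32
W_enc = rng.normal(size=(d_model, n_feat))
b_enc = rng.normal(size=n_feat)
b_enc[:10] = -1e3            # force 10 features to be dead for the demo
W_dec = rng.normal(size=(n_feat, d_model))
b_dec = np.zeros(d_model)
X = rng.normal(size=(64, d_model))

W_enc_p, b_enc_p, W_dec_p, alive = prune_dead_features(X, W_enc, b_enc, W_dec)
full = sae_decode(sae_encode(X, W_enc, b_enc), W_dec, b_dec)
pruned = sae_decode(sae_encode(X, W_enc_p, b_enc_p), W_dec_p, b_dec)

q, scale = quantize_int8(X)  # activation quantization of the block input
X_deq = q.astype(np.float64) * scale
```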

Also, if a transformer block were a linear transformation, we could map each layer's features into the next block's activation tensor, which would reduce the mapping to a simple additive equation.
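
To make the additive idea concrete, a sketch with toy weights, assuming the block is a single matrix `W_block` with no nonlinearity (a strong simplification: real blocks contain attention, MLP nonlinearities and normalization). Because x = sum_i a_i d_i + b_dec, a linear block gives x W = sum_i a_i (d_i W) + b_dec W, so each feature's image in the next block can be precomputed once. The last lines show the decomposition breaking as soon as a ReLU is applied.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_feat, n_tok = 16, 64, 5
W_dec = rng.normal(size=(n_feat, d_model))     # SAE decoder of layer l (toy)
b_dec = rng.normal(size=d_model)
W_block = rng.normal(size=(d_model, d_model))  # hypothetical purely linear block

codes = np.maximum(rng.normal(size=(n_tok, n_feat)), 0.0)  # feature activations
x = codes @ W_dec + b_dec                      # layer-l activation from features

# Precompute once: row i is feature i's image in the next block's activation
feat_to_next = W_dec @ W_block                 # (n_feat, d_model)
bias_to_next = b_dec @ W_block                 # (d_model,)

direct = x @ W_block                            # run the block
additive = codes @ feat_to_next + bias_to_next  # sum_i a_i (d_i W) + b_dec W

# With a nonlinearity (ReLU here) the per-feature sum no longer matches
nonlinear = np.maximum(x @ W_block, 0.0)
per_feature = codes @ np.maximum(feat_to_next, 0.0) + np.maximum(bias_to_next, 0.0)
```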

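From a machine unlearning perspective, reducing knowledge this way would be extremely important and valuable for safety, especially if we could use the SAE to target specific knowledge and suppress it, or compress it away as dead neurons. That would make a very good research direction.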

The open question is how to target and apply this inside an LLM; making good use of the vector reconstructed by the SAE decoder will probably be key.
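
One way to use the decoder for targeted removal, sketched with toy weights: encode the activation, switch off one feature, decode, and add back the SAE reconstruction error so that nothing except the target feature changes. The result equals x - a_k d_k, where d_k is row k of the decoder. The feature index `k` and the function name are hypothetical; which feature encodes a given piece of knowledge would have to be found separately.

```python
import numpy as np

def ablate_feature(x, k, W_enc, b_enc, W_dec, b_dec):
    """Remove SAE feature k's contribution from activations x (n_tok, d_model).
    The SAE reconstruction error is added back so that only feature k changes."""
    codes = np.maximum(x @ W_enc + b_enc, 0.0)
    err = x - (codes @ W_dec + b_dec)            # what the SAE fails to explain
    codes_ablated = codes.copy()
    codes_ablated[:, k] = 0.0                    # target feature switched off
    x_edit = codes_ablated @ W_dec + b_dec + err
    return x_edit, codes

rng = np.random.default_rng(2)
d_model, n_feat = 12, 48
W_enc = rng.normal(size=(d_model, n_feat))
b_enc = rng.normal(size=n_feat)
W_dec = rng.normal(size=(n_feat, d_model))
b_dec = rng.normal(size=d_model)
x = rng.normal(size=(4, d_model))

k = 7  # hypothetical index of a feature tied to the knowledge to unlearn
x_edit, codes = ablate_feature(x, k, W_enc, b_enc, W_dec, b_dec)
```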

Super Weight and super activation
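
A minimal sketch of how one might flag super activation and super weight candidates, assuming a simple outlier heuristic that is not taken from the notes: a channel counts as spiking if its peak magnitude exceeds `ratio` times the median channel peak, and for a down-projection y = h_in @ W_down each (spiking input i, spiking output j) pair points at a candidate weight W_down[i, j]. The threshold, the shapes, and the planted outliers are all illustrative.

```python
import numpy as np

def outlier_channels(acts, ratio=50.0):
    """Channels whose peak |activation| exceeds `ratio` times the median peak.
    `ratio` is a hypothetical threshold, not a value taken from the notes."""
    peak = np.abs(acts).max(axis=0)
    return np.flatnonzero(peak > ratio * np.median(peak))

def candidate_super_weights(h_in, W_down, ratio=50.0):
    """For a down-projection y = h_in @ W_down, pair spiking input channels i
    with spiking output channels j; each pair points at weight W_down[i, j]."""
    y = h_in @ W_down
    return [(int(i), int(j))
            for i in outlier_channels(h_in, ratio)
            for j in outlier_channels(y, ratio)]

rng = np.random.default_rng(3)
h_in = rng.normal(size=(32, 64))           # (tokens, d_ff)
h_in[3, 10] = 500.0                        # planted "super activation"
W_down = 0.1 * rng.normal(size=(64, 16))   # (d_ff, d_model)
W_down[10, 5] = 50.0                       # planted "super weight"

print(candidate_super_weights(h_in, W_down))
```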

low-rank projection
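
For the low-rank projection step, a truncated SVD gives the best rank-r approximation in Frobenius norm (Eckart-Young). A sketch with a toy, nearly rank-8 matrix: the m x n weight is replaced by an m x r and an r x n factor, cutting parameters from m*n to r*(m+n), here from 3072 to 896.

```python
import numpy as np

def low_rank_factors(W, r):
    """Best rank-r approximation W ~ A @ B (Eckart-Young, via truncated SVD)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]      # (m, r): singular values folded into the left factor
    B = Vt[:r, :]             # (r, n)
    return A, B, S

rng = np.random.default_rng(4)
m, n, r = 64, 48, 8
W = rng.normal(size=(m, r)) @ rng.normal(size=(r, n)) + 0.01 * rng.normal(size=(m, n))

A, B, S = low_rank_factors(W, r)
params_before, params_after = W.size, A.size + B.size   # 3072 vs 896
```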

SAE-aware quantization
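
The notes do not spell out what SAE-aware quantization means. One possible reading, an assumption here, is to quantize activations in SAE feature space rather than in the raw basis: keep only the `top_k` largest codes, store them as uint8 with a single scale, and reconstruct through the decoder. Since ReLU codes are non-negative, an unsigned format wastes no range. Names, sizes and the random toy weights are hypothetical.

```python
import numpy as np

def compress_to_codes(x, W_enc, b_enc, top_k):
    """Store an activation vector as its top_k SAE codes, quantized to uint8."""
    codes = np.maximum(x @ W_enc + b_enc, 0.0)   # non-negative, so uint8 suffices
    idx = np.argsort(codes)[-top_k:]             # indices of the largest codes
    vals = codes[idx]
    scale = max(float(vals.max()) / 255.0, 1e-12)
    q = np.round(vals / scale).astype(np.uint8)
    return idx, q, scale

def decompress(idx, q, scale, W_dec, b_dec):
    """Rebuild the activation from the quantized sparse codes via the decoder."""
    return (q.astype(np.float64) * scale) @ W_dec[idx] + b_dec

rng = np.random.default_rng(5)
d_model, n_feat, top_k = 32, 256, 16
W_enc = rng.normal(size=(d_model, n_feat)) / np.sqrt(d_model)
b_enc = -0.5 * np.ones(n_feat)
W_dec = rng.normal(size=(n_feat, d_model)) / np.sqrt(n_feat)
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)

idx, q, scale = compress_to_codes(x, W_enc, b_enc, top_k)
x_hat = decompress(idx, q, scale, W_dec, b_dec)
```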

Try neuron-weight matching, not just the residual stream:

Within the block's weights, find those that correlate only with the target neuron without affecting other neurons (m:1).
Remove all of them, then apply a low-rank approximation.
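
A sketch of the m:1 matching and removal, under several assumptions not stated in the notes: the "target" is taken to be one SAE feature, matching uses Pearson correlation between block neuron activations and SAE codes over a set of tokens, and `hi`/`lo` are arbitrary thresholds. Selected neurons have their outgoing weight rows zeroed before a truncated SVD. The toy data plants three neurons driven only by feature 0 and three shared with feature 1, so only the first three should be selected.

```python
import numpy as np

def corr_matrix(A, B):
    """Pearson correlation between every column of A (t, n) and of B (t, f)."""
    A = (A - A.mean(axis=0)) / (A.std(axis=0) + 1e-8)
    B = (B - B.mean(axis=0)) / (B.std(axis=0) + 1e-8)
    return A.T @ B / A.shape[0]                  # (n, f)

def exclusive_neurons(C, target, hi=0.5, lo=0.2):
    """Neurons strongly correlated with `target` and weakly with every other
    feature (the m:1 match). hi/lo are hypothetical thresholds."""
    others = np.delete(np.abs(C), target, axis=1)
    keep = (np.abs(C[:, target]) >= hi) & (others.max(axis=1) < lo)
    return np.flatnonzero(keep)

def remove_then_low_rank(W_out, neurons, r):
    """Zero the selected neurons' outgoing weights, then truncate to rank r."""
    W = W_out.copy()
    W[neurons, :] = 0.0
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * S[:r], Vt[:r, :]

rng = np.random.default_rng(6)
t, f, n, d_model = 2000, 6, 20, 8
codes = np.maximum(rng.normal(size=(t, f)), 0.0)   # SAE features (toy)
acts = 0.1 * rng.normal(size=(t, n))               # block neurons (toy)
acts[:, 0:3] += codes[:, [0]]                      # driven only by feature 0
acts[:, 3:6] += codes[:, [0]] + codes[:, [1]]      # shared with feature 1

C = corr_matrix(acts, codes)
sel = exclusive_neurons(C, target=0)
A, B = remove_then_low_rank(rng.normal(size=(n, d_model)), sel, r=4)
```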
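The problem is complexity (d: matrix depth, r: residual, f: SAE features, t: test tokens, w: number of weight tests).

f can probably be dropped: the effect on the SAE feature dictionary can be read off a single correlation matrix in one pass.
t can also be handled in one go by batching.
w has to be varied from test to test, so it cannot be batched; the effective count is therefore …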
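
The loop structure implied by these notes, sketched with a toy linear block and a ReLU SAE (all names hypothetical): the w weight edits run one after another, while inside each iteration all t tokens go through as one batch and the effect on all f features comes out of a single matrix product. Each edit here removes one weight by zeroing it; that choice is illustrative.

```python
import numpy as np

def feature_effects(X, W_block, W_enc, b_enc, perturbations):
    """Mean change of every SAE feature under each candidate weight edit.
    One model edit per iteration (w sequential runs); inside each run all t
    tokens are one batch and all f features come from one matrix product."""
    def codes_for(W):
        return np.maximum((X @ W) @ W_enc + b_enc, 0.0)    # (t, f) in one shot
    base = codes_for(W_block)
    rows = []
    for dW in perturbations:                                # w sequential runs
        rows.append((codes_for(W_block + dW) - base).mean(axis=0))  # (f,)
    return np.stack(rows)                                   # (w, f)

rng = np.random.default_rng(7)
t, d_model, f = 256, 16, 64
X = rng.normal(size=(t, d_model))
W_block = rng.normal(size=(d_model, d_model))   # toy linear block (hypothetical)
W_enc = rng.normal(size=(d_model, f))
b_enc = np.zeros(f)

perts = [np.zeros((d_model, d_model))]          # sanity check: no edit
for _ in range(3):
    i, j = rng.integers(d_model, size=2)
    dW = np.zeros((d_model, d_model))
    dW[i, j] = -W_block[i, j]                   # edit = remove one weight
    perts.append(dW)

E = feature_effects(X, W_block, W_enc, b_enc, perts)
```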

