Correlation based?
for
- how effectively a Transformer model uses its dimensions
- how the model could be scaled further
- how to manipulate the dimension size
how
- attention head usage sparsity
- activation sparsity
- SAE (not real-time, so a real-time alternative is needed)
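
To make "activation sparsity" concrete, here is a minimal sketch of two cheap per-vector measures that could run in real time on a single activation vector. The function names, the `eps` threshold, and the choice of the Hoyer measure are assumptions for illustration; the notes do not fix a specific metric.

```python
import math

def l0_sparsity(vec, eps=1e-6):
    """Fraction of entries with magnitude <= eps (1.0 means fully inactive).

    `eps` is an assumed threshold, not a value taken from the notes.
    """
    if not vec:
        raise ValueError("vec must be non-empty")
    inactive = sum(1 for x in vec if abs(x) <= eps)
    return inactive / len(vec)

def hoyer_sparsity(vec):
    """Hoyer measure: 0 for a uniform vector, 1 for a one-hot vector."""
    n = len(vec)
    if n < 2:
        raise ValueError("need at least two entries")
    l1 = sum(abs(x) for x in vec)
    l2 = math.sqrt(sum(x * x for x in vec))
    if l2 == 0:
        return 0.0  # convention chosen here for the all-zero vector
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

# Toy activation vector (made-up values, not model output)
acts = [0.0, 0.0, 1.5, -2.0]
print(l0_sparsity(acts), hoyer_sparsity(acts))
```

The L0-style fraction depends on the threshold, while the Hoyer measure is threshold-free and scale-invariant, which may matter when comparing layers with different activation scales.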
problem
- limited to open-source models
- needs a connection to AI safety, or at least interpretability → use a bias dataset
how about
- High Correlation means highly steerable
- true, but it overlaps too much with holistic
- connect activation sparsity with SAE sparsity? mutual information
Depends on whether the eval target is the SAE or the LLM
with the SAE, compare the activation vector and the sparse feature vector
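
One way the correlation idea could be tested, as a rough sketch: for each token, compute the fraction of active entries in the LLM activation vector and in the SAE feature vector, then take the Pearson correlation across tokens. Pearson is used here as a simpler stand-in for mutual information, and `active_fraction`, `pearson`, `eps`, and the toy vectors are all hypothetical.

```python
import math

def active_fraction(vec, eps=1e-6):
    """Fraction of entries with magnitude above eps (assumed threshold)."""
    return sum(1 for x in vec if abs(x) > eps) / len(vec)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)

# Made-up per-token vectors: (LLM activation, SAE feature vector)
tokens = [
    ([0.2, 0.0, 0.0, 1.1], [0.0, 0.0, 0.0, 0.0, 0.9, 0.0]),
    ([0.5, 0.3, 0.0, 0.7], [0.4, 0.0, 0.0, 0.0, 0.6, 0.0]),
    ([0.1, 0.9, 0.4, 0.2], [0.3, 0.2, 0.0, 0.5, 0.1, 0.0]),
]
llm_density = [active_fraction(a) for a, _ in tokens]
sae_density = [active_fraction(f) for _, f in tokens]
print(pearson(llm_density, sae_density))
```

Pearson only captures linear dependence; a mutual information estimate would need binning or a density estimator and considerably more tokens than a toy example.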
keywords
- ai jailbreak
- red teaming
How few active activations are needed to represent the same text; fewer means the model uses its dimension space more efficiently
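
One possible reading of "how few activations represent the same text", sketched under assumptions: count the smallest number of largest-magnitude entries that retain a given share of the vector's squared L2 norm. The function name, the 0.9 default, and the use of squared norm as a proxy for "represents the same text" are all assumptions; a proper evaluation would check the model's output after ablating the remaining entries.

```python
def min_active_for_energy(vec, keep=0.9):
    """Smallest k such that the k largest-magnitude entries hold at least
    `keep` of the squared L2 norm. Returns 0 for an all-zero vector."""
    if not 0 < keep <= 1:
        raise ValueError("keep must be in (0, 1]")
    energies = sorted((x * x for x in vec), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0
    running = 0.0
    for k, e in enumerate(energies, start=1):
        running += e
        if running >= keep * total:
            return k
    return len(energies)

# Toy vectors (made-up): one concentrated, one spread out
print(min_active_for_energy([3.0, 0.1, 0.1, 0.1]))
print(min_active_for_energy([1.0, 1.0, 1.0, 1.0]))
```

A lower k relative to the vector length would suggest the information is concentrated in fewer dimensions under this proxy.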
Seonglae Cho