CorrSteer Paper

Highest correlated feature steering,

Correlation-Based Steering on Test-Time Sparse Autoencoder Features Improves Benchmarks

Text Classification dataset can be used to steer LLMs, correlating with SAE features

Improving Benchmarks via Correlation-Based Steering of Test-Time Sparse Autoencoder Features

token position-aware steering

Extracting SAE features that correlate with a text classification dataset

CorrSteer: Correlation-based LLM Steering Framework

Correlation-based SAE feature extraction to steer for debiasing LLMs

Correlation-based SAE Feature Extraction, Correlation-based LLM Steering Framework

Proposal: Stereotype in LLMs is mediated by multiple directions

Conclusion: a chat model might refuse the user request more easily, whereas for a non-instruction-tuned model it might be harder to block every method of generating harmful text.

Pushkar Mishra is the demo chair, so this seems like a good venue to submit to.

CorrSteer notion

SAE Structure & Weight Initialization

corrsteer presentation

CorrSteer SER

CorrSteer Math Theory

CorrSteer Rebuttal

Feedback

debias some of the current LLMs. Results and discussion are clear but would benefit from some additional motivation and evidence for your claims. Figures are useful; however, line plots can only be used when there is some functional relationship between the axes and interpolation between points makes sense.

Good description of the dataset.

previous work is that you have used as a baseline and what are the metrics and results achieved.

Why did you select the Gemma-2-2b model over another LLM approach?

Some more motivation is required. How is the bias score calculated? It is a bit unclear what this measures.

There are many metrics used that are not introduced or explained; what is daccuracy, etc.?

Why select these two activation engineering approaches over others? More motivation is needed.

ablation studies to provide empirical evidence to back up these claims.

Using a line plot for these figures implies there is a functional relationship between the x and y axes. What does accuracy 0.52 mean for the point between Religion and SES? These would be better as point plots.

LFPY4.pdf


Other studies

A steering vector can also be obtained or trained with a negative/positive dataset; however, I showed that simply correlating with random SAE features can steer LLMs.

For interpretability, linear correlation is a good way to match human understanding.


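Our goal is to minimize this complex RL system into a correlation-based steering system that is similar but much simpler.
It is a completely different method, but it is still steering and the code can be shared, so the key is to delete the complex, unnecessary code and rebuild it around correlation-based steering and training.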

Right now, as you can see in @ppo.py @train.py @eval.py, steering uses an MLP that adds different features to the prediction of each generated token, token by token.

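The newly devised method, CorrSteer, takes the code I wrote and extracts just one feature, the most correlated one, and adds only that feature to the generated tokens. In other words, there is no MLP policy or critic. It can be implemented as a fixed feature that plays the same role by adding the same feature every time.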
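A minimal sketch of what "fixed feature" steering could look like, assuming the chosen feature is represented by a direction vector (for example the SAE decoder row for that feature) and a scalar coefficient. The function names, `direction`, `coeff`, and `prompt_len` are hypothetical and not taken from the repo:

```python
from typing import List, Sequence


def steer_hidden_state(
    hidden: Sequence[float], direction: Sequence[float], coeff: float
) -> List[float]:
    """Add one fixed feature direction, scaled by coeff, to a hidden state."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + coeff * d for h, d in zip(hidden, direction)]


def steer_generated_tokens(
    hidden_states: Sequence[Sequence[float]],
    direction: Sequence[float],
    coeff: float,
    prompt_len: int,
) -> List[List[float]]:
    """Steer only positions >= prompt_len, i.e. the generated tokens.

    Prompt positions are copied unchanged. The same direction and the
    same coefficient are used at every steered position (no policy net).
    """
    return [
        steer_hidden_state(h, direction, coeff) if t >= prompt_len else list(h)
        for t, h in enumerate(hidden_states)
    ]
```

In a real model the addition would happen inside a forward hook on the chosen layer; the plain lists here only illustrate the arithmetic.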

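However, what train hands over to the eval control is not the policy and critic nets but, like the top feature in @analyze.py, the single feature most correlated with the correctness reward. Over num_samples, compute linearly the correlation for features that activate when the answer is correct and do not activate when it is incorrect, and pass only the top-1. Pass it as a fixed feature, together with the coefficient mean, which is also updated and stored along the way.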


Everything is already saved in another repo, so delete all function code that the new method does not need. Things like the dataset loader stay shared.


The correlation is of course computed between the SAE-encoded dictionary activations and the reward, but when computing the correlation and the mean, do not collect all activations and compute at the end; update them incrementally with O(1) memory.
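One way the O(1)-memory update could work, written with my own symbols rather than anything from the code: let x_i be the pooled activation of one SAE feature on sample i and y_i that sample's reward. Only running sums are kept, and the Pearson coefficient can be read off them at any point:

```latex
\begin{align}
S_x &= \sum_{i=1}^{n} x_i, & S_y &= \sum_{i=1}^{n} y_i, & S_{xy} &= \sum_{i=1}^{n} x_i y_i, \\
S_{xx} &= \sum_{i=1}^{n} x_i^2, & S_{yy} &= \sum_{i=1}^{n} y_i^2, \\
r &= \frac{n S_{xy} - S_x S_y}
          {\sqrt{\left(n S_{xx} - S_x^2\right)\left(n S_{yy} - S_y^2\right)}}
\end{align}
```

When y_i is binary (correct or incorrect), this r is the point-biserial correlation. Each new sample only adds to the sums, so memory does not depend on n.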

Specifically, as in the mak generation method, average the activations, gather them with masked max pooling, and update with the reward sample-wise.
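The pooling step could look like the sketch below. It assumes `acts` holds one row of SAE activations per token and `mask` marks the token positions to include (for example the generated tokens). The function name and both arguments are hypothetical, and only the max-pooling part is shown:

```python
from typing import List, Sequence


def masked_max_pool(
    acts: Sequence[Sequence[float]], mask: Sequence[int]
) -> List[float]:
    """Per-feature maximum over the token positions where mask is 1.

    Assumes SAE activations are non-negative, so 0.0 is a safe starting
    value and is also what is returned when the mask selects nothing.
    """
    if len(acts) != len(mask):
        raise ValueError("acts and mask must have the same number of tokens")
    n_features = len(acts[0]) if acts else 0
    pooled = [0.0] * n_features
    for row, keep in zip(acts, mask):
        if not keep:
            continue  # skip masked-out positions such as prompt tokens
        for j, a in enumerate(row):
            if a > pooled[j]:
                pooled[j] = a
    return pooled
```

The result is one vector per sample, which is what gets paired with that sample's reward in the correlation update.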

Highest correlated feature steering,

As shown above, compute the point-biserial correlation incrementally.


Memory does not grow linearly, right? As I said, only two things are stored, the correlation coefficient average and the correlation, each as a 1-D vector over dictionary features, correct?
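A sketch of an accumulator whose memory is constant in the number of samples. This is an assumption about layout, not the repo's code: it keeps three per-feature vectors of running sums plus three scalars, rather than exactly two vectors, and the class and method names are hypothetical:

```python
import math
from typing import List, Sequence, Tuple


class StreamingCorrelation:
    """Pearson correlation of each feature with a reward, from running sums."""

    def __init__(self, n_features: int) -> None:
        self.n = 0
        self.sum_y = 0.0
        self.sum_yy = 0.0
        self.sum_x = [0.0] * n_features
        self.sum_xx = [0.0] * n_features
        self.sum_xy = [0.0] * n_features

    def update(self, x: Sequence[float], y: float) -> None:
        """Fold in one sample: pooled activations x and scalar reward y."""
        self.n += 1
        self.sum_y += y
        self.sum_yy += y * y
        for j, v in enumerate(x):
            self.sum_x[j] += v
            self.sum_xx[j] += v * v
            self.sum_xy[j] += v * y

    def correlations(self) -> List[float]:
        """Per-feature r; 0.0 where either variable has zero variance."""
        n = self.n
        var_y = n * self.sum_yy - self.sum_y ** 2
        out = []
        for sx, sxx, sxy in zip(self.sum_x, self.sum_xx, self.sum_xy):
            var_x = n * sxx - sx * sx
            denom = math.sqrt(max(var_x, 0.0) * max(var_y, 0.0))
            out.append((n * sxy - sx * self.sum_y) / denom if denom > 0 else 0.0)
        return out

    def top_feature(self) -> Tuple[int, float]:
        """Index and r of the most positively correlated feature."""
        corrs = self.correlations()
        idx = max(range(len(corrs)), key=lambda j: corrs[j])
        return idx, corrs[idx]
```

With a binary reward the value returned is the point-biserial correlation. The running mean `sum_x[j] / n` is available from the same state and is one candidate for the stored coefficient average; how the coefficient is actually defined is not specified in these notes.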


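For global or shared mode, it would be good to run a quick test of each feature on the validation set.
Precisely: in global mode, test each layer's top feature on the validation set, then select the one with the best performance.
In the foreach case, validate every candidate individually on the validation set, but keep only those whose performance improves over the validation set without steering, and pass only those to the final eval.
For each layer, take both the largest positive and the largest negative feature, one of each.
In global, it is the single candidate with the highest validation score overall; in foreach, the higher of the two passes if it beats the baseline. (It must be strictly higher; a tie is not included.)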
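The selection rules above can be sketched as follows, assuming each layer contributes a list of `(feature_index, validation_score)` candidates (the top positive and top negative feature) and `baseline` is the validation score without steering. The function names and data layout are hypothetical:

```python
from typing import Dict, List, Tuple

Candidates = Dict[int, List[Tuple[int, float]]]  # layer -> [(feature, score)]


def select_global(per_layer: Candidates) -> Tuple[int, int]:
    """Global mode: the single (layer, feature) with the best validation score."""
    layer, feat, _ = max(
        (
            (layer, feat, score)
            for layer, cands in per_layer.items()
            for feat, score in cands
        ),
        key=lambda t: t[2],
    )
    return layer, feat


def select_foreach(per_layer: Candidates, baseline: float) -> Dict[int, int]:
    """Foreach mode: per layer, keep the better candidate only if it
    strictly beats the unsteered baseline (ties are dropped)."""
    kept: Dict[int, int] = {}
    for layer, cands in per_layer.items():
        feat, score = max(cands, key=lambda c: c[1])
        if score > baseline:
            kept[layer] = feat
    return kept
```

Whether global mode should also require beating the baseline is not stated in the notes, so the sketch does not enforce it.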


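I am going to write the paper. Write in the style of NeurIPS, ICML, and ICLR papers: use varied expressions, but keep the tone non-rhetorical, non-abstract, cautious, and factual.

From the introduction through the appendix, translate from Korean into English and add content according to the code.

Create a separate .tex file for each part and work by including them.
Complete the paper based on this; about 6 pages is enough.
Before writing, you must thoroughly investigate every source code file inside.
After investigating, ask if you need more specific information.
It is a draft, so there is no need to imagine content that does not exist; just turn what is written into sentences and translate it into English.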

Labels       Our SAE                Pre-trained SAE
neutral      0.20689611841934752    0.19147209898466663
stereotype   0.1748590452841023     0.16758989836525592
unrelated    0.38384619088915917    0.36105209301222174

Labels        Original   Biased    Tuned     Infected
Gender        0.17690    0.89651   0.61613   0.92178
LGBTQ+        0.09072    0.94093   0.56118   0.88186
Nationality   0.12526    0.93722   0.73227   0.94507
Profession    0.12830    0.89008   0.62505   0.92072
Race          0.30769    0.84615   0.76923   0.84615
Religion      0.10922    0.94539   0.65529   0.92833