Research notes: CorrSteer demo design

I'm designing the demo and plan to try the v0 + ChatGPT combination as before.

v0 needs images as input, so I'm using Figma, which is more convenient than I expected. It needs the Pro plan though, so I'm working with a free UCL edu account.

The free tier is fine too, but team library is only available on the paid version. The shadcn docs mention it as well, so I did publish library.

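Anyway, v0 also has import from Figma. Before I put a frame down, the import came out white. After adding a white frame the image does get imported, but it doesn't seem any different from just taking a screenshot?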

Anyway, I built the UI skeleton in v0, then moved to ChatGPT (image upload is currently broken there) and worked on improving the features.

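The code and classes are definitely identical to mbti gpt, so I spent a long time wondering why the centering didn't work. The cause was the default vite-react-ts CSS, which has styles applied to main and other elements.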

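I'll watch out for this from now on. The vite ts minimum setup needs more configuration than I expected, so next time I contribute a chat streaming example I should probably skip shadcn and use only tailwind.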

Attach the circle after currentModel, not as a suffix.

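The card size must not change as generated text is added below the prompt, so reserve space for it. But do it by fixing the card size or something similar, so that the empty space does not stay as-is once the text comes in.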

Pin down exactly where "12 post" is (Jupyter)

Zakun

plan

Add a variety of models

Test several datasets

Re-check the distribution, steering vector strategy, coefficient, etc.

1st Weekly sync 📅 Date: January 23rd

1. CorrSteer (targeting ACL 2025 demo)

I created a demo page for CorrSteer (a text classification dataset can be used to steer LLMs by correlating its labels with SAE features).

Demo Video: https://holisticai.slack.com/files/U0874LH2XD2/F089S46HRML/2025-01-23_13-01-57.mp4

What are you proud of, excited by or most energized by?

Other research results

I tried some other datasets, but some of the classification labels (including those in the EMGSD dataset) did not show high correlation, so it would be useful to show how many classification labels can be correlated with SAE features.
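
A minimal sketch of the correlation check described above, assuming one pooled SAE activation vector per text and binary labels. The function name, array shapes, and the use of Pearson correlation are illustrative assumptions, not the actual CorrSteer code.

```python
import numpy as np


def feature_label_correlation(activations: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Pearson correlation between each SAE feature and a binary label.

    activations: (n_samples, n_features), assumed to be pooled SAE activations per text.
    labels: (n_samples,), 0/1 classification labels.
    Returns an (n_features,) array of correlations in [-1, 1].
    """
    acts = activations - activations.mean(axis=0)   # center each feature
    labs = labels.astype(float) - labels.mean()     # center the labels
    cov = acts.T @ labs                             # unnormalized covariance per feature
    denom = np.linalg.norm(acts, axis=0) * np.linalg.norm(labs)
    # Features that never vary (or constant labels) get 0 instead of NaN.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
```

Counting how many labels have at least one feature with a high absolute correlation would then give the "how many labels can be correlated" number. The threshold for "high" is left open here.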

I am investigating how to steer features more efficiently. Currently, the steering vector is added to the residuals of all tokens. Based on other research, I am considering steering by quantile or steering only the last few tokens.
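
A rough sketch of the two strategies, assuming the residual stream for one sequence is a (seq_len, d_model) array. `steer_residuals` and `last_k` are hypothetical names for illustration. The quantile variant is not sketched because its definition is still open.

```python
import numpy as np


def steer_residuals(residuals, steering_vector, coefficient=1.0, last_k=None):
    """Add a scaled steering vector to token residuals and return a steered copy.

    residuals: (seq_len, d_model) residual stream for one sequence.
    steering_vector: (d_model,) direction to add.
    last_k: None steers every token (current behavior);
            an integer steers only the last `last_k` tokens.
    """
    steered = residuals.astype(float)  # astype returns a copy, input stays untouched
    seq_len = steered.shape[0]
    start = 0 if last_k is None else max(seq_len - last_k, 0)
    steered[start:] += coefficient * steering_vector
    return steered
```

With `last_k=None` this matches the current all-token behavior. Passing a small `last_k` leaves the earlier positions unchanged.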

⏭️

What are our next steps?

I will apply it to one or two other models, such as Gemma with GemmaScope

Apply it to other text classification datasets

As noted above, apply a better steering strategy (priority work, minimal time)

Visualize activation distribution across features

2. RL-based Unlearning (EMNLP or NeurIPS 2025)

Research results

Designed the action space, state space, and observation space for reinforcement learning:

Action space: SAE feature index and coefficient
State space: MLP Activation or Residual vector
Observation space: Next token distribution
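
One way these three spaces could be written down, as a sketch only. It assumes the action steers the state along the SAE decoder direction of the chosen feature, which is an assumption and not settled in these notes. `Action`, `apply_action`, and `observe` are hypothetical names, and how logits are computed from the state is left out.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class Action:
    feature_index: int   # which SAE feature to act on
    coefficient: float   # how strongly to push along it


def apply_action(state: np.ndarray, action: Action, decoder: np.ndarray) -> np.ndarray:
    """State transition: shift the MLP activation or residual vector.

    state: (d_model,) vector; decoder: (n_features, d_model) SAE decoder matrix.
    Assumption: steering adds the decoder row of the chosen feature.
    """
    return state + action.coefficient * decoder[action.feature_index]


def observe(logits: np.ndarray) -> np.ndarray:
    """Observation: next-token distribution as a softmax over logits."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted)
    return probs / probs.sum()
```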

⏭️