My main area is mech interp, and I want to start by explaining why I got into the field.
To introduce my research journey: this is the pivotal paper that got me into mech interp.
This paper introduces an SAE, which decomposes hidden-state activations into latent features and yields a description for each SAE latent feature.
These features can even steer the model's response in an intended way.
That means that, in the future, an internal understanding of the model could help us control it in intended ways, and could even inspire new ideas for model architecture design.
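As a rough illustration of the decomposition and steering described above, here is a minimal sketch in plain Python. It is not the paper's code: the class `ToySAE`, the function `steer`, the random weights, and the dimensions are all hypothetical, and a real SAE learns its encoder and decoder from model activations.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToySAE:
    """Toy sparse autoencoder: hidden state -> non-negative latents -> reconstruction.

    Weights are random here (assumption for illustration); a real SAE is trained.
    """

    def __init__(self, d_model, d_latent, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_latent)]
        self.b_enc = [0.0] * d_latent
        # One decoder direction (a vector in hidden-state space) per latent feature.
        self.w_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_latent)]
        self.b_dec = [0.0] * d_model

    def encode(self, h):
        """Map a hidden state to latent feature activations with a ReLU."""
        pre = [a + b for a, b in zip(matvec(self.w_enc, h), self.b_enc)]
        return [max(0.0, x) for x in pre]

    def decode(self, z):
        """Reconstruct the hidden state as a weighted sum of decoder directions."""
        out = list(self.b_dec)
        for activation, direction in zip(z, self.w_dec):
            for j, component in enumerate(direction):
                out[j] += activation * component
        return out


def steer(h, sae, feature_idx, strength):
    """Steer by adding `strength` times one feature's decoder direction to h."""
    direction = sae.w_dec[feature_idx]
    return [x + strength * d for x, d in zip(h, direction)]


sae = ToySAE(d_model=4, d_latent=8)
hidden = [1.0, 0.0, -1.0, 0.5]
latents = sae.encode(hidden)           # which features fire on this hidden state
steered = steer(hidden, sae, 3, 2.0)   # push the hidden state along feature 3
```

In a real model, the steered hidden state would be written back into the forward pass so that later layers see the modified activation.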
So I spent a year studying SAEs and a large body of literature, and I uploaded a short research post to LessWrong, the most active community.
In short, it is a deep dive into SAEs, and it shows that SAEs are too dependent on the dataset.
Another paper came out as a by-product of that deep dive.
CorrSteer
There are many types of benchmarks, but as I understand it they fall into two levels: first, the concept/probing level; second, the task level, such as MMLU, bias benchmarks, and general reasoning benchmarks that measure intelligence.
I tried to extend the SAE's steering ability to task-level benchmarks by decomposing generation-time activations.
CRL
This paper focuses more on pure interpretability.
The problem with RL-driven training is that, although it is powerful at improving reasoning performance, it can cause emergent misalignment and unpredictable potential dangers that stem from the reward model.
What I did was keep the model frozen, without training any of its parameters; I simply trained a small control network that selects a specific SAE feature, using a PPO framework with verifiable rewards.
That means we can interpret which features of the model are helpful for specific tasks, and use this as a proxy for which parts are trained when the model is trained with reinforcement learning.
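To make the idea concrete, here is a heavily simplified sketch, not the actual CRL implementation. The "control network" is reduced to a single logit vector over SAE features, the base model is absent, and `verifiable_reward` is a hypothetical stand-in that pays 1 when one assumed "helpful" feature is selected. The update uses a PPO-style clipped surrogate objective with the gradient written out by hand.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verifiable_reward(feature_idx, helpful_feature=2):
    # Hypothetical reward: in practice this would check the frozen model's
    # answer after steering with the selected feature.
    return 1.0 if feature_idx == helpful_feature else 0.0


def train_selector(num_features=5, iters=200, batch=32, epochs=4,
                   lr=0.5, clip_eps=0.2, seed=0):
    """Train a feature-selection policy with a PPO-style clipped objective."""
    rng = random.Random(seed)
    logits = [0.0] * num_features  # the only trainable parameters
    for _ in range(iters):
        old_probs = softmax(logits)
        actions = rng.choices(range(num_features), weights=old_probs, k=batch)
        rewards = [verifiable_reward(a) for a in actions]
        baseline = sum(rewards) / batch
        advantages = [r - baseline for r in rewards]
        for _ in range(epochs):
            probs = softmax(logits)
            grad = [0.0] * num_features
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
                # When the clipped term is the active minimum, the gradient is zero.
                if ratio * adv > clipped * adv:
                    continue
                for k in range(num_features):
                    indicator = 1.0 if k == a else 0.0
                    # d(ratio * adv)/d(logit_k) = adv * ratio * (1[k == a] - pi_k)
                    grad[k] += adv * ratio * (indicator - probs[k])
            logits = [l + lr * g / batch for l, g in zip(logits, grad)]
    return softmax(logits)


selection_probs = train_selector()
# Reading off the largest probability shows which feature the policy found helpful.
most_helpful = max(range(len(selection_probs)), key=selection_probs.__getitem__)
```

The interpretability readout is the learned selection distribution itself: the feature the policy concentrates on is the one that earned reward, while the base model's parameters were never touched.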