Huawei interview

My main area is mechanistic interpretability (mech interp), and I want to start by explaining why I got into the field.

To introduce my research journey: this is the pivotal paper that got me into mech interp.

This paper introduces a sparse autoencoder (SAE), which decomposes the hidden-state activations and yields a description for each SAE latent feature.
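To make the decomposition concrete, here is a minimal sketch of an SAE forward pass in pure Python. It assumes the common ReLU encoder / linear decoder form; the weights, the sizes, and the names `sae_encode` and `sae_decode` are made up for illustration and are not taken from the paper.

```python
# Toy sparse autoencoder (SAE) forward pass.
# Assumption: ReLU encoder, linear decoder. All weights are invented.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(hidden, w_enc, b_enc, b_dec):
    """Map a hidden state to non-negative latent feature activations."""
    centered = [h - b for h, b in zip(hidden, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre]  # ReLU keeps only active features

def sae_decode(latents, w_dec, b_dec):
    """Rebuild the hidden state as a weighted sum of feature directions."""
    out = list(b_dec)
    for activation, direction in zip(latents, w_dec):
        for i, d in enumerate(direction):
            out[i] += activation * d
    return out

# Hidden size 2, three latent features (one row per feature).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
B_DEC = [0.0, 0.0]

hidden = [2.0, 0.5]
latents = sae_encode(hidden, W_ENC, B_ENC, B_DEC)
recon = sae_decode(latents, W_DEC, B_DEC)
print(latents, recon)
```

In this toy case only two of the three features are active, and the decoder rebuilds the hidden state from them. A real SAE is trained with a reconstruction loss plus a sparsity penalty and has far more latents than hidden dimensions.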

That means that, in the future, an internal understanding of the model could help us control it in an intended way, and could even inspire new ideas for model architecture design.

So I spent a year studying a large body of literature and SAEs, and I posted a short research write-up on LessWrong, the most active community.

There are many types of benchmarks, but as I understand it there are two levels: first, the concept/probing level, and second, the task level, such as MMLU, bias benchmarks, and general reasoning benchmarks that measure intelligence.

I tried to extend the SAE's steering ability to task benchmarks by decomposing the generation-time activations.
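A minimal sketch of feature steering, assuming the usual approach of adding a scaled SAE decoder direction to the hidden state at each generation step. The vectors, the `alpha` scale, and the `steer` helper are hypothetical and only illustrate the mechanism.

```python
# Toy activation steering: nudge each generation-step hidden state along
# one SAE feature's decoder direction. All numbers are invented.

def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction (element-wise)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical decoder direction of the SAE feature to amplify.
feature_direction = [0.0, 1.0]

# Hypothetical hidden states, one per generated token.
hidden_per_step = [[1.0, 2.0], [0.5, -1.0]]

steered = [steer(h, feature_direction, alpha=3.0) for h in hidden_per_step]
print(steered)
```

A larger `alpha` pushes the model harder toward the feature; in practice the scale has to be tuned so that the task score improves without degrading fluency.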

This paper focuses more on pure interpretability.
The problem with RL-driven training is that, although it is powerful at increasing reasoning performance, it can cause emergent misalignment and unpredictable potential dangers through the reward model.
What I did was keep the model frozen, without training any of its parameters. I simply trained a small control network that selects a specific SAE feature, using a PPO framework with a verifiable reward.
That means we can interpret which features of the model are helpful for specific tasks, and use this as a proxy for which parts are trained when the model undergoes reinforcement learning.
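The control-network idea can be sketched as a toy, not as the actual training code. Here a plain vector of logits stands in for the control network's output over three SAE features, the verifiable reward is an exact-match check, and the PPO part is just the standard clipped surrogate objective for one sample. The names `verifiable_reward` and `ppo_clip_objective`, the baseline, and all numbers are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution over SAE features."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def verifiable_reward(answer, expected):
    """Hypothetical checkable reward: 1 if the answer matches, else 0."""
    return 1.0 if answer == expected else 0.0

def ppo_clip_objective(new_prob, old_prob, advantage, eps=0.2):
    """PPO clipped surrogate objective for a single action."""
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Stand-in for the control network: logits over 3 candidate SAE features.
old_logits = [0.0, 0.0, 0.0]   # policy that collected the rollout
new_logits = [0.0, 1.0, 0.0]   # policy after a hypothetical update
chosen_feature = 1             # feature that was steered in the rollout

# The frozen model answered correctly when this feature was steered.
reward = verifiable_reward(answer="42", expected="42")
baseline = 0.5                 # assumed constant baseline
advantage = reward - baseline

old_prob = softmax(old_logits)[chosen_feature]
new_prob = softmax(new_logits)[chosen_feature]
objective = ppo_clip_objective(new_prob, old_prob, advantage)
print(objective)
```

Here the probability ratio exceeds `1 + eps`, so the objective is capped at `(1 + eps) * advantage`. Only the controller's logits would be updated; the base model's parameters stay untouched, which is what makes the learned feature preferences readable afterwards.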