
Huawei Interview

Creator
Seonglae Cho
Created
2025 Nov 20 8:58
Editor
Edited
2025 Nov 20 9:24
Refs
  1. My main area is mechanistic interpretability (mech interp), and I want to start by explaining why I entered the field.
      • To introduce my research journey: this is the pivotal paper that got me into mech interp.
      • The paper introduces a sparse autoencoder (SAE) that decomposes hidden-state activations and yields a description for each SAE latent feature.
      • These features can even steer a model's responses in an intended direction.
      • That means that, in the future, an internal understanding of a model could help control it as intended and even inspire new ideas for model architecture design.
  2. So I spent a year studying a large body of literature and SAEs, and uploaded a short research post to LessWrong, the most active community.
      • In short: a deep dive into SAEs, finding that SAEs are too dependent on the dataset.
  3. Another by-product paper came out of that deep dive.
  4. CorrSteer
      • There are many types of benchmarks, but as I understand them there are, first, concept/probing-level benchmarks and, second, task-level benchmarks such as MMLU, bias benchmarks, and general reasoning benchmarks that measure intelligence.
      • I tried to extend the SAE's steering ability to task benchmarks by decomposing generation-time activations.
  5. CRL
    1. This paper focuses more on pure interpretability.
    2. The problem with RL-driven training is that, while it is powerful at improving reasoning performance, it can cause emergent misalignment and unpredictable potential dangers through the reward model.
    3. What I did was keep the model frozen, without training any of its parameters, and simply train a small control network that selects specific SAE features within a PPO framework with verifiable rewards.
    4. That means we can interpret which features of the model help with specific tasks, and use this as a proxy for which parts change when the model is trained with reinforcement learning.
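
For talking through the SAE idea in the interview, a minimal sketch of "decompose the hidden state into latent features, then steer with one feature" may help. Everything here is an illustrative assumption: the `ToySAE` class, the `steer` helper, the dimensions, and the random (untrained) weights are hypothetical and are not the implementation from any of the papers mentioned above.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToySAE:
    """Toy sparse autoencoder with random, untrained weights (assumption).

    encode: z = relu(W_enc @ h + b_enc)   -> non-negative latent features
    decode: h_hat = sum_i z[i] * W_dec[i] -> reconstruction of the hidden state
    """

    def __init__(self, d_model, n_latents, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(n_latents)]
        self.b_enc = [0.0] * n_latents
        # One decoder direction (a vector in hidden-state space) per latent.
        self.W_dec = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(n_latents)]

    def encode(self, h):
        pre = matvec(self.W_enc, h)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, z):
        d_model = len(self.W_dec[0])
        return [sum(z[i] * self.W_dec[i][j] for i in range(len(z)))
                for j in range(d_model)]


def steer(h, sae, feature_idx, strength):
    """Add `strength` times one latent's decoder direction to the hidden state."""
    direction = sae.W_dec[feature_idx]
    return [x + strength * d for x, d in zip(h, direction)]


sae = ToySAE(d_model=4, n_latents=8)
h = [0.3, -1.2, 0.7, 0.1]           # stand-in for one hidden-state activation
z = sae.encode(h)                    # latent feature activations
h_hat = sae.decode(z)                # reconstruction (poor here: weights are untrained)
h_steered = steer(h, sae, feature_idx=2, strength=3.0)
```

In a real setting the SAE would be trained with a reconstruction loss plus a sparsity penalty, and the steered activation would be written back into the model's forward pass; this sketch only shows the shape of the computation.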
 
 
 
