CRL Introduction

Introduction

explainable AI, Interpretable AI

Mechanistic Interpretability

LessWrong steering vector

Sparse autoencoders and features will probably need a clear explanation.

We prompt

Motivation

Should we just say honestly that this was inspired by the brain?

Manipulating Stream of Consciousness – Not Prompt Engineering, but Interpretable Activation Engineering

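Hallucination, stream of consciousness, stereotypes, focus, and reasoning capability are probably not expressed at the level of single features. They more likely show up as something like circuits: combinations of features that arise in stages over the token sequence, or the flow of features over time.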

Giving LLM Drug confidential to make LLM high

Applying RL to manipulate SAE


Attempts to improve reasoning performance for LLM-based problem solving rely on training the model to make better use of test-time compute.
In particular, Reinforcement Learning with Verifiable Rewards (RLVR) trains an LLM by directly rewarding problem solving or inference-time behavior.
RLVR shows universal performance gains across benchmarks, but the longer test-time computation adds complexity, and with it risk.
Understanding exactly how RLVR changes LLM reasoning, and making the reasoning process interpretable, would help us deal with models endowed with increased agency and autonomy.
Inspired by the rapid progress of mechanistic interpretability and activation steering, this work leverages the internal representations of chat models to train a separate control model.



A Sparse Autoencoder (SAE) is trained to extract interpretable features from the linear representation space of an LLM.
Moreover, these feature directions have been shown to be effective causal mediators of behavior, enabling fine-grained steering of model output.
SAE features and transcoders are also used for circuit discovery in LLMs: analyzing the logic of behaviors such as reasoning capability, which are expressed not as single features but as circuits, that is, combinations of features arising in stages over the token sequence or the flow of features over time.
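The SAE description above can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: the toy sizes, the randomly sampled parameters (a real SAE is trained), the common ReLU encoder form, and the helper names `sae_encode`, `sae_decode`, and `steer`. Plain Python lists are used so the sketch has no dependencies.

```python
import random

random.seed(0)
D_MODEL, N_FEATURES = 4, 8  # toy sizes, assumed for illustration


def rand_matrix(rows, cols, scale=0.5):
    """Sample a rows x cols matrix of Gaussian entries."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def unit(v):
    """Rescale a vector to unit Euclidean norm."""
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v]


# Hypothetical SAE parameters; a real SAE would be trained, not sampled.
W_enc = rand_matrix(N_FEATURES, D_MODEL)  # one encoder row per feature
b_enc = [0.0] * N_FEATURES
W_dec = [unit(row) for row in rand_matrix(N_FEATURES, D_MODEL)]  # unit feature directions
b_dec = [0.0] * D_MODEL


def sae_encode(x):
    """ReLU(W_enc (x - b_dec) + b_enc): non-negative feature activations."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    return [max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
            for row, b in zip(W_enc, b_enc)]


def sae_decode(f):
    """Reconstruct the activation as a weighted sum of feature directions."""
    return [b_dec[j] + sum(f[i] * W_dec[i][j] for i in range(N_FEATURES))
            for j in range(D_MODEL)]


def steer(x, feature_idx, strength):
    """Add `strength` times one decoder direction to the activation."""
    return [xi + strength * di for xi, di in zip(x, W_dec[feature_idx])]
```

Steering along a decoder direction, rather than editing raw neurons, is what ties the intervention to a named feature: the shift is exactly `strength` units along one interpretable direction.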

In this work, CRL (Control-model Reinforcement Learning) trains only an LLM control model that acts through SAE features, decoupling the effect of RLVR from the LLM itself, in order to identify exactly how RLVR changes the LLM's internal representations.
We confirm that the trained control model behaves similarly to an LLM trained end-to-end with RL, and we use the control model's SAE feature selection and steering behavior to build an automated circuit discovery pipeline.

Concretely, this study has two steps: control-model training, followed by analysis of the controlled model to perform circuit discovery.
To train an interpretable control model, we learn a control policy in an RL feedback loop that adds SAE features to the residual stream per token and per layer.
We train the control model in three settings:
\begin{itemize}
    \item A question answering task where the answer is chosen by generating a single token among the options, controlling a single layer
    \item A question answering task where the answer is chosen by generating a single token among the options, controlling multiple layers
    \item A reasoning task where the answer is produced by generating multiple tokens, controlling a single layer
\end{itemize}
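The simplest of these settings (one token, one layer) can be sketched as a bandit-style loop. This is only a sketch under loud assumptions: the notes do not say which RL algorithm is used, so REINFORCE with a running baseline is swapped in here; the policy ignores its input; and `feature_dirs`, `reward_fn`, and `train_control_policy` are hypothetical names standing in for SAE decoder directions, a verifiable reward on the frozen model's answer, and the training routine.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_control_policy(feature_dirs, reward_fn, residual, steps=500, lr=0.5, seed=0):
    """REINFORCE over a categorical choice of which SAE feature to add.

    feature_dirs: decoder directions, one per SAE feature (hypothetical)
    reward_fn:    maps a steered residual vector to a scalar reward (hypothetical)
    residual:     the frozen model's residual activation at one token and layer
    """
    rng = random.Random(seed)
    logits = [0.0] * len(feature_dirs)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        # The control action: add the chosen feature direction to the residual.
        steered = [r + d for r, d in zip(residual, feature_dirs[k])]
        reward = reward_fn(steered)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)  # running average of past rewards
        # Gradient of log pi(k) with respect to the logits is onehot(k) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == k else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


# Toy usage: only feature 1 pushes the residual in the rewarded direction.
toy_dirs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_probs = train_control_policy(
    toy_dirs, lambda h: 1.0 if h[1] > 0.5 else 0.0, [0.0, 0.0]
)
```

Only the policy logits are updated; the residual and the reward function stand in for a frozen LLM, which is the decoupling the text describes. The multi-layer and multi-token settings would repeat the same action at several layers or token positions.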

We apply three kinds of analysis to the trained control model: feature analysis, analysis of circuits formed by combinations of features across tokens or layers, and a universality check of the control model.
The universality check verifies whether the control model actually improves reasoning performance the way RLVR does, while feature analysis verifies that the control model was trained properly, so that task-relevant features are activated.
Finally, circuit analysis examines, in a minimal way, the interactions among the SAE features adjusted by the control model, which lets us confirm the effect of RLVR.
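As a rough illustration of the feature and circuit analyses, the sketch below counts which features a control model selects and which pairs of features are selected together in the same episode. The log format (a list of `(token_pos, layer, feature_id)` tuples per episode) and the function name `summarize_control_log` are assumptions, and co-occurrence counting is only a simple proxy for feature interaction, not necessarily the analysis used in this work.

```python
from collections import Counter
from itertools import combinations


def summarize_control_log(episodes, top_k=3):
    """Summarize control actions logged during evaluation.

    episodes: list of episodes, each a list of
              (token_pos, layer, feature_id) control actions (assumed format).
    Returns the most frequently selected (layer, feature_id) entries and the
    most frequent co-selected pairs, as (item, count) lists.
    """
    feature_counts = Counter()
    pair_counts = Counter()
    for actions in episodes:
        feats = [(layer, fid) for _, layer, fid in actions]
        feature_counts.update(feats)
        # Count each unordered pair of distinct features once per episode.
        for a, b in combinations(sorted(set(feats)), 2):
            pair_counts[(a, b)] += 1
    return feature_counts.most_common(top_k), pair_counts.most_common(top_k)
```

Pairs that recur across many episodes are candidates for closer inspection, for example by ablating one feature and checking whether the other is still selected.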

Impact statement


In 2005, a letter published in Nature described human neurons responding to specific people sparsely.
The exciting thing wasn't just that they selected for particular people, but that they did so regardless of whether they were shown photographs, drawings, or even images of the person's name. The neurons were multimodal.

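In other words, neurons in the human brain fire sparsely, which lets the brain store different pieces of information separately and represent them in an energy-efficient way.
Empirical evidence that neurons representing, to some degree, 'a single concept' can actually exist, as opposed to an extremely distributed code, has been found in both the human brain and neural networks.
This means there is no constraint requiring the basis features of superposed neuron activations to be dense.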
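One common way to write this down is the sparse coding form below. The notation is assumed here and does not come from these notes: $x \in \mathbb{R}^{n}$ is a neuron activation vector, $d_i$ are the basis feature directions, and $a_i$ are their coefficients.

```latex
\begin{equation}
  x \approx \sum_{i=1}^{m} a_i \, d_i ,
  \qquad m > n ,
  \qquad \lVert a \rVert_0 \ll m
\end{equation}
```

Superposition corresponds to $m > n$ (more features than neurons), and sparsity to only a few nonzero $a_i$ for any given input.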

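Inspired by this, CRL enforces sparse activation on a Transformer-based LLM and learns its objective by injecting per-token perturbations.
Technically, this methodology can be understood not only as separating the effect of RLVR from the LLM, but also as a paradigm that separates the network by reusing features obtained through sparse coding.
This is in line with attempts to improve LLMs through sparse coding, such as memory networks. In the future it could be extended by building a control model applied to transcoders, or by C-SFT (Control-model Supervised Fine-tuning), in which the control model learns to imitate an already-trained reasoning model that has been decomposed with an SAE.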


Recommendations