CRL Abstract

Reinforcement Learning on Internal Representation reveals universal circuit selection networks

Sample efficient: reaches the SFT score of 56 with far fewer samples, only 4000

Do not focus the title on reasoning; frame it as task-wise interpretable circuits

SAE for downstream tasks with RL policy

Need to emphasize that RLHF and RLAIF are not single-step reward

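This project aims to use reinforcement learning (the PPO algorithm) to train a network that generates "steering vectors": it modifies the internal activations of a large language model (LLM, specifically gemma2b) at specific layers in order to improve performance on downstream tasks (e.g., mmlu).

Observation and action: at a designated LLM layer, the PolicyNetwork observes the model's internal state (residual stream).

Action generation: based on the observed state, the PolicyNetwork produces an "action" in the form of a sparse vector. This action can optionally be transformed using the dictionary of a sparse autoencoder (SAE).

Steering application: the SteeringHook uses this action as a steering vector, adding it to or subtracting it from the residual stream of that layer during the LLM's forward pass to adjust the activations.

Generation and evaluation: the steered LLM generates an output for the given task input.

Reward computation: a reward is computed according to whether the generated output is correct.

PPO update: for each steering layer, the PPOTrainer updates the PolicyNetwork and its associated CriticNetwork using the observed states, the actions taken by the policy, the log probabilities, and the computed rewards.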
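
A minimal sketch of the observe / act / steer steps, in plain Python so it runs without any model. Everything here is an assumption for illustration: the action is taken to be a single SAE feature index sampled from the policy's logits, the steering vector is that feature's decoder row times a scalar coefficient, and the names `sample_feature`, `steer`, `decoder`, and `coeff` are hypothetical, not the project's actual PolicyNetwork or SteeringHook code.

```python
import math
import random

def sample_feature(logits, rng):
    """Sample one SAE feature index from softmax(logits); return (index, log_prob)."""
    m = max(logits)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, math.log(probs[idx])

def steer(residual, decoder, feature_idx, coeff=1.0):
    """Add coeff * decoder[feature_idx] to a residual-stream vector.

    A negative coeff subtracts the feature direction instead of adding it.
    """
    direction = decoder[feature_idx]
    return [r + coeff * d for r, d in zip(residual, direction)]

# Toy sizes (hypothetical): 3 SAE features, residual width 4.
decoder = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, -1.0],
]
residual = [0.5, 0.5, 0.5, 0.5]    # stand-in for the observed residual stream
logits = [0.1, 2.0, -1.0]          # stand-in for the policy output

rng = random.Random(0)
idx, logp = sample_feature(logits, rng)
steered = steer(residual, decoder, idx, coeff=0.5)
```

The log probability is returned alongside the index because the PPO update needs it later. In a real model the `steer` step would run inside a forward hook on the chosen layer.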
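
A rough sketch of the reward and the PPO losses for one (state, action) pair. It uses the standard PPO clipped surrogate. The notes do not say how returns or advantages are computed, so the advantage here is simply reward minus the critic's value estimate, which is a simplifying assumption. The function names and the exact-match reward rule are also hypothetical.

```python
import math

def correctness_reward(prediction, answer):
    """Return 1.0 if the generated answer matches the reference, else 0.0 (assumed rule)."""
    return 1.0 if prediction.strip().upper() == answer.strip().upper() else 0.0

def ppo_losses(new_logp, old_logp, reward, value, clip_eps=0.2):
    """Clipped PPO policy loss and squared-error critic loss for one sample."""
    advantage = reward - value                       # assumption: no GAE, no discounting
    ratio = math.exp(new_logp - old_logp)            # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    policy_loss = -min(ratio * advantage, clipped * advantage)
    value_loss = (value - reward) ** 2
    return policy_loss, value_loss

reward = correctness_reward("b ", "B")
policy_loss, value_loss = ppo_losses(new_logp=0.0, old_logp=0.0, reward=reward, value=0.5)
```

Clipping stops a single correct answer from pushing the policy too far from the one that collected the data, which matters when rewards are sparse 0/1 signals.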

abstraction

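There are currently many attempts to improve performance through prompt engineering

However, this is not explicit, and steering a black-box LLM with high-level natural-language instructions has limits

Depending on the model the effect can be negative, and from an AI safety standpoint long prompts are easy to jailbreak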

Reinforcement Learning on Verifiable Rewards (RLVR) has been shown to increase accuracy universally on several reasoning tasks as compute increases. While this method is widespread across reasoning models, the mechanism by which RLVR changes a model remains poorly understood. In this work, we show that a controlled circuit can be learned by reinforcement learning, with increased performance on both a base model and a reasoning model. Specifically, for each generated token, we choose one feature vector in the residual stream space by leveraging a Sparse Autoencoder (SAE) and a control model. Interestingly, the trained control model scored higher than supervised learning and succeeded on hallucination removal and bias mitigation tasks. With this method, and without training the original model, the isolated circuit alone shows the effect of RLVR, through a learned control model that steers the model only via SAE features. This method suggests a new way of discovering circuits in LLMs through a proxy network. More broadly, our work showcases how decomposing an LLM's superposed features makes them interpretable not only to humans but also to machines.

Reinforcement Learning on Verifiable Rewards (RLVR) has demonstrated broad effectiveness in enhancing accuracy across reasoning tasks. However, it remains unclear which of the changes RLVR induces contribute to this improvement. In this work, we show that reinforcement learning on a control model can identify circuits, with increased performance on both base and reasoning models. Specifically, for each generated token, our approach selects a sparse feature vector within the model's residual stream using a Sparse Autoencoder (SAE)-based control mechanism. Remarkably, our reinforcement-learning-trained control model effectively reduces hallucinations and biases within 2000 samples. Importantly, this method isolates activation circuits solely by leveraging SAE features, without training the original model, illustrating the learned capabilities. Overall, our findings suggest a novel pathway to circuit discovery within large language models through proxy networks, highlighting how sparse model features enhance interpretability for both humans and machines.

Reinforcement Learning on Verifiable Rewards (RLVR) has demonstrated broad effectiveness in enhancing accuracy across various reasoning tasks. However, how much the internal changes RLVR makes to an LLM contribute to this improvement remains unclear. In this work, we show that reinforcement learning on a dedicated control model can explicitly identify interpretable circuits, boosting performance in both base and reasoning models. Specifically, for each generated token, we select a sparse feature vector within the model’s residual stream using a Sparse Autoencoder (SAE)-based control mechanism. Remarkably, our RL-trained control model effectively reduces hallucinations and biases with fewer than 2000 samples. Crucially, our method isolates activation circuits solely through SAE-derived features without modifying the original model, clearly illustrating the capabilities of these learned circuits. Overall, our findings provide a novel pathway for circuit discovery in large language models via proxy networks, highlighting how sparse model features substantially improve interpretability for both humans and machines.

Reinforcement Learning on Verifiable Rewards (RLVR) has demonstrated broad effectiveness in improving performance across various reasoning tasks. However, how much the internal changes RLVR makes to an LLM contribute to this improvement remains unclear. In this work, we show that reinforcement learning, applied through a dedicated control model, explicitly identifies interpretable activation circuits, enhancing performance in both base and reasoning models. Specifically, our approach selects a sparse feature vector from the model’s residual stream for each generated token, utilizing a Sparse Autoencoder (SAE)-based control mechanism. Remarkably, the control model successfully reduces hallucinations and biases with fewer than 2000 training samples. By isolating these activation circuits exclusively through SAE-derived features, without altering the original model parameters, our method clearly reveals the functional roles of the learned circuits. Overall, our findings suggest a novel pathway to circuit discovery within LLMs through proxy networks, highlighting how sparse model features enhance interpretability for both humans and machines.

thesis framework

introduction and abstract

New version

Recent mechanistic interpretability work shows that SAEs can extract sparse, monosemantic features from dense activations. Similarly, computational neuroscientists have discovered that brain architectures use not only densely activated neurons but also sparsely activated ones. Drawing on this evolutionary selection and on the steering aspect of SAE features, we propose a training method that leaves the original parameters unmodified and instead learns to steer transformers automatically with RLVR (Reinforcement Learning on Verifiable Rewards) by observing internal representations. To explore a broad and sparse feature dictionary from SAEs, we devised Adaptive Feature Masking and used a high epsilon. An MLP network trained as a control model proved efficient at improving several question answering, bias mitigation, jailbreaking, hallucination, and reasoning tasks. Notably, on the jailbreaking benchmark XSTest, the Gemma 2 2b model improves from 73% to 85% with only 50 samples. Our method suggests universally applicable training across benchmarks. More broadly, our research demonstrates a practical aspect of mechanistic interpretability.

Recent work in mechanistic interpretability has shown that sparse autoencoders (SAEs) can extract sparse and monosemantic features from superposed dense activations. In the meantime, findings in computational neuroscience suggest that brain architectures utilize both densely and sparsely activated neurons. Inspired by this analogy and leveraging the steering capabilities of SAE features, we propose a training method to steer transformer representations without modifying the model’s original parameters. Our approach, CRL (Control Reinforcement Learning), trains an MLP-based control model that perturbs a single SAE feature by observing token-level internal activations and optimizing the perturbation with verifiable rewards. To encourage broad and sparse feature exploration, we introduce Adaptive Feature Masking (AFM) and employ a high-epsilon regime. CRL achieves strong performance across various tasks including question answering, bias mitigation, jailbreak prevention, hallucination reduction, and multi-step reasoning. Notably, for the jailbreak benchmark XSTest on Gemma 2 2B, our method improves accuracy from 73% to 85% using only 50 samples. These results demonstrate the universal applicability of our method across benchmarks and highlight a practical pathway for using mechanistic interpretability in the automated control of language models.

Recent work in mechanistic interpretability has shown that sparse autoencoders (SAEs) can extract sparse, monosemantic features from superpositioned dense activations. Meanwhile, findings in computational neuroscience suggest that brain architectures utilize both densely and sparsely activated neurons. Inspired by this analogy and leveraging the steerable nature of SAE features, we propose a method to steer transformer representations without modifying the model’s original parameters. Our approach, Control Reinforcement Learning (CRL), trains an MLP-based control model that selectively perturbs individual SAE features by observing token-level internal activations and optimizing these perturbations based on verifiable rewards. To optimize exploration within a constrained feature space, we introduce Adaptive Feature Masking (AFM) and employ a high-epsilon regime to encourage diverse feature discovery. CRL achieves strong performance across diverse tasks including question answering, bias mitigation, jailbreak prevention, hallucination reduction, and multi-step reasoning. Notably, on the jailbreak benchmark XSTest with the Gemma 2 2B model, our method improves accuracy from 73% to 85% using only 50 training samples. Together, these results demonstrate the universal applicability of CRL across benchmarks and highlight a practical pathway for employing mechanistic interpretability toward the reward-aligned control of AI behavior.


feedback

The sentences in the abstract are too long; make them easier to read

exp1) bold title and short description sentence

bold two or three words of scientific contribution

Impact statement

format or style with contribution to science

Propose Publication, Knowledge Transfer

merge Background and Literature Review

Not one large experiment; split into exp1, exp2, exp3

Method is one bold item and the experiments are three bold items, so the point is to have 4 chapters

chapter title 1.3 same experiment

chapter 3 literature review - combine 2 and 3

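Being told to smile more is infuriating; they just say whatever they feel like, damn it.

For next time: I am normally someone who laughs a lot, but I do not want to write the paper in a funny way.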