CRL Paper Plan

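Change the Motivation to attribution graphs

Also, things like attribution graphs are too complex; human-interpretable - single control

Inference-time alignment

Propose a way to track performance improvements; is this something like model diffing?

Figure font size

Llama GSM8K is not needed, because we only need to show architectural generalization: beyond a single family, multiple architectures tested

Layer-wise steering results, and resolution

Check whether CRL and CorrSteer discover the same features - do this later

Also include the correlation of the added features

JumpReLU? Verify the implementation is exact

Selective performance on Llama

In the system diagram, does CRL have a direction along both token and layer?

Replace images with higher-quality versions

Paper direction: go with single-layer, but mention the simple-task performance where all layers were possible (89, etc.)

Unify the mathematical notation across Background and Method (SAE)

Formulate the Markov decision process as shared across all layers, using ^{\ell} superscripts

The one-hot in Method looks clumsy; use argmax / top-k and set the selected entries to 1

Feature diversity decreased as policy layer depth increased, and critic loss was also better with smaller critic layer depth.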

Option

Correlation Steering Warmup

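--epsilon

Where and in what order to apply epsilon, act, and masking

jumprelu

Slightly lower
But combined with masking it reached 78, not sure why
multi feature or single feature
threshold initialization
Whether to apply it in stage 1 as well

--q

No change

loss softmax

Usually lower

(--grpo)

A more effective sparse selection structure

Token decay, or simply multiplying by the correlation instead of adding it - performance stays at the 73 obtained with 100, and dropping a hyperparameter is a plus; the same features are still used.

Do we need to handle the case of negative corr and negative logits?

When updating corr during actual steering, corr must be computed from the selected SAE feature and the added coeff

Decay steering over time with an activation decay of 0.99 or 0.95? - performance held, but token lengths are short so it is not very meaningful (73.21)

We could mask only among the features active on the current token, i.e. among the encode outputs, as long as performance holds

Or mask the opposite set (the currently active ones), in order to add new features
Or, where the correlation is multiplied in, use either 1 or corr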
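
A rough sketch of the add-vs-multiply idea from the notes above, in plain Python rather than the project's PyTorch code. The logits, correlations, and the `scale` value are made-up illustration values, and `boost_additive` / `boost_multiplicative` are hypothetical helper names, not functions from the repo.

```python
def boost_additive(logits, corr, scale):
    """Add scale * corr to each logit (needs the extra `scale` hyperparameter)."""
    return [l + scale * c for l, c in zip(logits, corr)]


def boost_multiplicative(logits, corr):
    """Multiply each logit by its correlation (no extra hyperparameter)."""
    return [l * c for l, c in zip(logits, corr)]


# Hypothetical policy logits and per-feature reward correlations.
logits = [2.0, 1.0, -1.0]
corr = [0.1, 0.8, -0.5]

additive = boost_additive(logits, corr, scale=100.0)
multiplicative = boost_multiplicative(logits, corr)
```

Multiplying flips the sign when both the logit and the correlation are negative (the third entry here), which is the case the question about negative corr and negative logits refers to.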

1. Gumbel Softmax + Top-K Selection

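Performance dropped


import torch
import torch.nn as nn

class GumbelTopK(nn.Module):
    def __init__(self, temperature=1.0, k=1):
        super().__init__()  # required before an nn.Module can be called
        self.temperature = temperature
        self.k = k
    
    def forward(self, logits):
        # Gumbel(0, 1) noise; the clamp avoids log(0) when rand_like returns exactly 0
        u = torch.rand_like(logits).clamp_min(1e-10)
        gumbel_noise = -torch.log(-torch.log(u))
        noisy_logits = (logits + gumbel_noise) / self.temperature
        # Top-k over the perturbed logits: returns (values, indices);
        # gradients flow through the selected values only, not the indices
        return torch.topk(noisy_logits, self.k, dim=-1)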

2. Sparse Attention Mechanism

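Performance maintained


class SparseAttention(nn.Module):
    def __init__(self, dim, sparsity=0.1):
        super().__init__()  # must run before submodules such as nn.Linear are assigned
        self.attention = nn.Linear(dim, dim)
        self.sparsity = sparsity
    
    def forward(self, x, correlation_weights):
        attn_weights = torch.softmax(self.attention(x), dim=-1)
        # Keep only the top `sparsity` fraction of features by correlation
        sparse_mask = correlation_weights > correlation_weights.quantile(1 - self.sparsity)
        # The boolean mask broadcasts over the leading dims; masked rows no longer sum to 1
        return attn_weights * sparse_mask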

3. Straight-Through Estimator (STE)

Performance maintained


class StraightThroughTopK(nn.Module):
    def forward(self, logits, k=1):
        # Forward: discrete top-k
        _, indices = torch.topk(logits, k, dim=-1)
        y_hard = torch.zeros_like(logits).scatter_(-1, indices, 1.0)
        # Backward: continuous gradients
        y_soft = torch.softmax(logits, dim=-1)
        return y_hard - y_soft.detach() + y_soft

4. Learnable Sparse Gates

Performance maintained


class SparseGate(nn.Module):
    def __init__(self, dim):
        super().__init__()  # must run before nn.Parameter attributes are assigned
        self.gate = nn.Parameter(torch.ones(dim))
        self.threshold = nn.Parameter(torch.tensor(0.5))
    
    def forward(self, x, correlation_bias=None):
        gate_scores = torch.sigmoid(self.gate)
        if correlation_bias is not None:
            gate_scores = gate_scores + correlation_bias
        # Hard threshold: the comparison is not differentiable, so as written
        # neither `gate` nor `threshold` receives a gradient through this mask
        sparse_mask = (gate_scores > self.threshold).float()
        return x * sparse_mask

1. Gumbel Softmax: differentiable discrete sampling

2. Sparse Attention: use correlation as attention weights

3. STE: discrete selection + continuous gradients

or simply

Token-wise context-dependent correlation → decreased


    else:
      # Stage 2: PPO + context-dependent correlation boost
      raw_logits = self.fc1(observation)  # (batch, seq_len, dict_size)
      if selection_weights is not None:
        # Make correlation boost context-dependent by scaling with observation magnitude
        obs_magnitude = torch.norm(observation, dim=-1, keepdim=True)  # (batch, seq_len, 1)
        context_scale = torch.sigmoid(obs_magnitude)  # sigmoid of a non-negative norm, so values lie in [0.5, 1)
        selection_boost = selection_weights.unsqueeze(0).unsqueeze(0) * context_scale  # (batch, seq_len, dict_size)
        raw_logits = raw_logits + selection_boost

Token position linear freedom → same (emphasizing later tokens actually lowered it)


  else:
      # Stage 2: PPO + token-wise correlation boost
      raw_logits = self.fc1(observation)  # (batch, seq_len, dict_size)
      if selection_weights is not None:
        # Method 1: Position-dependent correlation (different per token)
        batch_size, seq_len, dict_size = raw_logits.shape
        
        # Create position-dependent correlation weights
        # Use a simple linear combination based on token position
        position_weights = torch.linspace(0.1, 1.0, seq_len, device=raw_logits.device)
        position_weights = position_weights.view(1, seq_len, 1)  # (1, seq_len, 1)
        
        # Apply position-dependent correlation boost
        selection_boost = selection_weights.unsqueeze(0).unsqueeze(0) * position_weights
        raw_logits = raw_logits + selection_boost

Attention-based correlation weighting → same


    else:
      # Stage 2: PPO + adaptive correlation boost
      raw_logits = self.fc1(observation)  # (batch, seq_len, dict_size)
      if selection_weights is not None:
        # Option 1: Simple scaling (existing method)
        # selection_boost = selection_weights.unsqueeze(0).unsqueeze(0)
        
        # Option 2: Context-dependent scaling
        # Compute attention weights based on PPO logits strength
        ppo_strength = torch.norm(raw_logits, dim=-1, keepdim=True)  # (batch, seq_len, 1)
        attention_weights = torch.softmax(ppo_strength.squeeze(-1), dim=-1).unsqueeze(-1)  # (batch, seq_len, 1)
        
        # Apply correlation boost proportional to PPO confidence
        selection_boost = selection_weights.unsqueeze(0).unsqueeze(0) * attention_weights
        raw_logits = raw_logits + selection_boost

Learnable mixing parameter → same


    else:
      # Stage 2: PPO + learnable correlation boost
      raw_logits = self.fc1(observation)  # (batch, seq_len, dict_size)
      if selection_weights is not None:
        # Apply learnable correlation boost
        selection_boost = selection_weights.unsqueeze(0).unsqueeze(0) * self.mixing_weight
        raw_logits = raw_logits + selection_boost
      epsilon = torch.randn_like(raw_logits) * self.epsilon
      raw_logits_noisy = raw_logits + epsilon
      if isinstance(self.act, JumpReLU):
        logits_noisy = self.act(raw_logits_noisy, critic_values)
      else:

Prompts


@train.py @ppo.py @steer.py   read line by line and get comprehensive understanding

1 INTRODUCTION
A 2005 Nature study revealed human neurons firing sparsely for specific individuals; these cells
responded identically to photos, drawings, or even written names. The neurons were multimodal,
encoding pure conceptual information divorced from sensory modality. Rather than relying on
extremely distributed coding, empirical evidence from both human brains and neural networks
suggests that neurons representing ’single concepts’ can indeed exist. In other words, there’s no
fundamental constraint requiring basis features in superposed neuron activations to be dense.
Drawing inspiration from this, we propose Control Reinforcement Learning (CRL), which forces
sparse activation in Transformer-based LLMs, injecting token-specific perturbations to learn the
objective function. Recent work in mechanistic interpretability has shown that sparse autoencoders
(SAEs) can extract sparse, monosemantic features from superposed dense activations (Bricken
et al., 2023). Meanwhile, findings in computational neuroscience suggest that brain architectures
utilize both densely and sparsely activated neurons.
Inspired by this analogy and leveraging the steerable nature of SAE features, we propose a method to
steer transformer representations without modifying the model’s original parameters. Our approach
trains an MLP-based control model that selectively perturbs individual SAE features by observing
token-level internal activations and optimizing these perturbations based on verifiable rewards.
However, existing SAE-based steering approaches face significant limitations: (1) contrastive datasets
or large activation storage are required to identify the direction of the steering, and (2) they rely on
the hidden states of context tokens to select both the features and their coefficients.

To address these limitations, we introduce Adaptive Feature Masking (AFM) and employ a high-epsilon regime to encourage diverse feature discovery. CRL improves performance across diverse
tasks including question answering, bias mitigation, jailbreak prevention, hallucination reduction,
and multi-step reasoning. Notably, on the jailbreak benchmark XSTest with the Gemma 2 2B model,
our method boosts accuracy from 73% to 85% using only 50 training samples.
Together, these results demonstrate the broad applicability of CRL across benchmarks and
highlight a practical pathway for employing mechanistic interpretability toward the reward-aligned
control of AI behavior.
2 BACKGROUND
Mechanistic interpretability aims to reverse-engineer neural networks into human-interpretable
components (Olah et al., 2020; Elhage et al., 2021). A central challenge in this endeavor is the
superposition phenomenon, where neural networks learn to represent more features than available
dimensions (Elhage et al., 2022). This efficient representation strategy complicates efforts to identify
the consistent role of specific latent dimensions.
2.1 SPARSE AUTOENCODERS
Sparse Autoencoders (Huben et al., 2023; Bricken et al., 2023) address the superposition problem
by learning to decompose neural activations into interpretable, sparse features. Given an activation
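vector $x \in \mathbb{R}^{d}$, an SAE learns an encoder $f_{\text{enc}} : \mathbb{R}^{d} \to \mathbb{R}^{k}$ and decoder $f_{\text{dec}} : \mathbb{R}^{k} \to \mathbb{R}^{d}$ where
$k \gg d$, such that:
$$z = f_{\text{enc}}(x) = \text{Activation}(W_{\text{enc}} x + b_{\text{enc}}) \tag{1}$$
$$\hat{x} = f_{\text{dec}}(z) = W_{\text{dec}} z + b_{\text{dec}} \tag{2}$$
The training objective is usually a combination of reconstruction loss with sparsity regularization:
$$\mathcal{L} = \|x - \hat{x}\|^{2} + \lambda \|z\|_{1} \tag{3}$$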
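As a minimal sketch of Eqs. (1) to (3), assuming ReLU as the activation (the text leaves `Activation` unspecified) and using toy sizes d = 2, k = 3 rather than a realistic k ≫ d. The helper names `matvec`, `sae_forward`, and `sae_loss` and all numeric values are illustrative assumptions, not the paper's implementation.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (z, x_hat) following Eqs. (1) and (2); ReLU is an assumed activation."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, p) for p in pre]
    x_hat = [p + b for p, b in zip(matvec(W_dec, z), b_dec)]
    return z, x_hat


def sae_loss(x, x_hat, z, lam):
    """Squared reconstruction error plus an L1 sparsity penalty, as in Eq. (3)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in z)
    return recon + lam * sparsity


# Toy parameters (hypothetical values).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # shape k x d
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # shape d x k
b_dec = [0.0, 0.0]

x = [1.0, 2.0]
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z, lam=0.1)
```

In this toy case the third feature is switched off by the ReLU, reconstruction is exact, and the loss consists only of the sparsity term.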
2.2 CONTROL REINFORCEMENT LEARNING FRAMEWORK
We formulate the control of transformer representations as a Markov Decision Process (MDP) where
the agent learns to manipulate sparse autoencoder (SAE) features to optimize task-specific rewards.
Let $x^{(\ell)} \in \mathbb{R}^{d}$ denote the residual stream activations at layer $\ell$ for a target token position, where $d$
is the hidden dimension of the transformer model. Our framework supports both single-layer and
multi-layer interventions across different transformer layers.
Given a pre-trained SAE with encoder $W^{(\ell)}_{\text{enc}} \in \mathbb{R}^{d \times d_{\text{dict}}}$ and decoder $W^{(\ell)}_{\text{dec}} \in \mathbb{R}^{d_{\text{dict}} \times d}$, the sparse
feature activations are computed as:
$$f^{(\ell)} = \text{Activation}(x^{(\ell)} W^{(\ell)}_{\text{enc}} + b^{(\ell)}_{\text{enc}}) \tag{4}$$
where $f^{(\ell)} \in \mathbb{R}^{d_{\text{dict}}}$ represents the sparse feature activations and $d_{\text{dict}}$ is the dictionary size.
The MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$ where:
• State Space $\mathcal{S}$: The observation is $s = x^{(\ell)} \in \mathbb{R}^{d}$, the residual stream activation at the
target layer and token position.
• Action Space $\mathcal{A}$: For computational simplicity, actions are one-hot vectors $a \in \{0, 1\}^{d_{\text{dict}}}$
selecting a single SAE feature to activate, reducing the exploration challenge in high-dimensional feature spaces.
• Transition Function $\mathcal{P}$: Deterministic transition governed by the transformer’s forward
pass with steering applied.
• Reward Function $\mathcal{R}$: Task-specific rewards $r$ based on output quality evaluation.

The steering mechanism applies perturbations to the residual stream via:
$$\tilde{x}^{(\ell)} = x^{(\ell)} + a W^{(\ell)}_{\text{dec}} \tag{5}$$
where $a$ is the one-hot action vector selecting which SAE feature to activate, and $\tilde{x}^{(\ell)}$ represents the
steered activations.
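A small sketch of the steering update in Eq. (5), assuming the row-vector convention used there (each row of the decoder is one feature direction). `steer_onehot` computes the product of the one-hot action with the decoder literally, while `steer_by_index` is the equivalent row lookup. The `coeff` argument is an added assumption (Eq. (5) corresponds to `coeff=1.0`); function names and values are hypothetical.

```python
def steer_onehot(x, W_dec, a):
    """x + a @ W_dec, with a a one-hot list of length d_dict and W_dec of shape d_dict x d."""
    d = len(x)
    delta = [sum(a[j] * W_dec[j][i] for j in range(len(a))) for i in range(d)]
    return [xi + di for xi, di in zip(x, delta)]


def steer_by_index(x, W_dec, feature_idx, coeff=1.0):
    """Same update for a one-hot action: add coeff times one decoder row to x."""
    row = W_dec[feature_idx]
    return [xi + coeff * ri for xi, ri in zip(x, row)]


# Toy decoder with d_dict = 3 features in a d = 2 residual stream (hypothetical values).
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [1.0, -1.0]

steered_a = steer_onehot(x, W_dec, [0, 0, 1])
steered_b = steer_by_index(x, W_dec, feature_idx=2)
```

Because the action is one-hot, the matrix product reduces to selecting a single decoder row, so the lookup form avoids touching the other dictionary entries.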
3 METHOD
Figure 1: Overview of the Control Reinforcement Learning (CRL) framework, showing the interaction
between policy network, critic network, and SAE feature steering mechanism.
3.1 TRAINING ARCHITECTURE
Our training architecture consists of a policy network πθ, a critic network Vϕ, and the steering
mechanism integrated into the transformer’s forward pass.
3.1.1 POLICY NETWORK
The policy network $\pi_\theta : \mathbb{R}^{d} \to \mathbb{R}^{d_{\text{dict}}}$ maps residual stream observations to SAE feature selection
logits. We implement this as an MLP:
$$\mu = \pi_\theta(s) \tag{6}$$
$$a \sim \text{Categorical}(\text{softmax}(\mu)) \tag{7}$$
where the action $a$ represents the selected SAE feature index sampled from a categorical distribution
over $d_{\text{dict}}$ features. The network depth is controlled by the policy_deep hyperparameter.
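The sampling step in Eq. (7) can be sketched as follows, assuming the logits have already been produced by the policy MLP. `softmax` and `sample_feature` are illustrative helper names, and the logits below are made-up values.

```python
import math
import random


def softmax(mu):
    """Numerically stable softmax over a list of logits."""
    m = max(mu)
    exps = [math.exp(v - m) for v in mu]
    total = sum(exps)
    return [e / total for e in exps]


def sample_feature(mu, rng):
    """Sample a feature index from Categorical(softmax(mu)); also return its probability."""
    probs = softmax(mu)
    u = rng.random()
    cumulative = 0.0
    for idx, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return idx, p
    return len(probs) - 1, probs[-1]  # guard against floating-point round-off


rng = random.Random(0)
probs = softmax([0.0, 1.0, 2.0])
dominant_idx, dominant_prob = sample_feature([0.0, 50.0, 0.0], rng)
```

The probability returned with the index is the quantity that enters the PPO probability ratio for the selected feature.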
3.1.2 CRITIC NETWORK
The critic network $V_\phi : \mathbb{R}^{d} \to \mathbb{R}$ estimates the state value function:
$$V_\phi(s) = \mathbb{E}_{\pi_\theta}[r \mid s] \tag{8}$$
We implement the critic as an MLP with configurable depth controlled by the critic_deep hyperparameter, using various activation functions based on empirical performance.
3.1.3 PPO TRAINING ALGORITHM
We utilize Proximal Policy Optimization (PPO) to train both the policy and critic networks. For
our categorical feature selection, the policy network outputs logits over all features, and the PPO
objective becomes:
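$$L_{\text{policy}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t\right)\right] \tag{9}$$
$$L_{\text{critic}}(\phi) = \mathbb{E}\left[(V_\phi(s) - r)^{2}\right] \tag{10}$$
where $r_t(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$ is the probability ratio for the selected feature, $A_t = r - V_\phi(s)$ is the
advantage estimate, and $\epsilon = 0.2$ is the clipping parameter.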
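For a single sample, the terms inside the expectations of Eqs. (9) and (10) can be sketched as below. In the usual PPO setup the clipped surrogate is maximized (equivalently, its negative is minimized); the function names and the probabilities used here are illustrative assumptions.

```python
def advantage(reward, value):
    """A_t = r - V(s), the one-step advantage used in the text."""
    return reward - value


def ppo_policy_term(new_prob, old_prob, adv, clip_eps=0.2):
    """Clipped surrogate for one sample: min(ratio * A, clip(ratio) * A)."""
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * adv, clipped_ratio * adv)


def critic_term(value, reward):
    """Squared error between the critic estimate and the observed reward."""
    return (value - reward) ** 2


# Hypothetical numbers: the selected feature's probability doubled (ratio = 2.0).
adv_pos = advantage(reward=1.0, value=0.5)   # +0.5
adv_neg = advantage(reward=0.0, value=0.5)   # -0.5
term_pos = ppo_policy_term(0.6, 0.3, adv_pos)
term_neg = ppo_policy_term(0.6, 0.3, adv_neg)
value_err = critic_term(value=0.5, reward=1.0)
```

With a positive advantage the ratio is capped at 1 + ε, so the gain from increasing the probability further is limited; with a negative advantage the unclipped (more pessimistic) term is kept.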
Figure 2: Overall performance comparison across different benchmarks showing CRL improvements
over baseline models.
3.2 REWARD SIGNAL DESIGN
We design task-specific reward functions that evaluate output quality. For multiple-choice tasks, we
use exact match rewards:
$$r(\hat{y}, y^{*}) = \begin{cases} 1 & \text{if } \hat{y} = y^{*} \\ 0 & \text{otherwise} \end{cases} \tag{11}$$
For tasks requiring partial credit evaluation, we employ token-level F1 scores or other appropriate
metrics based on the task requirements.
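A minimal sketch of the two reward types mentioned above. Stripping surrounding whitespace before the exact-match comparison and splitting on whitespace for token-level F1 are assumptions made for illustration; the text does not specify the normalization or tokenization actually used.

```python
from collections import Counter


def exact_match_reward(prediction, target):
    """Eq. (11): 1 if the prediction equals the target, else 0 (whitespace-stripped)."""
    return 1.0 if prediction.strip() == target.strip() else 0.0


def token_f1_reward(prediction, target):
    """Token-level F1 over whitespace-separated tokens, counting repeated tokens."""
    pred_tokens = prediction.split()
    gold_tokens = target.split()
    if not pred_tokens or not gold_tokens:
        return 1.0 if pred_tokens == gold_tokens else 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em_hit = exact_match_reward("B", "B")
em_miss = exact_match_reward("A", "B")
f1_partial = token_f1_reward("the cat sat", "the cat")
```

Here the partial-credit example has precision 2/3 and recall 1, giving an F1 of 0.8.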
3.3 ADAPTIVE FEATURE MASKING
To optimize exploration within a constrained feature space, we introduce Adaptive Feature Masking
(AFM). This technique dynamically masks certain features during training to encourage the policy
network to explore diverse feature combinations and prevent premature convergence to suboptimal
feature selections.
The masking strategy operates at three levels:
• None: No masking applied, allowing access to all features
• Generation: Mask features that are not active during generation
• All: Comprehensive masking based on feature importance scores
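One possible reading of the "None" and "Generation" levels is sketched below, assuming masking is applied to the policy logits by setting excluded features to negative infinity before the softmax. The "All" level is omitted because the importance scores it relies on are not defined in the text. `apply_afm` and its arguments are hypothetical names.

```python
import math


def softmax(logits):
    """Stable softmax; entries equal to -inf receive probability 0."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_afm(logits, mode, active_during_generation=()):
    """Return masked logits for the 'none' or 'generation' level (assumed semantics)."""
    if mode == "none":
        return list(logits)
    if mode == "generation":
        keep = set(active_during_generation)
        if not keep:
            raise ValueError("at least one feature must remain selectable")
        return [v if i in keep else -math.inf for i, v in enumerate(logits)]
    raise ValueError(f"unsupported mode: {mode}")


# Hypothetical logits over four features; features 0 and 2 were active during generation.
logits = [0.5, 2.0, 1.0, -0.3]
masked = apply_afm(logits, "generation", active_during_generation=[0, 2])
probs = softmax(masked)
```

Masking at the logit level keeps the remaining probabilities normalized, so the categorical policy can be sampled without further changes.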
3.4 EPSILON-GREEDY EXPLORATION
We employ an epsilon-greedy exploration strategy (ϵ = 0.01) to encourage diverse feature discovery
during training. This exploration mechanism helps prevent the policy from converging to locally
optimal feature selections and promotes the discovery of more effective feature combinations.
The exploration mechanism is balanced with exploitation through careful scheduling of the epsilon parameter throughout training, ensuring that the policy maintains sufficient exploration while gradually
focusing on high-reward feature selections.
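A sketch of epsilon-greedy selection over feature logits with a simple schedule. The linear decay is an assumption, since the text mentions scheduling without giving its form; the starting value 0.01 is taken from the text, while the function names and the end value are hypothetical.

```python
import random


def epsilon_greedy(logits, epsilon, rng):
    """With probability epsilon pick a uniformly random feature, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(logits))
    return max(range(len(logits)), key=lambda i: logits[i])


def linear_epsilon(step, total_steps, start=0.01, end=0.0):
    """Linearly interpolate epsilon from `start` to `end` over training (assumed schedule)."""
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac


rng = random.Random(0)
logits = [0.1, 0.9, 0.3]
greedy_choice = epsilon_greedy(logits, epsilon=0.0, rng=rng)
random_choice = epsilon_greedy(logits, epsilon=1.0, rng=rng)
eps_start = linear_epsilon(0, 14)
eps_end = linear_epsilon(14, 14)
```

With epsilon set to 0 the rule reduces to greedy selection, and with epsilon set to 1 it samples features uniformly, which bounds the two extremes of the exploration trade-off.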

Target

Revised at 1:45


 ~/cloudfiles/code/Users/Seonglae.Cho/corr-steer  main *1 ················ azureml_py38    azureuser@a100research  22:34:42 
❯ python train.py train --layer=global --task=harmbench                                  
/anaconda/envs/azureml_py38/lib/python3.10/site-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "validate" in "CorrConfig" shadows an attribute in parent "BaseModel"
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [01:00<00:00, 30.08s/it]
Training correlations for layers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
Collecting correlations:   0%|        Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Device set to use cuda:0
Collecting correlations:   0%|▏           You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a datasets: 8it [00:16,  2.06s/it]
Collecting correlations:   3%|█▋                                                             | 108/4000 [01:10<42:14,  1.54samples/s]
Layer 1: pos 9572 r=0.6924, neg None        
Layer 2: pos 6712 r=0.6920, neg None
Layer 3: pos 16207 r=0.6858, neg None
Layer 4: pos 3109 r=0.6960, neg None
Layer 5: pos 11099 r=0.7374, neg None
Layer 6: pos 12241 r=0.7345, neg None
Layer 7: pos 11722 r=0.7794, neg None
Layer 8: pos 8642 r=0.7455, neg None
Layer 9: pos 9298 r=0.7751, neg None
Layer 10: pos 3037 r=0.7230, neg None
Layer 11: pos 6905 r=0.7349, neg None
Layer 12: pos 12039 r=0.7407, neg None
Layer 13: pos 6715 r=0.7092, neg None
Layer 14: pos 2949 r=0.7391, neg None
Layer 15: pos 1570 r=0.7418, neg None
Layer 16: pos 5113 r=0.7427, neg None
Layer 17: pos 5887 r=0.7196, neg None
Layer 18: pos 1411 r=0.7119, neg None
Layer 19: pos 324 r=0.7102, neg None
Layer 20: pos 5192 r=0.7175, neg None
Layer 21: pos 7129 r=0.7211, neg None
Layer 22: pos 3311 r=0.7465, neg None
Layer 23: pos 11246 r=0.7108, neg None
Layer 24: pos 12773 r=0.6995, neg None
Layer 25: pos 3912 r=0.7106, neg None
Global best: Layer 7 using positive feature 11722 with correlation 0.7794
CorrSteer (global) saved to checkpoints/gemma2b_harmbench_global.json
Analyzing top correlation features...
Layer 1: Using positive feature 9572 with coefficient 5.2061 (corr=0.6924) [SAE]
Layer 2: Using positive feature 6712 with coefficient 5.6994 (corr=0.6920) [SAE]
Layer 3: Using positive feature 16207 with coefficient 2.5830 (corr=0.6858) [SAE]
Layer 4: Using positive feature 3109 with coefficient 5.8908 (corr=0.6960) [SAE]
Layer 5: Using positive feature 11099 with coefficient 16.9340 (corr=0.7374) [SAE]
Layer 6: Using positive feature 12241 with coefficient 7.3383 (corr=0.7345) [SAE]
Layer 7: Using positive feature 11722 with coefficient 5.0351 (corr=0.7794) [SAE]
Layer 8: Using positive feature 8642 with coefficient 8.7294 (corr=0.7455) [SAE]
Layer 9: Using positive feature 9298 with coefficient 7.5245 (corr=0.7751) [SAE]
Layer 10: Using positive feature 3037 with coefficient 6.6667 (corr=0.7230) [SAE]
Layer 11: Using positive feature 6905 with coefficient 13.8096 (corr=0.7349) [SAE]
Layer 12: Using positive feature 12039 with coefficient 5.2533 (corr=0.7407) [SAE]
Layer 13: Using positive feature 6715 with coefficient 6.9916 (corr=0.7092) [SAE]
Layer 14: Using positive feature 2949 with coefficient 16.6202 (corr=0.7391) [SAE]
Layer 15: Using positive feature 1570 with coefficient 23.8238 (corr=0.7418) [SAE]
Layer 16: Using positive feature 5113 with coefficient 21.8320 (corr=0.7427) [SAE]
Layer 17: Using positive feature 5887 with coefficient 11.3889 (corr=0.7196) [SAE]
Layer 18: Using positive feature 1411 with coefficient 20.5374 (corr=0.7119) [SAE]
Layer 19: Using positive feature 324 with coefficient 35.6101 (corr=0.7102) [SAE]
Layer 20: Using positive feature 5192 with coefficient 45.6623 (corr=0.7175) [SAE]
Layer 21: Using positive feature 7129 with coefficient 33.2255 (corr=0.7211) [SAE]
Layer 22: Using positive feature 3311 with coefficient 19.0001 (corr=0.7465) [SAE]
Layer 23: Using positive feature 11246 with coefficient 61.6424 (corr=0.7108) [SAE]
Layer 24: Using positive feature 12773 with coefficient 50.3317 (corr=0.6995) [SAE]
Layer 25: Using positive feature 3912 with coefficient 57.4309 (corr=0.7106) [SAE]
Evaluating: 100%|██████████████████████████████████████████████████████████████████████████████████| 280/280 [01:05<00:00,  4.26it/s]
Fixed feature accuracy: 67.50%
Results saved to checkpoints/gemma2b_harmbench_multi_25.json
Evaluation accuracy saved to checkpoints/gemma2b_harmbench_global_accuracy.json (accuracy=67.50%)

Current


 ~/cloudfiles/code/Users/Seonglae.Cho/ControlRL  main ⇡1 +4 !3 · 11m 57s  azureml_py38    azureuser@a100research  00:07:19 
❯ python train.py train --task=harmbench --layers=all --eval --flatten
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:49<00:00, 24.77s/it]
wandb: Currently logged in as: seonglae (texonom) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.21.1
wandb: Run data is saved locally in /mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/wandb/run-20250822_002120-iu0mfhzj
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120
wandb: ⭐️ View project at https://wandb.ai/texonom/control_rl
wandb: 🚀 View run at https://wandb.ai/texonom/control_rl/runs/iu0mfhzj
Training Steps:   0%|                                                                                         | 0/14 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Step 0: Avg Train Acc 0.3750, Val Acc 0.5000
  Layer 1: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 43, Avg Activation: 0.0000, Avg Act Values: 7.4332, Top Corr: 1.0000 (idx: 14790, coeff: 40.3346)
  Layer 2: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 25, Avg Activation: 0.0000, Avg Act Values: 11.2021, Top Corr: 1.0000 (idx: 8863, coeff: 37.5957)
  Layer 3: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 40, Avg Activation: 0.0000, Avg Act Values: 7.7847, Top Corr: 0.9998 (idx: 13505, coeff: 45.9490)
  Layer 4: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 35, Avg Activation: 0.0000, Avg Act Values: 4.9902, Top Corr: 0.9999 (idx: 5786, coeff: 24.3388)
  Layer 5: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 33, Avg Activation: 0.0000, Avg Act Values: 10.6308, Top Corr: 1.0000 (idx: 7569, coeff: 25.9536)
  Layer 6: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 36, Avg Activation: 0.0000, Avg Act Values: 7.4637, Top Corr: 1.0000 (idx: 13114, coeff: 23.3237)
  Layer 7: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 34, Avg Activation: 0.0000, Avg Act Values: 7.9720, Top Corr: 0.9999 (idx: 11358, coeff: 23.2175)
  Layer 8: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 28, Avg Activation: 0.0000, Avg Act Values: 5.9078, Top Corr: 0.9999 (idx: 6221, coeff: 23.3399)
  Layer 9: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 24, Avg Activation: 0.0000, Avg Act Values: 9.2464, Top Corr: 1.0000 (idx: 8675, coeff: 12.2697)
  Layer 10: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 24, Avg Activation: 0.0000, Avg Act Values: 11.6061, Top Corr: 0.9998 (idx: 8361, coeff: 17.4053)
  Layer 11: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 17, Avg Activation: 0.0000, Avg Act Values: 9.9951, Top Corr: 0.9999 (idx: 16251, coeff: 30.7651)
  Layer 12: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 23, Avg Activation: 0.0000, Avg Act Values: 16.1377, Top Corr: 1.0000 (idx: 4854, coeff: 15.5613)
  Layer 13: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 20, Avg Activation: 0.0000, Avg Act Values: 12.6093, Top Corr: 1.0000 (idx: 15254, coeff: 13.6998)
  Layer 14: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 25, Avg Activation: 0.0000, Avg Act Values: 10.6163, Top Corr: 0.9999 (idx: 10643, coeff: 5.3389)
  Layer 15: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 19, Avg Activation: 0.0000, Avg Act Values: 9.6818, Top Corr: 0.9999 (idx: 8902, coeff: 6.2690)
  Layer 16: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 19, Avg Activation: 0.0000, Avg Act Values: 14.5731, Top Corr: 1.0000 (idx: 5113, coeff: 22.2766)
  Layer 17: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 18, Avg Activation: 0.0000, Avg Act Values: 13.4663, Top Corr: 0.9997 (idx: 1200, coeff: 18.2989)
  Layer 18: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 18, Avg Activation: 0.0000, Avg Act Values: 21.1193, Top Corr: 0.9999 (idx: 1504, coeff: 7.6450)
  Layer 19: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 14, Avg Activation: 0.0000, Avg Act Values: 44.7919, Top Corr: 0.9996 (idx: 9637, coeff: 57.7203)
  Layer 20: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 15, Avg Activation: 0.0000, Avg Act Values: 17.9265, Top Corr: 0.9997 (idx: 3423, coeff: 15.1552)
  Layer 21: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 11, Avg Activation: 0.0000, Avg Act Values: 20.1886, Top Corr: 0.9992 (idx: 5834, coeff: 83.5779)
  Layer 22: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 11, Avg Activation: 0.0000, Avg Act Values: 17.6585, Top Corr: 0.9999 (idx: 14848, coeff: 15.0354)
  Layer 23: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 9, Avg Activation: 0.0000, Avg Act Values: 29.8452, Top Corr: 0.9994 (idx: 13403, coeff: 20.9856)
  Layer 24: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 12, Avg Activation: 0.0000, Avg Act Values: 25.5663, Top Corr: 0.9999 (idx: 5380, coeff: 16.8423)
  Layer 25: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 8, Avg Activation: 0.0000, Avg Act Values: 24.8593, Top Corr: 0.9999 (idx: 1558, coeff: 32.6608)
Training Steps: 100%|████████████████████████████████████████████████████████████████████████████████| 14/14 [03:28<00:00, 14.89s/it]

=== Final Correlation Results ===
Layer 1: Using positive feature 1513 with coefficient 7.5780 (corr=0.7034) [SAE]
Layer 2: Using positive feature 6712 with coefficient 6.1642 (corr=0.7028) [SAE]
Layer 3: Using positive feature 16207 with coefficient 2.6573 (corr=0.7362) [SAE]
Layer 4: Using positive feature 3109 with coefficient 6.1323 (corr=0.7408) [SAE]
Layer 5: Using positive feature 11099 with coefficient 17.5734 (corr=0.7783) [SAE]
Layer 6: Using positive feature 12241 with coefficient 7.7448 (corr=0.7763) [SAE]
Layer 7: Using positive feature 11099 with coefficient 19.6308 (corr=0.7847) [SAE]
Layer 8: Using positive feature 8642 with coefficient 9.1598 (corr=0.7897) [SAE]
Layer 9: Using positive feature 9298 with coefficient 7.5653 (corr=0.7680) [SAE]
Layer 10: Using positive feature 5996 with coefficient 12.2603 (corr=0.7276) [SAE]
Layer 11: Using positive feature 6905 with coefficient 14.4484 (corr=0.7744) [SAE]
Layer 12: Using positive feature 13016 with coefficient 12.3013 (corr=0.7407) [SAE]
Layer 13: Using positive feature 6715 with coefficient 7.1451 (corr=0.7414) [SAE]
Layer 14: Using positive feature 2949 with coefficient 17.1668 (corr=0.7632) [SAE]
Layer 15: Using positive feature 1570 with coefficient 24.0848 (corr=0.7495) [SAE]
Layer 16: Using positive feature 5113 with coefficient 22.6561 (corr=0.7864) [SAE]
Layer 17: Using positive feature 14231 with coefficient 15.3567 (corr=0.7553) [SAE]
Layer 18: Using positive feature 1411 with coefficient 21.9191 (corr=0.7644) [SAE]
Layer 19: Using positive feature 324 with coefficient 36.2297 (corr=0.7207) [SAE]
Layer 20: Using positive feature 14645 with coefficient 15.4051 (corr=0.7253) [SAE]
Layer 21: Using positive feature 7129 with coefficient 34.9367 (corr=0.7407) [SAE]
Layer 22: Using positive feature 3311 with coefficient 19.7669 (corr=0.7797) [SAE]
Layer 23: Using positive feature 11246 with coefficient 64.2619 (corr=0.7479) [SAE]
Layer 24: Using positive feature 12433 with coefficient 62.1589 (corr=0.7288) [SAE]
Layer 25: Using positive feature 3912 with coefficient 60.3746 (corr=0.7381) [SAE]
Config
        model: gemma2b
        task: harmbench
        layers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
        select_token: False
        decode: False
        category: None
        cot: False
        
Evaluating: 100%|██████████████████████████████████████████████████████████████████████████████████| 280/280 [02:11<00:00,  2.14it/s]
Final harmbench Accuracy with Steering: 45.71%
Results saved to ./checkpoints/gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120/harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_steered.json
Stats saved to ./checkpoints/gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120/harmbench_eval.json
Every outputs are saved to the folder ./checkpoints/gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120
wandb: 
wandb: 🚀 View run gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120 at: https://wandb.ai/texonom/control_rl/runs/iu0mfhzj
wandb: Find logs at: ../../../../../../../mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/wandb/run-20250822_002120-iu0mfhzj/logs

 ~/cloudfiles/code/Users/Seonglae.Cho/ControlRL  main ⇡1 +4 !3 · 12m 29s  azureml_py38    azureuser@a100research  00:30:21 
❯


Layer 1: Using positive feature 1513 with coefficient 7.3448 (corr=0.7034) [SAE]
Layer 2: Using positive feature 6712 with coefficient 5.7849 (corr=0.7028) [SAE]
Layer 3: Using positive feature 16207 with coefficient 2.6573 (corr=0.7362) [SAE]
Layer 4: Using positive feature 3109 with coefficient 6.1323 (corr=0.7408) [SAE]
Layer 5: Using positive feature 11099 with coefficient 17.5734 (corr=0.7783) [SAE]
Layer 6: Using positive feature 12241 with coefficient 7.7448 (corr=0.7763) [SAE]
Layer 7: Using positive feature 11099 with coefficient 19.6308 (corr=0.7847) [SAE]
Layer 8: Using positive feature 8642 with coefficient 9.1598 (corr=0.7897) [SAE]
Layer 9: Using positive feature 9298 with coefficient 7.5653 (corr=0.7680) [SAE]
Layer 10: Using positive feature 5996 with coefficient 12.2603 (corr=0.7276) [SAE]
Layer 11: Using positive feature 6905 with coefficient 14.4484 (corr=0.7744) [SAE]
Layer 12: Using positive feature 13016 with coefficient 12.3013 (corr=0.7407) [SAE]
Layer 13: Using positive feature 6715 with coefficient 7.1451 (corr=0.7414) [SAE]
Layer 14: Using positive feature 2949 with coefficient 16.9027 (corr=0.7632) [SAE]
Layer 15: Using positive feature 1570 with coefficient 24.0848 (corr=0.7495) [SAE]
Layer 16: Using positive feature 5113 with coefficient 22.6561 (corr=0.7864) [SAE]
Layer 17: Using positive feature 14231 with coefficient 15.3567 (corr=0.7553) [SAE]
Layer 18: Using positive feature 1411 with coefficient 21.2447 (corr=0.7644) [SAE]
Layer 19: Using positive feature 324 with coefficient 36.2297 (corr=0.7207) [SAE]
Layer 20: Using positive feature 14645 with coefficient 14.6941 (corr=0.7253) [SAE]
Layer 21: Using positive feature 7129 with coefficient 34.3992 (corr=0.7407) [SAE]
Layer 22: Using positive feature 3311 with coefficient 19.4628 (corr=0.7797) [SAE]
Layer 23: Using positive feature 11246 with coefficient 63.2732 (corr=0.7479) [SAE]
Layer 24: Using positive feature 12433 with coefficient 61.2026 (corr=0.7288) [SAE]
Layer 25: Using positive feature 3912 with coefficient 59.4457 (corr=0.7381) [SAE]
