Change Motivation to attribution graph
Also, things like attribution graphs are too complex; human interpretable - single control
we

inference time alignment
Propose a way to track performance improvement; is it something like model diffing?
figure font size
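Llama GSM8K is not needed, because we only need to show architectural generalization: beyond a single family, multiple architectures tested
Layer-wise steering results, and resolution
Check whether CRL and CorrSteer discover the same features - do this later
Also include the correlation of the added features
JumpReLU? Verify the implementation is exact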
selective performance llama
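In the system diagram, does CRL have a direction for both token and layer?
Replace images with higher-quality versions
Paper direction: go with single layer, but mention the performance on simple tasks where all layers worked (89, etc.)
Unify the mathematical representation of the SAE across Background and Method
Formulate the Markov decision process as shared across all layers, using the ^{\ell} superscript
One-hot in Method looks clumsy, so use argmax/top-k and set the selected entries to 1
Feature diversity decreased as policy layer depth increased, and critic loss was also better with a smaller critic layer depth.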
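A minimal sketch of the "argmax/top-k, then set to 1" idea from the notes above, assuming PyTorch. The function name `topk_to_khot` is hypothetical and not taken from the codebase; with `k=1` it reduces to the one-hot action.

```python
import torch


def topk_to_khot(logits: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Return a {0, 1} mask with ones at the k largest logits on the last dim."""
    _, indices = torch.topk(logits, k, dim=-1)
    return torch.zeros_like(logits).scatter_(-1, indices, 1.0)


logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
one_hot = topk_to_khot(logits, k=1)  # ones only at the argmax (index 1)
two_hot = topk_to_khot(logits, k=2)  # ones at indices 1 and 3
```

Writing the action as a top-k mask keeps the single-feature case and the multi-feature case in one expression.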
Option
- Correlation Steering Warmup
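- --epsilon
- where and in what order epsilon, act, and masking are applied
- jumprelu
- slightly lower
- but combined with masking it reached 78, which is surprising
- multi feature or single feature
- threshold initialization
- whether to also apply it in stage 1
- --q
- no change
- loss softmax
- usually lower
- (--grpo)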
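More effective sparse selection structures
- Token decay, or simply multiply by the correlation instead of adding it - performance stays at the 73 obtained with 100, and it is good because it removes a hyperparameter; the same features are still used.
Do we need to handle the case of negative corr with a negative logit?
When updating corr during actual steering, corr must be computed from the selected SAE feature and the added coeff
- Gradually reduce steering with an activation decay of 0.99 or 0.95? - performance was maintained, but the token length is short so it is not very meaningful (73.21)
- Could mask only among the features active for the current token (among the encoded activations), as long as performance holds
- Or mask the opposite set (the currently active ones) in order to add new features
- Where the correlation is multiplied, could use either 1 or corr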
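A small worked example of the add-versus-multiply question and the negative corr / negative logit concern noted above, assuming PyTorch. All tensor values are made up for illustration, and the clamp at the end is only one possible guard, not something confirmed by the experiments.

```python
import torch

# Hypothetical policy logits and per-feature correlations (illustrative values only).
raw_logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
corr = torch.tensor([0.7, -0.5, 0.0, 0.2])

# Additive boost: in practice this needs a scale hyperparameter for corr.
additive = raw_logits + corr

# Multiplicative weighting: no extra hyperparameter.
multiplicative = raw_logits * corr

# Sign problem: feature 1 has a negative logit and a negative correlation,
# so the product becomes positive (-1.0 * -0.5 = 0.5) and the feature is promoted.
flipped = multiplicative[1] > 0

# One possible guard (an assumption): ignore negative correlations.
guarded = raw_logits * corr.clamp(min=0.0)
```

With the guard, feature 1 is no longer promoted, at the cost of discarding whatever information the negative correlations carry.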
1. Gumbel Softmax + Top-K Selection
Performance dropped

    import torch
    import torch.nn as nn

    class GumbelTopK(nn.Module):
        def __init__(self, temperature=1.0, k=1):
            super().__init__()  # required before the module can be called
            self.temperature = temperature
            self.k = k

        def forward(self, logits):
            # Gumbel noise for differentiable sampling
            gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits)))
            noisy_logits = (logits + gumbel_noise) / self.temperature
            # Top-k selection on the noisy logits (returns values and indices)
            return torch.topk(noisy_logits, self.k, dim=-1)
2. Sparse Attention Mechanism
Performance maintained

    class SparseAttention(nn.Module):
        def __init__(self, dim, sparsity=0.1):
            super().__init__()  # must run before assigning submodules
            self.attention = nn.Linear(dim, dim)
            self.sparsity = sparsity

        def forward(self, x, correlation_weights):
            attn_weights = torch.softmax(self.attention(x), dim=-1)
            # Apply sparsity mask based on correlation: keep the top `sparsity` fraction
            sparse_mask = correlation_weights > correlation_weights.quantile(1 - self.sparsity)
            return attn_weights * sparse_mask
3. Straight-Through Estimator (STE)
Performance maintained

    class StraightThroughTopK(nn.Module):
        def forward(self, logits, k=1):
            # Forward: discrete top-k
            _, indices = torch.topk(logits, k, dim=-1)
            y_hard = torch.zeros_like(logits).scatter_(-1, indices, 1.0)
            # Backward: continuous gradients
            y_soft = torch.softmax(logits, dim=-1)
            return y_hard - y_soft.detach() + y_soft
4. Learnable Sparse Gates
Performance maintained

    class SparseGate(nn.Module):
        def __init__(self, dim):
            super().__init__()  # must run before assigning parameters
            self.gate = nn.Parameter(torch.ones(dim))
            self.threshold = nn.Parameter(torch.tensor(0.5))

        def forward(self, x, correlation_bias=None):
            gate_scores = torch.sigmoid(self.gate)
            if correlation_bias is not None:
                gate_scores = gate_scores + correlation_bias
            # Learnable sparsity threshold
            # NOTE: the hard comparison passes no gradient to gate or threshold
            sparse_mask = (gate_scores > self.threshold).float()
            return x * sparse_mask
1. Gumbel Softmax: differentiable discrete sampling
2. Sparse Attention: uses correlation as the attention weight
3. STE: discrete selection + continuous gradients
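To make the STE summary concrete, here is a minimal check, assuming PyTorch, that the forward output is discrete while gradients still reach the logits. The function form `straight_through_topk` and the `values` tensor are illustrative assumptions, not part of the training code.

```python
import torch


def straight_through_topk(logits: torch.Tensor, k: int = 1) -> torch.Tensor:
    # Forward pass: hard k-hot selection.
    _, indices = torch.topk(logits, k, dim=-1)
    y_hard = torch.zeros_like(logits).scatter_(-1, indices, 1.0)
    # Backward pass: gradients flow through the softmax term only.
    y_soft = torch.softmax(logits, dim=-1)
    return y_hard - y_soft.detach() + y_soft


logits = torch.tensor([[0.2, 1.5, -0.3]], requires_grad=True)
values = torch.tensor([[1.0, 2.0, 3.0]])  # hypothetical per-feature payoffs

y = straight_through_topk(logits)
loss = (y * values).sum()
loss.backward()

# y is numerically one-hot at index 1, yet logits.grad is non-zero.
grad = logits.grad
```

The gradient equals that of `(softmax(logits) * values).sum()`, which is the usual straight-through approximation.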
or simply
- Token-wise context-dependent correlation → decreased
    else:
        # Stage 2: PPO + context-dependent correlation boost
        raw_logits = self.fc1(observation)  # (batch, seq_len, dict_size)
        if selection_weights is not None:
            # Make correlation boost context-dependent by scaling with observation magnitude
            obs_magnitude = torch.norm(observation, dim=-1, keepdim=True)  # (batch, seq_len, 1)
            # Sigmoid of a non-negative norm, so the scale lies in [0.5, 1], not [0, 1]
            context_scale = torch.sigmoid(obs_magnitude)
            selection_boost = selection_weights.unsqueeze(0).unsqueeze(0) * context_scale  # (batch, seq_len, dict_size)
            raw_logits = raw_logits + selection_boost
- Token position linear freedom → same (emphasizing later tokens actually lowered it)
    else:
        # Stage 2: PPO + token-wise correlation boost
        raw_logits = self.fc1(observation)  # (batch, seq_len, dict_size)
        if selection_weights is not None:
            # Method 1: Position-dependent correlation (different per token)
            batch_size, seq_len, dict_size = raw_logits.shape
            # Create position-dependent correlation weights
            # Use a simple linear combination based on token position
            position_weights = torch.linspace(0.1, 1.0, seq_len, device=raw_logits.device)
            position_weights = position_weights.view(1, seq_len, 1)  # (1, seq_len, 1)
            # Apply position-dependent correlation boost
            selection_boost = selection_weights.unsqueeze(0).unsqueeze(0) * position_weights
            raw_logits = raw_logits + selection_boost
- Attention-based correlation weighting → same
    else:
        # Stage 2: PPO + adaptive correlation boost
        raw_logits = self.fc1(observation)  # (batch, seq_len, dict_size)
        if selection_weights is not None:
            # Option 1: Simple scaling (existing approach)
            # selection_boost = selection_weights.unsqueeze(0).unsqueeze(0)
            # Option 2: Context-dependent scaling
            # Compute attention weights based on PPO logits strength
            ppo_strength = torch.norm(raw_logits, dim=-1, keepdim=True)  # (batch, seq_len, 1)
            # Softmax over the sequence dimension
            attention_weights = torch.softmax(ppo_strength.squeeze(-1), dim=-1).unsqueeze(-1)  # (batch, seq_len, 1)
            # Apply correlation boost proportional to PPO confidence
            selection_boost = selection_weights.unsqueeze(0).unsqueeze(0) * attention_weights
            raw_logits = raw_logits + selection_boost
- Learnable mixing parameter → same
    else:
        # Stage 2: PPO + learnable correlation boost
        raw_logits = self.fc1(observation)  # (batch, seq_len, dict_size)
        if selection_weights is not None:
            # Apply learnable correlation boost
            selection_boost = selection_weights.unsqueeze(0).unsqueeze(0) * self.mixing_weight
            raw_logits = raw_logits + selection_boost
        epsilon = torch.randn_like(raw_logits) * self.epsilon
        raw_logits_noisy = raw_logits + epsilon
        if isinstance(self.act, JumpReLU):
            logits_noisy = self.act(raw_logits_noisy, critic_values)
        else:
Prompts
@train.py @ppo.py @steer.py read line by line and get comprehensive understanding

1 INTRODUCTION

A 2005 Nature study revealed human neurons firing sparsely for specific individuals; these cells responded identically to photos, drawings, or even written names. The neurons were multimodal, encoding pure conceptual information divorced from sensory modality. Rather than relying on extremely distributed coding, empirical evidence from both human brains and neural networks suggests that neurons representing 'single concepts' can indeed exist. In other words, there's no fundamental constraint requiring basis features in superposed neuron activations to be dense.

Drawing inspiration from this, we propose Control Reinforcement Learning (CRL), which forces sparse activation in Transformer-based LLMs, injecting token-specific perturbations to learn the objective function. Recent work in mechanistic interpretability has shown that sparse autoencoders (SAEs) can extract sparse, monosemantic features from superpositioned dense activations (Bricken et al., 2023). Meanwhile, findings in computational neuroscience suggest that brain architectures utilize both densely and sparsely activated neurons. Inspired by this analogy and leveraging the steerable nature of SAE features, we propose a method to steer transformer representations without modifying the model's original parameters. Our approach trains an MLP-based control model that selectively perturbs individual SAE features by observing token-level internal activations and optimizing these perturbations based on verifiable rewards.

However, existing SAE-based steering approaches face significant limitations: (1) contrastive datasets or large activation storage are required to identify the direction of the steering, and (2) they rely on the hidden states of context tokens to select both the features and their coefficients.
To address these limitations, we introduce Adaptive Feature Masking (AFM) and employ a high-epsilon regime to encourage diverse feature discovery. CRL improves performance across diverse tasks including question answering, bias mitigation, jailbreak prevention, hallucination reduction, and multi-step reasoning. Notably, on the jailbreak benchmark XSTest with the Gemma 2 2B model, our method boosts accuracy from 73% to 85% using only 50 training samples. Together, these results demonstrate the universal applicability of CRL across benchmarks and highlight a practical pathway for employing mechanistic interpretability toward the reward-aligned control of AI behavior.

2 BACKGROUND

Mechanistic interpretability aims to reverse-engineer neural networks into human-interpretable components (Olah et al., 2020; Elhage et al., 2021). A central challenge in this endeavor is the superposition phenomenon, where neural networks learn to represent more features than available dimensions (Elhage et al., 2022). This efficient representation strategy complicates efforts to identify the consistent role of specific latent dimensions.

2.1 SPARSE AUTOENCODERS

Sparse Autoencoders (Huben et al., 2023; Bricken et al., 2023) address the superposition problem by learning to decompose neural activations into interpretable, sparse features. Given an activation vector $x \in \mathbb{R}^{d}$, an SAE learns an encoder $f_{\mathrm{enc}} : \mathbb{R}^{d} \to \mathbb{R}^{k}$ and decoder $f_{\mathrm{dec}} : \mathbb{R}^{k} \to \mathbb{R}^{d}$ where $k \gg d$, such that:

$$z = f_{\mathrm{enc}}(x) = \mathrm{Activation}(W_{\mathrm{enc}} x + b_{\mathrm{enc}}) \tag{1}$$

$$\hat{x} = f_{\mathrm{dec}}(z) = W_{\mathrm{dec}} z + b_{\mathrm{dec}} \tag{2}$$

The training objective is usually a combination of reconstruction loss with sparsity regularization:

$$\mathcal{L} = \|x - \hat{x}\|^{2} + \lambda \|z\|_{1} \tag{3}$$

2.2 CONTROL REINFORCEMENT LEARNING FRAMEWORK

We formulate the control of transformer representations as a Markov Decision Process (MDP) where the agent learns to manipulate sparse autoencoder (SAE) features to optimize task-specific rewards.
Let $x^{(\ell)} \in \mathbb{R}^{d}$ denote the residual stream activations at layer $\ell$ for a target token position, where $d$ is the hidden dimension of the transformer model. Our framework supports both single-layer and multi-layer interventions across different transformer layers. Given a pre-trained SAE with encoder $W^{(\ell)}_{\mathrm{enc}} \in \mathbb{R}^{d \times d_{\mathrm{dict}}}$ and decoder $W^{(\ell)}_{\mathrm{dec}} \in \mathbb{R}^{d_{\mathrm{dict}} \times d}$, the sparse feature activations are computed as:

$$f^{(\ell)} = \mathrm{Activation}(x^{(\ell)} W^{(\ell)}_{\mathrm{enc}} + b^{(\ell)}_{\mathrm{enc}}) \tag{4}$$

where $f^{(\ell)} \in \mathbb{R}^{d_{\mathrm{dict}}}$ represents the sparse feature activations and $d_{\mathrm{dict}}$ is the dictionary size.

The MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$ where:
• State Space $\mathcal{S}$: The observation is $s = x^{(\ell)} \in \mathbb{R}^{d}$, the residual stream activation at the target layer and token position.
• Action Space $\mathcal{A}$: For computational simplicity, actions are one-hot vectors $a \in \{0, 1\}^{d_{\mathrm{dict}}}$ selecting a single SAE feature to activate, reducing the exploration challenge in high-dimensional feature spaces.
• Transition Function $\mathcal{P}$: Deterministic transition governed by the transformer's forward pass with steering applied.
• Reward Function $\mathcal{R}$: Task-specific rewards $r$ based on output quality evaluation.

The steering mechanism applies perturbations to the residual stream via:

$$\tilde{x}^{(\ell)} = x^{(\ell)} + a W^{(\ell)}_{\mathrm{dec}} \tag{5}$$

where $a$ is the one-hot action vector selecting which SAE feature to activate, and $\tilde{x}^{(\ell)}$ represents the steered activations.

3 METHOD

Figure 1: Overview of the Control Reinforcement Learning (CRL) framework, showing the interaction between policy network, critic network, and SAE feature steering mechanism.

3.1 TRAINING ARCHITECTURE

Our training architecture consists of a policy network $\pi_\theta$, a critic network $V_\phi$, and the steering mechanism integrated into the transformer's forward pass.

3.1.1 POLICY NETWORK

The policy network $\pi_\theta : \mathbb{R}^{d} \to \mathbb{R}^{d_{\mathrm{dict}}}$ maps residual stream observations to SAE feature selection logits.
We implement this as an MLP:

$$\mu = \pi_\theta(s) \tag{6}$$

$$a \sim \mathrm{Categorical}(\mathrm{softmax}(\mu)) \tag{7}$$

where the action $a$ represents the selected SAE feature index sampled from a categorical distribution over $d_{\mathrm{dict}}$ features. The network depth is controlled by the policy_deep hyperparameter.

3.1.2 CRITIC NETWORK

The critic network $V_\phi : \mathbb{R}^{d} \to \mathbb{R}$ estimates the state value function:

$$V_\phi(s) = \mathbb{E}_{\pi_\theta}[r \mid s] \tag{8}$$

We implement the critic as an MLP with configurable depth controlled by the critic_deep hyperparameter, using various activation functions based on empirical performance.

3.1.3 PPO TRAINING ALGORITHM

We utilize Proximal Policy Optimization (PPO) to train both the policy and critic networks. For our categorical feature selection, the policy network outputs logits over all features, and the PPO objective becomes:

$$L_{\mathrm{policy}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t\right)\right] \tag{9}$$

$$L_{\mathrm{critic}}(\phi) = \mathbb{E}\left[(V_\phi(s) - r)^{2}\right] \tag{10}$$

where $r_t(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}$ is the probability ratio for the selected feature, $A_t = r - V_\phi(s)$ is the advantage estimate, and $\epsilon = 0.2$ is the clipping parameter.

Figure 2: Overall performance comparison across different benchmarks showing CRL improvements over baseline models.

3.2 REWARD SIGNAL DESIGN

We design task-specific reward functions that evaluate output quality. For multiple-choice tasks, we use exact match rewards:

$$r(\hat{y}, y^{*}) = \begin{cases} 1 & \text{if } \hat{y} = y^{*} \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

For tasks requiring partial credit evaluation, we employ token-level F1 scores or other appropriate metrics based on the task requirements.

3.3 ADAPTIVE FEATURE MASKING

To optimize exploration within a constrained feature space, we introduce Adaptive Feature Masking (AFM). This technique dynamically masks certain features during training to encourage the policy network to explore diverse feature combinations and prevent premature convergence to suboptimal feature selections.
The masking strategy operates at three levels:
• None: No masking applied, allowing access to all features
• Generation: Mask features that are not active during generation
• All: Comprehensive masking based on feature importance scores

3.4 EPSILON-GREEDY EXPLORATION

We employ an epsilon-greedy exploration strategy ($\epsilon = 0.01$) to encourage diverse feature discovery during training. This exploration mechanism helps prevent the policy from converging to locally optimal feature selections and promotes the discovery of more effective feature combinations. The exploration mechanism is balanced with exploitation through careful scheduling of the epsilon parameter throughout training, ensuring that the policy maintains sufficient exploration while gradually focusing on high-reward feature selections.
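As a reading aid for Equations 5, 9, and 10 in the draft above, the following is a minimal single-step sketch assuming PyTorch. The function names `steer` and `ppo_losses`, the toy dimensions, and the toy reward and value are illustrative assumptions; this is not the contents of train.py, ppo.py, or steer.py.

```python
import torch
from torch.distributions import Categorical


def steer(x: torch.Tensor, action: torch.Tensor, w_dec: torch.Tensor) -> torch.Tensor:
    """Eq. 5 with a one-hot action: add the selected decoder row to x."""
    return x + w_dec[action]


def ppo_losses(logits, old_log_prob, action, reward, value, clip_eps=0.2):
    """Clipped policy objective (Eq. 9) and squared-error critic loss (Eq. 10)."""
    log_prob = Categorical(logits=logits).log_prob(action)
    ratio = torch.exp(log_prob - old_log_prob)
    advantage = (reward - value).detach()  # A = r - V(s), no gradient to the critic
    surrogate = torch.min(
        ratio * advantage,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage,
    )
    policy_loss = -surrogate.mean()  # negated so gradient descent maximizes Eq. 9
    critic_loss = ((value - reward) ** 2).mean()
    return policy_loss, critic_loss


torch.manual_seed(0)
d, d_dict = 4, 8  # toy sizes
x = torch.zeros(d)
w_dec = torch.randn(d_dict, d)
logits = torch.randn(d_dict, requires_grad=True)

dist = Categorical(logits=logits)
action = dist.sample()
old_log_prob = dist.log_prob(action).detach()
x_steered = steer(x, action, w_dec)

reward = torch.tensor(1.0)                     # e.g. exact-match reward
value = torch.tensor(0.3, requires_grad=True)  # toy critic output
policy_loss, critic_loss = ppo_losses(logits, old_log_prob, action, reward, value)
```

On the first update the ratio is exactly 1, so the policy loss is just the negative advantage; clipping only takes effect once the policy has moved away from the one that sampled the action.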
Target
Edited at 1:45
~/cloudfiles/code/Users/Seonglae.Cho/corr-steer main *1 azureml_py38 azureuser@a100research 22:34:42
❯ python train.py train --layer=global --task=harmbench
/anaconda/envs/azureml_py38/lib/python3.10/site-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "validate" in "CorrConfig" shadows an attribute in parent "BaseModel"
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [01:00<00:00, 30.08s/it]
Training correlations for layers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
Collecting correlations: 0%| Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Device set to use cuda:0
Collecting correlations: 0%|▏ You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a datasets: 8it [00:16, 2.06s/it]
Collecting correlations: 3%|█▋ | 108/4000 [01:10<42:14, 1.54samples/s]
Layer 1: pos 9572 r=0.6924, neg None
Layer 2: pos 6712 r=0.6920, neg None
Layer 3: pos 16207 r=0.6858, neg None
Layer 4: pos 3109 r=0.6960, neg None
Layer 5: pos 11099 r=0.7374, neg None
Layer 6: pos 12241 r=0.7345, neg None
Layer 7: pos 11722 r=0.7794, neg None
Layer 8: pos 8642 r=0.7455, neg None
Layer 9: pos 9298 r=0.7751, neg None
Layer 10: pos 3037 r=0.7230, neg None
Layer 11: pos 6905 r=0.7349, neg None
Layer 12: pos 12039 r=0.7407, neg None
Layer 13: pos 6715 r=0.7092, neg None
Layer 14: pos 2949 r=0.7391, neg None
Layer 15: pos 1570 r=0.7418, neg None
Layer 16: pos 5113 r=0.7427, neg None
Layer 17: pos 5887 r=0.7196, neg None
Layer 18: pos 1411 r=0.7119, neg None
Layer 19: pos 324 r=0.7102, neg None
Layer 20: pos 5192 r=0.7175, neg None
Layer 21: pos 7129 r=0.7211, neg None
Layer 22: pos 3311 r=0.7465, neg None
Layer 23: pos 11246 r=0.7108, neg None
Layer 24: pos 12773 r=0.6995, neg None
Layer 25: pos 3912 r=0.7106, neg None
Global best: Layer 7 using positive feature 11722 with correlation 0.7794
CorrSteer (global) saved to checkpoints/gemma2b_harmbench_global.json
Analyzing top correlation features...
Layer 1: Using positive feature 9572 with coefficient 5.2061 (corr=0.6924) [SAE]
Layer 2: Using positive feature 6712 with coefficient 5.6994 (corr=0.6920) [SAE]
Layer 3: Using positive feature 16207 with coefficient 2.5830 (corr=0.6858) [SAE]
Layer 4: Using positive feature 3109 with coefficient 5.8908 (corr=0.6960) [SAE]
Layer 5: Using positive feature 11099 with coefficient 16.9340 (corr=0.7374) [SAE]
Layer 6: Using positive feature 12241 with coefficient 7.3383 (corr=0.7345) [SAE]
Layer 7: Using positive feature 11722 with coefficient 5.0351 (corr=0.7794) [SAE]
Layer 8: Using positive feature 8642 with coefficient 8.7294 (corr=0.7455) [SAE]
Layer 9: Using positive feature 9298 with coefficient 7.5245 (corr=0.7751) [SAE]
Layer 10: Using positive feature 3037 with coefficient 6.6667 (corr=0.7230) [SAE]
Layer 11: Using positive feature 6905 with coefficient 13.8096 (corr=0.7349) [SAE]
Layer 12: Using positive feature 12039 with coefficient 5.2533 (corr=0.7407) [SAE]
Layer 13: Using positive feature 6715 with coefficient 6.9916 (corr=0.7092) [SAE]
Layer 14: Using positive feature 2949 with coefficient 16.6202 (corr=0.7391) [SAE]
Layer 15: Using positive feature 1570 with coefficient 23.8238 (corr=0.7418) [SAE]
Layer 16: Using positive feature 5113 with coefficient 21.8320 (corr=0.7427) [SAE]
Layer 17: Using positive feature 5887 with coefficient 11.3889 (corr=0.7196) [SAE]
Layer 18: Using positive feature 1411 with coefficient 20.5374 (corr=0.7119) [SAE]
Layer 19: Using positive feature 324 with coefficient 35.6101 (corr=0.7102) [SAE]
Layer 20: Using positive feature 5192 with coefficient 45.6623 (corr=0.7175) [SAE]
Layer 21: Using positive feature 7129 with coefficient 33.2255 (corr=0.7211) [SAE]
Layer 22: Using positive feature 3311 with coefficient 19.0001 (corr=0.7465) [SAE]
Layer 23: Using positive feature 11246 with coefficient 61.6424 (corr=0.7108) [SAE]
Layer 24: Using positive feature 12773 with coefficient 50.3317 (corr=0.6995) [SAE]
Layer 25: Using positive feature 3912 with coefficient 57.4309 (corr=0.7106) [SAE]
Evaluating: 100%|██████████| 280/280 [01:05<00:00, 4.26it/s]
Fixed feature accuracy: 67.50%
Results saved to checkpoints/gemma2b_harmbench_multi_25.json
Evaluation accuracy saved to checkpoints/gemma2b_harmbench_global_accuracy.json (accuracy=67.50%)
Current
~/cloudfiles/code/Users/Seonglae.Cho/ControlRL main ⇡1 +4 !3 · 11m 57s azureml_py38 azureuser@a100research 00:07:19
❯ python train.py train --task=harmbench --layers=all --eval --flatten
Loading checkpoint shards: 100%|██████████| 2/2 [00:49<00:00, 24.77s/it]
wandb: Currently logged in as: seonglae (texonom) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.21.1
wandb: Run data is saved locally in /mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/wandb/run-20250822_002120-iu0mfhzj
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120
wandb: ⭐️ View project at https://wandb.ai/texonom/control_rl
wandb: 🚀 View run at https://wandb.ai/texonom/control_rl/runs/iu0mfhzj
Training Steps: 0%| | 0/14 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Step 0: Avg Train Acc 0.3750, Val Acc 0.5000
Layer 1: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 43, Avg Activation: 0.0000, Avg Act Values: 7.4332, Top Corr: 1.0000 (idx: 14790, coeff: 40.3346)
Layer 2: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 25, Avg Activation: 0.0000, Avg Act Values: 11.2021, Top Corr: 1.0000 (idx: 8863, coeff: 37.5957)
Layer 3: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 40, Avg Activation: 0.0000, Avg Act Values: 7.7847, Top Corr: 0.9998 (idx: 13505, coeff: 45.9490)
Layer 4: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 35, Avg Activation: 0.0000, Avg Act Values: 4.9902, Top Corr: 0.9999 (idx: 5786, coeff: 24.3388)
Layer 5: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 33, Avg Activation: 0.0000, Avg Act Values: 10.6308, Top Corr: 1.0000 (idx: 7569, coeff: 25.9536)
Layer 6: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 36, Avg Activation: 0.0000, Avg Act Values: 7.4637, Top Corr: 1.0000 (idx: 13114, coeff: 23.3237)
Layer 7: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 34, Avg Activation: 0.0000, Avg Act Values: 7.9720, Top Corr: 0.9999 (idx: 11358, coeff: 23.2175)
Layer 8: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 28, Avg Activation: 0.0000, Avg Act Values: 5.9078, Top Corr: 0.9999 (idx: 6221, coeff: 23.3399)
Layer 9: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 24, Avg Activation: 0.0000, Avg Act Values: 9.2464, Top Corr: 1.0000 (idx: 8675, coeff: 12.2697)
Layer 10: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 24, Avg Activation: 0.0000, Avg Act Values: 11.6061, Top Corr: 0.9998 (idx: 8361, coeff: 17.4053)
Layer 11: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 17, Avg Activation: 0.0000, Avg Act Values: 9.9951, Top Corr: 0.9999 (idx: 16251, coeff: 30.7651)
Layer 12: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 23, Avg Activation: 0.0000, Avg Act Values: 16.1377, Top Corr: 1.0000 (idx: 4854, coeff: 15.5613)
Layer 13: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 20, Avg Activation: 0.0000, Avg Act Values: 12.6093, Top Corr: 1.0000 (idx: 15254, coeff: 13.6998)
Layer 14: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 25, Avg Activation: 0.0000, Avg Act Values: 10.6163, Top Corr: 0.9999 (idx: 10643, coeff: 5.3389)
Layer 15: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 19, Avg Activation: 0.0000, Avg Act Values: 9.6818, Top Corr: 0.9999 (idx: 8902, coeff: 6.2690)
Layer 16: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 19, Avg Activation: 0.0000, Avg Act Values: 14.5731, Top Corr: 1.0000 (idx: 5113, coeff: 22.2766)
Layer 17: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 18, Avg Activation: 0.0000, Avg Act Values: 13.4663, Top Corr: 0.9997 (idx: 1200, coeff: 18.2989)
Layer 18: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 18, Avg Activation: 0.0000, Avg Act Values: 21.1193, Top Corr: 0.9999 (idx: 1504, coeff: 7.6450)
Layer 19: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 14, Avg Activation: 0.0000, Avg Act Values: 44.7919, Top Corr: 0.9996 (idx: 9637, coeff: 57.7203)
Layer 20: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 15, Avg Activation: 0.0000, Avg Act Values: 17.9265, Top Corr: 0.9997 (idx: 3423, coeff: 15.1552)
Layer 21: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 11, Avg Activation: 0.0000, Avg Act Values: 20.1886, Top Corr: 0.9992 (idx: 5834, coeff: 83.5779)
Layer 22: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 11, Avg Activation: 0.0000, Avg Act Values: 17.6585, Top Corr: 0.9999 (idx: 14848, coeff: 15.0354)
Layer 23: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 9, Avg Activation: 0.0000, Avg Act Values: 29.8452, Top Corr: 0.9994 (idx: 13403, coeff: 20.9856)
Layer 24: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 12, Avg Activation: 0.0000, Avg Act Values: 25.5663, Top Corr: 0.9999 (idx: 5380, coeff: 16.8423)
Layer 25: Policy Loss 0.0000, Critic Loss 0.0000, Grad Norms (P/C) 0.00/0.00, Recon Loss 0.0000, Unique Indices: 8, Avg Activation: 0.0000, Avg Act Values: 24.8593, Top Corr: 0.9999 (idx: 1558, coeff: 32.6608)
Training Steps: 100%|██████████| 14/14 [03:28<00:00, 14.89s/it]
=== Final Correlation Results ===
Layer 1: Using positive feature 1513 with coefficient 7.5780 (corr=0.7034) [SAE]
Layer 2: Using positive feature 6712 with coefficient 6.1642 (corr=0.7028) [SAE]
Layer 3: Using positive feature 16207 with coefficient 2.6573 (corr=0.7362) [SAE]
Layer 4: Using positive feature 3109 with coefficient 6.1323 (corr=0.7408) [SAE]
Layer 5: Using positive feature 11099 with coefficient 17.5734 (corr=0.7783) [SAE]
Layer 6: Using positive feature 12241 with coefficient 7.7448 (corr=0.7763) [SAE]
Layer 7: Using positive feature 11099 with coefficient 19.6308 (corr=0.7847) [SAE]
Layer 8: Using positive feature 8642 with coefficient 9.1598 (corr=0.7897) [SAE]
Layer 9: Using positive feature 9298 with coefficient 7.5653 (corr=0.7680) [SAE]
Layer 10: Using positive feature 5996 with coefficient 12.2603 (corr=0.7276) [SAE]
Layer 11: Using positive feature 6905 with coefficient 14.4484 (corr=0.7744) [SAE]
Layer 12: Using positive feature 13016 with coefficient 12.3013 (corr=0.7407) [SAE]
Layer 13: Using positive feature 6715 with coefficient 7.1451 (corr=0.7414) [SAE]
Layer 14: Using positive feature 2949 with coefficient 17.1668 (corr=0.7632) [SAE]
Layer 15: Using positive feature 1570 with coefficient 24.0848 (corr=0.7495) [SAE]
Layer 16: Using positive feature 5113 with coefficient 22.6561 (corr=0.7864) [SAE]
Layer 17: Using positive feature 14231 with coefficient 15.3567 (corr=0.7553) [SAE]
Layer 18: Using positive feature 1411 with coefficient 21.9191 (corr=0.7644) [SAE]
Layer 19: Using positive feature 324 with coefficient 36.2297 (corr=0.7207) [SAE]
Layer 20: Using positive feature 14645 with coefficient 15.4051 (corr=0.7253) [SAE]
Layer 21: Using positive feature 7129 with coefficient 34.9367 (corr=0.7407) [SAE]
Layer 22: Using positive feature 3311 with coefficient 19.7669 (corr=0.7797) [SAE]
Layer 23: Using positive feature 11246 with coefficient 64.2619 (corr=0.7479) [SAE]
Layer 24: Using positive feature 12433 with coefficient 62.1589 (corr=0.7288) [SAE]
Layer 25: Using positive feature 3912 with coefficient 60.3746 (corr=0.7381) [SAE]
Config
model: gemma2b
task: harmbench
layers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
select_token: False
decode: False
category: None
cot: False
Evaluating: 100%|██████████| 280/280 [02:11<00:00, 2.14it/s]
Final harmbench Accuracy with Steering: 45.71%
Results saved to ./checkpoints/gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120/harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_steered.json
Stats saved to ./checkpoints/gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120/harmbench_eval.json
Every outputs are saved to the folder ./checkpoints/gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120
wandb:
wandb: 🚀 View run gemma2b_harmbench_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_ppo_1e-05_0822_002120 at: https://wandb.ai/texonom/control_rl/runs/iu0mfhzj
wandb: Find logs at: ../../../../../../../mnt/batch/tasks/shared/LS_root/mounts/clusters/a100research/code/Users/Seonglae.Cho/ControlRL/wandb/run-20250822_002120-iu0mfhzj/logs
~/cloudfiles/code/Users/Seonglae.Cho/ControlRL main ⇡1 +4 !3 · 12m 29s azureml_py38 azureuser@a100research 00:30:21
❯
Layer 1: Using positive feature 1513 with coefficient 7.3448 (corr=0.7034) [SAE]
Layer 2: Using positive feature 6712 with coefficient 5.7849 (corr=0.7028) [SAE]
Layer 3: Using positive feature 16207 with coefficient 2.6573 (corr=0.7362) [SAE]
Layer 4: Using positive feature 3109 with coefficient 6.1323 (corr=0.7408) [SAE]
Layer 5: Using positive feature 11099 with coefficient 17.5734 (corr=0.7783) [SAE]
Layer 6: Using positive feature 12241 with coefficient 7.7448 (corr=0.7763) [SAE]
Layer 7: Using positive feature 11099 with coefficient 19.6308 (corr=0.7847) [SAE]
Layer 8: Using positive feature 8642 with coefficient 9.1598 (corr=0.7897) [SAE]
Layer 9: Using positive feature 9298 with coefficient 7.5653 (corr=0.7680) [SAE]
Layer 10: Using positive feature 5996 with coefficient 12.2603 (corr=0.7276) [SAE]
Layer 11: Using positive feature 6905 with coefficient 14.4484 (corr=0.7744) [SAE]
Layer 12: Using positive feature 13016 with coefficient 12.3013 (corr=0.7407) [SAE]
Layer 13: Using positive feature 6715 with coefficient 7.1451 (corr=0.7414) [SAE]
Layer 14: Using positive feature 2949 with coefficient 16.9027 (corr=0.7632) [SAE]
Layer 15: Using positive feature 1570 with coefficient 24.0848 (corr=0.7495) [SAE]
Layer 16: Using positive feature 5113 with coefficient 22.6561 (corr=0.7864) [SAE]
Layer 17: Using positive feature 14231 with coefficient 15.3567 (corr=0.7553) [SAE]
Layer 18: Using positive feature 1411 with coefficient 21.2447 (corr=0.7644) [SAE]
Layer 19: Using positive feature 324 with coefficient 36.2297 (corr=0.7207) [SAE]
Layer 20: Using positive feature 14645 with coefficient 14.6941 (corr=0.7253) [SAE]
Layer 21: Using positive feature 7129 with coefficient 34.3992 (corr=0.7407) [SAE]
Layer 22: Using positive feature 3311 with coefficient 19.4628 (corr=0.7797) [SAE]
Layer 23: Using positive feature 11246 with coefficient 63.2732 (corr=0.7479) [SAE]
Layer 24: Using positive feature 12433 with coefficient 61.2026 (corr=0.7288) [SAE]
Layer 25: Using positive feature 3912 with coefficient 59.4457 (corr=0.7381) [SAE]
Seonglae Cho