If feature interpretability turns out to be poor, shift the focus away from interpretability toward something else, such as separating fine-tuning dynamics.
All runs below were done at the same layer.
- 64.25% crl
- bbq ambig random 58.39% 58.35%, 58.34% 58.37% 58.33% 58.42%
- mask random 60.17% 60.16%
- top 59.69% 60.19%
- bbq disambig random mask 84.04% 84.64% nonmask 79.55% 84.65%
- top 84.53% 80.69%
- harmbench random 45.71% 45.00% (47.14% 47.50%) top 45.71% 50.36% (48.57%)
- random without generation 46.79% 47.14%
- mmlu random 54.64% 54.74% top 54.55% 54.54%
| Task | Non-steered | Random Feature | Random + AFM | Top Feature | CRL (Ours) |
|--------|-------------|----------------|--------------|-------------|------------|
| BBQ Ambig | $60.17_{\pm 0.01}$ | $58.36_{\pm 0.03}$ | $60.16_{\pm 0.01}$ | $59.94_{\pm 0.25}$ | $65.86_{\pm 3.03}$ |
| HarmBench | $41.46_{\pm 9.05}$ | $45.35_{\pm 0.35}$ | $46.96_{\pm 0.17}$ | $48.03_{\pm 2.32}$ | $49.12_{\pm 1.59}$ |
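The table cells above appear to be the mean and spread of the raw runs listed earlier. A minimal sketch of that aggregation, assuming population standard deviation (this reproduces the Top Feature BBQ Ambig cell from the runs 59.69 and 60.19; whether every cell was computed this way is an assumption). The helper names `summarize` and `fmt_cell` are illustrative only.

```python
from statistics import mean, pstdev


def summarize(runs):
    """Return (mean, population std) for a list of accuracy percentages."""
    return mean(runs), pstdev(runs)


def fmt_cell(runs):
    """Format runs as a table cell in the style $mean_{\\pm std}$."""
    m, s = summarize(runs)
    return f"${m:.2f}_{{\\pm {s:.2f}}}$"


# BBQ Ambig, top-feature runs taken from the raw list above.
top_feature_bbq_ambig = [59.69, 60.19]
print(fmt_cell(top_feature_bbq_ambig))  # prints $59.94_{\pm 0.25}$
```

Using the sample standard deviation instead (`statistics.stdev`) would give a larger spread for cells with only two runs, so the choice is worth stating in the caption.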
\begin{table}[h]
\centering
\caption{Performance results for the Gemma 2 2B model across different tasks using \textbf{single-layer} CRL-Token. The table shows task type, intervention layer, baseline accuracy (Before), CRL accuracy (After), and improvement in percentage points.}
\label{tab:gemma_results}
\begin{tabular}{lccccc}
\toprule
\textbf{Task} & \textbf{Type} & \textbf{Layer} & \textbf{Before} & \textbf{After} & \textbf{Improvement} \\
\midrule
MMLU & Multi-choice QA & 24 & 51.90, 52.23 & 55.19, 55.45, 55.48 & +3.29 \\
MMLU-Pro & Multi-choice QA & 25 & 30.30 $\pm$ 0.00 & 30.44, 30.54 & +0.14 \\
BBQ Ambig & Bias QA (ambiguous) & 5 & 60.18, 60.16 & 63.71, 68.00 & +3.55 \\
BBQ Disambig & Bias QA (disambiguated) & 5 & 84.75, 84.01 & 84.95 & +0.94 \\
SimpleQA & Short-form QA & 8 & 3.78 $\pm$ 0.17 & 3.76, 3.93, 4.32 & +0.13 \\
GSM8k & Math reasoning & 24 & 54.51, 54.74 & 55.88, 55.42 & +1.14 \\
HarmBench & Adversarial safety & 21 & 31.25, 48.50, 44.64 & 50.25, 48.00 & +5.61 \\
XSTest & Over-refusal & 12 & 86.35 & 86.98, 88.57 & +0.63 \\
\bottomrule
\end{tabular}
\end{table}
3D visualization, like Personal Paper Visualization
Steer RL Experiment Dataset
SAE feature set in layer is more important than critic network
- MMLU, MMLU Pro
- HumanEval (164 samples) to test universality
reasoning
- AIME 2024
- GPQA Diamond
- Multi-layer decode comes out at 16% → decode at the first layer only
- shared?
- nonshared?
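Add the amplified bias at the first layer and apply only the direction at the remaining layers. If the addition always happens at the first layer, then when decode is specified, only one layer uses amplified decoding.
Common sense says nonshared, but if shared comes out better, think about why: denoising? overfitting?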
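A rough sketch of the scheme in the note above: the first steered layer gets direction plus bias, and every later layer gets the direction only. Everything here is an assumption made for illustration: the function name `steer_layers`, the use of a single scalar `coeff` as the amplification, and treating `bias` as a fixed vector added once. Plain Python lists stand in for hidden-state vectors.

```python
def steer_layers(hiddens, directions, bias, coeff):
    """Steer a stack of per-layer hidden vectors.

    hiddens:    list of per-layer hidden vectors (lists of floats)
    directions: list of per-layer steering directions, same shapes
    bias:       bias vector, added at the first steered layer only (assumption)
    coeff:      scalar amplification applied to each direction (assumption)
    """
    steered = []
    for i, (h, d) in enumerate(zip(hiddens, directions)):
        delta = [coeff * dj for dj in d]
        if i == 0:
            # Only the first layer receives the bias term.
            delta = [x + b for x, b in zip(delta, bias)]
        steered.append([hj + x for hj, x in zip(h, delta)])
    return steered


# Toy example: two layers, three hidden dimensions.
hiddens = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
directions = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
out = steer_layers(hiddens, directions, bias=[1.0, 0.0, 0.0], coeff=2.0)
print(out)  # [[3.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
```

Adding the bias at one layer only avoids stacking the same offset repeatedly through the residual stream, which is one plausible reading of why the note restricts amplified decoding to a single layer.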
Critic analysis
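Loss is lower toward the later end
Multi-layer control observations
- shared gets a much higher validation score
Early layers use a less diverse set of features. There are many quantitative analyses like this to add, beyond performance alone.
- Consecutive layers give higher performance and require larger activation adjustments
- Later layers are more efficient than early layers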
- multi k
- No real difference: rather than the combinations widening the action space, results were about the same
- subtract / minus
- Performance breaks
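The observation above that early layers use a less diverse set of features could be quantified per layer. A minimal sketch, assuming the selected feature index is logged for each steering step; the metric choice (distinct count plus Shannon entropy), the function name `diversity`, and the feature ids below are all hypothetical, not taken from the experiments.

```python
from collections import Counter
from math import log2


def diversity(selected):
    """Return (number of distinct features, Shannon entropy in bits)."""
    counts = Counter(selected)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return len(counts), entropy


# Hypothetical logs of selected feature ids, one list per layer.
selections = {
    "layer 5": [10, 10, 10, 10],   # always the same feature
    "layer 24": [10, 42, 7, 99],   # a different feature each time
}
for layer, feats in selections.items():
    print(layer, diversity(feats))
```

Entropy separates a layer that picks a few features almost all the time from one that spreads selections evenly, which a plain distinct count would not show.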
AI Control with RL Results

- Agent Graph - Demo AAAI
- CRL - ICLR, NeurIPS workshop August
- CorrSteer -
Seonglae Cho