CRL Result

Creator
Seonglae Cho
Created
2025 May 2 0:51
Edited
2025 Dec 2 0:14
Refs
CRL Train

If feature interpretability turns out to be poor, shift the focus away from interpretability to something else, such as separating fine-tuning dynamics.

All of the following runs were done at the same layer:
  • CRL: 64.25%
  • BBQ Ambig, random: 58.39%, 58.35%, 58.34%, 58.37%, 58.33%, 58.42%
    • mask random: 60.17%, 60.16%
  • top: 59.69%, 60.19%
  • BBQ Disambig, random: mask 84.04%, 84.64%; nonmask 79.55%, 84.65%
    • top: 84.53%, 80.69%
  • HarmBench, random: 45.71%, 45.00% (47.14%, 47.50%); top: 45.71%, 50.36% (48.57%)
    • random without generation: 46.79%, 47.14%
  • MMLU, random: 54.64%, 54.74%; top: 54.55%, 54.54%
 
| Task | Non-steered | Random Feature | Random + AFM | Top Feature | CRL (Ours) |
|------|-------------|----------------|--------------|-------------|------------|
| BBQ Ambig | $60.17_{\pm 0.01}$ | $58.36_{\pm 0.03}$ | $60.16_{\pm 0.01}$ | $59.94_{\pm 0.25}$ | $65.86_{\pm 3.03}$ |
| HarmBench | $41.46_{\pm 9.05}$ | $45.35_{\pm 0.35}$ | $46.96_{\pm 0.17}$ | $48.03_{\pm 2.32}$ | $49.12_{\pm 1.59}$ |
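
The ± entries in the summary table look like a mean and a standard deviation over repeated runs. As a minimal sketch, assuming the population standard deviation is used (an assumption, not stated in these notes), the two "top" runs for BBQ Ambig from the raw list above (59.69 and 60.19) give a result consistent with the Top Feature cell. The helper name `summarize` is hypothetical.

```python
from statistics import mean, pstdev


def summarize(runs):
    """Return (mean, population std) of accuracy runs given in percent."""
    return mean(runs), pstdev(runs)


# Two "top" runs for BBQ Ambig, copied from the raw numbers in these notes.
top_runs = [59.69, 60.19]
m, s = summarize(top_runs)
print(f"{m:.2f} +/- {s:.2f}")  # compare with the Top Feature cell for BBQ Ambig
```

Using `statistics.stdev` (sample standard deviation) instead would give a larger spread for two runs, so it is worth recording which estimator the table uses.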
\begin{table}[h]
\centering
\caption{Performance results for Gemma 2 2B model across different tasks using \textbf{single-layer} CRL-Token. The table shows task type, intervention layer, baseline accuracy (Before), CRL accuracy (After), and improvement in percentage points.}
\label{tab:gemma_results}
\begin{tabular}{lccccc}
\toprule
\textbf{Task} & \textbf{Type} & \textbf{Layer} & \textbf{Before} & \textbf{After} & \textbf{Improvement} \\
\midrule
MMLU & Multi-choice QA & 24 & 51.90, 52.23 & 55.19, 55.45, 55.48 & +3.29 \\
MMLU-Pro & Multi-choice QA & 25 & 30.30$\pm$00 & 30.44, 30.54 & +0.14 \\
BBQ Ambig & Bias QA (ambiguous) & 5 & 60.18, 60.16 & 63.71, 68.00 & +3.55 \\
BBQ Disambig & Bias QA (disambiguated) & 5 & 84.75, 84.01 & 84.95 & +0.94 \\
SimpleQA & Short-form QA & 8 & 3.78$\pm$0.17 & 3.76, 3.93, 4.32 & +0.13 \\
GSM8k & Math reasoning & 24 & 54.51, 54.74 & 55.88, 55.42 & +1.14 \\
HarmBench & Adversarial safety & 21 & 31.25, 48.50, 44.64 & 50.25, 48.00 & +5.61 \\
XSTest & Over-refusal & 12 & 86.35 & 86.98, 88.57, & +0.63 \\
\bottomrule
\end{tabular}
\end{table}
3D visualization, like
Personal Paper Visualization

Steer RL Experiment Dataset

The SAE feature set in a layer matters more than the critic network
  • HumanEval (164 samples) to test universality

reasoning

  • AIME 2024
  • GPQA Diamond
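  • Multi-layer decode comes out at 16% → apply decode at the first layer only
    • Add the amplified bias at the first layer and apply only the direction at the remaining layers. If the first layer is always added, then when decode is specified only one layer gets amplified decoding.
      Intuitively nonshared should be the better choice, but if shared comes out better, think about why: denoising? overfitting?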
    • shared?
    • nonshared?
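
One possible reading of the "amplified bias at the first layer, direction only elsewhere" note can be sketched as follows. Everything here is an assumption made for illustration: the function names `add_scaled` and `steer`, the parameters `coeff` and `amp`, and the idea that "direction" means a per-layer SAE decoder direction while "bias" means a per-layer decoder bias. It is not the actual CRL implementation.

```python
def add_scaled(h, v, scale):
    """Return h + scale * v for plain Python vectors."""
    return [hi + scale * vi for hi, vi in zip(h, v)]


def steer(hidden_by_layer, direction_by_layer, bias_by_layer, coeff, amp):
    """Steer each layer along its direction; add the amplified bias only at the first layer."""
    layers = sorted(hidden_by_layer)
    first = layers[0]
    steered = {}
    for layer in layers:
        # Every steered layer gets the direction term (assumed: coeff * direction).
        h = add_scaled(hidden_by_layer[layer], direction_by_layer[layer], coeff)
        if layer == first:
            # Only the first layer also gets the amplified bias (assumed: amp * bias).
            h = add_scaled(h, bias_by_layer[layer], amp)
        steered[layer] = h
    return steered


# Toy two-dimensional example with made-up values.
hidden = {5: [0.0, 0.0], 6: [0.0, 0.0]}
direction = {5: [1.0, 0.0], 6: [0.0, 1.0]}
bias = {5: [0.5, 0.5], 6: [0.5, 0.5]}
out = steer(hidden, direction, bias, coeff=2.0, amp=3.0)
print(out)
```

In this toy run, layer 5 receives both terms and layer 6 receives only the direction term, which is the asymmetry the note describes.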

Critic analysis

Loss is lower toward the later layers

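Multi-layer control observations

  • shared gives a much higher validation score
Early layers use a less diverse set of features. There are many quantitative analyses like this to add, beyond performance.
  • Consecutive layers give higher performance and require larger activation adjustments
  • Later layers are more efficient than early ones
  • multi k
    • No real difference: rather than combinations widening the action space, results were simply similar
    • subtract / minus
      • Performance breaks down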
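
The observation that early layers use a less diverse set of features could be quantified along these lines. This is a minimal sketch under stated assumptions: the function `feature_diversity`, the dictionary `picks_by_layer`, and the feature indices in it are all hypothetical and made up for illustration, standing in for whatever log of selected SAE features a run produces.

```python
import math
from collections import Counter


def feature_diversity(selected):
    """Return (number of distinct features, Shannon entropy in bits) of the selections."""
    counts = Counter(selected)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return len(counts), entropy


# Hypothetical logs of selected feature indices per layer (made-up values).
picks_by_layer = {
    5: [10, 10, 10, 42],   # few distinct features, low entropy
    21: [3, 17, 42, 99],   # all distinct, higher entropy
}

for layer, picks in picks_by_layer.items():
    n_unique, entropy_bits = feature_diversity(picks)
    print(layer, n_unique, round(entropy_bits, 3))
```

Reporting both the distinct count and the entropy helps separate "few features are ever picked" from "many features are picked but one dominates".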
AI Control with RL Results
 
 
 
 
 
  • Agent Graph - Demo AAAI
  • CRL - ICLR, NeurIPS workshop, August
  • CorrSteer -
 
 
 
 
