CRL Results

If feature interpretability turns out to be poor, shift the focus away from interpretability to something else, such as separating fine-tuning dynamics.

All runs were done at the same layer.

CRL: 64.25%

BBQ Ambig, random: 58.39%, 58.35%, 58.34%, 58.37%, 58.33%, 58.42%

mask random: 60.17%, 60.16%

top: 59.69%, 60.19%

BBQ Disambig, random: mask 84.04%, 84.64%; nonmask 79.55%, 84.65%

top: 84.53%, 80.69%

HarmBench, random: 45.71%, 45.00% (47.14%, 47.50%); top: 45.71%, 50.36% (48.57%)

random without generation: 46.79%, 47.14%

MMLU, random: 54.64%, 54.74%; top: 54.55%, 54.54%


| Task | Non-steered | Random Feature | Random + AFM | Top Feature | CRL (Ours) |
|------|-------------|----------------|--------------|-------------|------------|
| BBQ Ambig | $60.17_{\pm 0.01}$ | $58.36_{\pm 0.03}$ | $60.16_{\pm 0.01}$ | $59.94_{\pm 0.25}$ | $65.86_{\pm 3.03}$ |
| HarmBench | $41.46_{\pm 9.05}$ | $45.35_{\pm 0.35}$ | $46.96_{\pm 0.17}$ | $48.03_{\pm 2.32}$ | $49.12_{\pm 1.59}$ |
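
The mean ± std cells in the table above can be recomputed from the raw per-run accuracies. This is a minimal sketch, assuming the ± values are standard deviations over repeated runs. The notes do not say whether the population or the sample convention was used, so both are exposed. The helper names `summarize` and `latex_cell` are hypothetical.

```python
from statistics import mean, pstdev, stdev


def summarize(runs, sample=False):
    """Return (mean, std) over per-run accuracies given in percent.

    sample=False -> population std (divide by n)
    sample=True  -> sample std (divide by n - 1); needs at least 2 runs
    """
    spread = stdev(runs) if sample else pstdev(runs)
    return mean(runs), spread


def latex_cell(runs, sample=False):
    """Format runs as a cell in the table's style: $mean_{\\pm std}$."""
    m, s = summarize(runs, sample)
    return f"${m:.2f}_{{\\pm {s:.2f}}}$"


# BBQ Ambig, top-feature runs copied from the raw numbers above.
top_feature = [59.69, 60.19]
print(latex_cell(top_feature))               # population std
print(latex_cell(top_feature, sample=True))  # sample std
```

With only two or three runs per cell the two conventions give noticeably different spreads, so it is probably worth fixing one convention for every cell of the table.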



\begin{table}[h]
  \centering
  \caption{Performance of the Gemma 2 2B model across tasks using \textbf{single-layer} CRL-Token. The table shows task type, intervention layer, baseline accuracy (Before), CRL accuracy (After), and improvement in percentage points.}
  \label{tab:gemma_results}
  \begin{tabular}{lccccc}
  \toprule
  \textbf{Task} & \textbf{Type} & \textbf{Layer} & \textbf{Before} & \textbf{After} & \textbf{Improvement} \\
  \midrule
  MMLU & Multi-choice QA & 24 & 51.90, 52.23 & 55.19, 55.45, 55.48 & +3.29 \\
  MMLU-Pro & Multi-choice QA & 25 & $30.30 \pm 0.00$ & 30.44, 30.54 & +0.14 \\
  BBQ Ambig & Bias QA (ambiguous) & 5 & 60.18, 60.16 & 63.71, 68.00 & +3.55 \\
  BBQ Disambig & Bias QA (disambiguated) & 5 & 84.75, 84.01 & 84.95 & +0.94 \\
  SimpleQA & Short-form QA & 8 & $3.78 \pm 0.17$ & 3.76, 3.93, 4.32 & +0.13 \\
  GSM8k & Math reasoning & 24 & 54.51, 54.74 & 55.88, 55.42 & +1.14 \\
  HarmBench & Adversarial safety & 21 & 31.25, 48.50, 44.64 & 50.25, 48.00 & +5.61 \\
  XSTest & Over-refusal & 12 & 86.35 & 86.98, 88.57 & +0.63 \\
  \bottomrule
  \end{tabular}
\end{table}

Something like a 3D visualization

Personal Paper Visualization