CorrSteer Math Theory

Creator: Seonglae Cho
Created: 2025 Oct 17 20:42
Edited: 2025 Oct 17 20:47

V4

\onecolumn
\section{Appendix}
\subsection{Theoretical Foundations}
\label{app:theory}
We provide theoretical motivation for CorrSteer's design choices regarding sample requirements, correlation stability, and coefficient selection. These results explain observed empirical phenomena and guide hyperparameter selection, though formal optimality guarantees remain an open problem.

\paragraph{Proposition 1 (Sparsity-Sample Relationship).} Let $X_j \sim \mathrm{Bernoulli}(s)$ indicate whether feature $i$ activates on sample $j$, where $s$ is the activation frequency (a sparser feature has smaller $s$). To obtain at least $k$ positive activations with confidence $1-\delta$, a sample size of order $N = \Theta(k/s)$ suffices for fixed $\delta$.
\textbf{Justification:} The expected number of activations in $N$ samples is $\mathbb{E}[\sum_{j=1}^N X_j]=Ns$. By Hoeffding's inequality, for $Ns \ge k$,
\[ \Pr\!\left[\sum_{j=1}^N X_j < k\right] \le \exp\!\Big(-\frac{2(Ns-k)^2}{N}\Big) = \exp\!\Big(-2N\big(s-\tfrac{k}{N}\big)^2\Big). \]
Requiring this probability to be at most $\delta$ gives the condition $Ns \ge k + \sqrt{N\log(1/\delta)/2}$, which holds whenever $N \ge 2k/s + \log(1/\delta)/(2s^2)$. The $1/s^2$ term reflects the looseness of Hoeffding's additive bound for small $s$. The multiplicative Chernoff bound $\Pr[\sum_{j=1}^N X_j < Ns/2] \le \exp(-Ns/8)$ is tighter: it shows that $N \ge \max\{2k/s,\, 8\log(1/\delta)/s\}$ suffices, so the \emph{sufficient sample size} is $N = O\big((k+\log(1/\delta))/s\big)$, which is $\Theta(k/s)$ for fixed $\delta$ because $Ns \gtrsim k$ is also needed for the expected count to reach $k$. This scaling formalizes the intuition that a lower activation frequency demands proportionally more total samples.

\paragraph{Proposition 2 (Correlation Estimation Variance).} For Pearson correlation estimated from $n$ active samples (where the feature actually fires), under normality assumptions,
\[ \mathrm{Var}(r) \approx \frac{(1-r^2)^2}{n-3} \approx \frac{(1-r^2)^2}{n}. \]
Since the expected number of active samples is $n \approx sN$,
\[ \boxed{\mathrm{Var}(r) \propto \frac{1}{sN}}, \]
implying that sparser features yield higher variance for fixed $N$.
\textbf{Limitations:} This relies on Fisher's z-transformation approximation, which assumes bivariate normality and large $n$. SAE activations are sparse (mostly zeros), non-negative (ReLU), and potentially heavy-tailed, violating these assumptions.
A more rigorous approach would employ bootstrap or permutation-based confidence intervals. However, the scaling relationship $\mathrm{Var}(r) \propto 1/n$ holds approximately in practice: correlation estimates stabilize beyond $N \approx 4000$ samples, and performance variance decreases predictably with sample size (\hyperref[fig:gemma2b_mmlu_progress]{\textcolor{linkblue}{Figure~\ref*{fig:gemma2b_mmlu_progress}}}).

\paragraph{Coefficient Selection.} During inference, we apply additive steering: $z_{i,j} \to z_{i,j} + c_i$. Following prior work on activation steering~\citep{rimsky-etal-2024-steering, zhao-etal-2025-steering}, we adopt the average positive activation as the steering coefficient:
\[ c_i = \mathbb{E}[z_{i,j} \mid y>0]. \]
This choice is a practical heuristic rather than a theoretical optimum. One alternative is contrastive steering (CAA)~\citep{rimsky-etal-2024-steering}, which uses $c_i = \mathbb{E}[z_{i,j} \mid y>0] - \mathbb{E}[z_{i,j} \mid y\le 0]$ but requires paired contrastive data that are unavailable for generation tasks. The average positive coefficient is simple, requires no hyperparameters, and performs comparably to or better than contrastive methods when applied to generation-time features (\hyperref[tab:gemma_accuracy]{\textcolor{linkblue}{Table~\ref*{tab:gemma_accuracy}}}). Mean-pooling is less sensitive to outliers than max-pooling in long-generation tasks (Section~\ref{method:pooling}), and validation-based pruning (CorrSteer-P) provides empirical calibration when needed.
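
\paragraph{Illustrative Estimate of $c_i$.} As a minimal sketch of how the conditional expectation $\mathbb{E}[z_{i,j} \mid y>0]$ might be estimated in practice (the notation $P$ and $\hat{c}_i$ and the toy numbers are assumptions introduced here for illustration, not values from our experiments), let $P = \{j : y_j > 0\}$ index the samples with a positive outcome and replace the expectation by a sample mean:
\[ \hat{c}_i = \frac{1}{|P|} \sum_{j \in P} z_{i,j}. \]
For four samples with pooled activations $(2.0,\, 0.0,\, 4.0,\, 1.0)$ and outcomes $y = (1,\, 0,\, 1,\, 0)$, this gives $\hat{c}_i = (2.0 + 4.0)/2 = 3.0$, whereas the contrastive variant would give $3.0 - (0.0 + 1.0)/2 = 2.5$.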

\paragraph{Empirical Validation.} These results explain observed patterns in our experiments: \textbf{(1)} SimpleQA features with near-zero activation frequency show limited improvement due to insufficient effective samples (Proposition 1, \hyperref[fig:frequency-comparison]{\textcolor{linkblue}{Figure~\ref*{fig:frequency-comparison}}}); \textbf{(2)} GSM8K with only 1,000 samples exhibits high performance variance in CorrSteer-A (std=24.43 vs.\ 0.35 for non-steered), consistent with $\mathrm{Var}(r) \propto 1/N$ (Proposition 2, \hyperref[tab:gemma_accuracy]{\textcolor{linkblue}{Table~\ref*{tab:gemma_accuracy}}}); \textbf{(3)} HarmBench features with near-100\% activation frequency achieve stable performance with minimal samples (108 samples, low variance); \textbf{(4)} Mean-pooling for coefficient calculation outperforms max-pooling in long-generation tasks, as max-pooling produces outlier-sensitive coefficients (Coefficient Selection, Section~\ref{method:pooling}).
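
\paragraph{Illustrative Calculation.} As a rough sketch of how Propositions 1 and 2 translate into numbers, consider hypothetical values chosen purely for illustration (they are not taken from our experiments): activation frequency $s = 0.01$, target $k = 30$ active samples, confidence parameter $\delta = 0.05$, and correlation $r = 0.3$. Using the standard multiplicative Chernoff bound $\Pr[\sum_{j=1}^N X_j < Ns/2] \le \exp(-Ns/8)$, a sufficient sample size is
\[ N \ge \max\!\Big\{\frac{2k}{s},\ \frac{8\log(1/\delta)}{s}\Big\} = \max\{6000,\ 2397\} = 6000, \]
where $8\log(20)/0.01 \approx 2397$. With $N = 6000$, the expected number of active samples is $n = sN = 60$, so under the normal approximation
\[ \mathrm{Var}(r) \approx \frac{(1-0.3^2)^2}{60-3} = \frac{0.8281}{57} \approx 0.0145, \qquad \sqrt{\mathrm{Var}(r)} \approx 0.12. \]
Quadrupling the sample size to $N = 24000$ gives $n = 240$ and $\sqrt{\mathrm{Var}(r)} \approx \sqrt{0.8281/237} \approx 0.059$, roughly halving the standard error, as the $1/(sN)$ scaling predicts. These figures inherit the normality caveats noted for Proposition 2 and should be read as order-of-magnitude guidance only.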

V3

\onecolumn
\section{Appendix}
\subsection{Theoretical Foundations}
\label{app:theory}
We provide theoretical motivation for CorrSteer's design choices regarding sample requirements, correlation stability, and coefficient selection. These results explain observed empirical phenomena and guide hyperparameter selection, though formal optimality guarantees remain an open problem.

\paragraph{Proposition 1 (Sparsity-Sample Relationship).} Let $X_j \sim \mathrm{Bernoulli}(s)$ indicate whether feature $i$ activates on sample $j$, where $s$ is the activation frequency (a sparser feature has smaller $s$). To obtain at least $k$ positive activations with confidence $1-\delta$, a sample size of order $N = \Theta(k/s)$ suffices for fixed $\delta$.
\textbf{Justification:} The expected number of activations in $N$ samples is $\mathbb{E}[\sum_{j=1}^N X_j]=Ns$. By Hoeffding's inequality, for $Ns \ge k$,
\[ \Pr\!\left[\sum_{j=1}^N X_j < k\right] \le \exp\!\Big(-\frac{2(Ns-k)^2}{N}\Big) = \exp\!\Big(-2N\big(s-\tfrac{k}{N}\big)^2\Big). \]
Requiring this probability to be at most $\delta$ gives the condition $Ns \ge k + \sqrt{N\log(1/\delta)/2}$, which holds whenever $N \ge 2k/s + \log(1/\delta)/(2s^2)$. The $1/s^2$ term reflects the looseness of Hoeffding's additive bound for small $s$. The multiplicative Chernoff bound $\Pr[\sum_{j=1}^N X_j < Ns/2] \le \exp(-Ns/8)$ is tighter: it shows that $N \ge \max\{2k/s,\, 8\log(1/\delta)/s\}$ suffices, so the \emph{sufficient sample size} is $N = O\big((k+\log(1/\delta))/s\big)$, which is $\Theta(k/s)$ for fixed $\delta$ because $Ns \gtrsim k$ is also needed for the expected count to reach $k$. This scaling formalizes the intuition that a lower activation frequency demands proportionally more total samples.

\paragraph{Proposition 2 (Correlation Estimation Variance).} For Pearson correlation estimated from $n$ active samples (where the feature actually fires), under normality assumptions,
\[ \mathrm{Var}(r) \approx \frac{(1-r^2)^2}{n-3} \approx \frac{(1-r^2)^2}{n}. \]
Since the expected number of active samples is $n \approx sN$,
\[ \boxed{\mathrm{Var}(r) \propto \frac{1}{sN}}, \]
implying that sparser features yield higher variance for fixed $N$.
\textbf{Limitations:} This relies on Fisher's z-transformation approximation, which assumes bivariate normality and large $n$. SAE activations are sparse (mostly zeros), non-negative (ReLU), and potentially heavy-tailed, violating these assumptions.
A more rigorous approach would employ bootstrap or permutation-based confidence intervals. However, the scaling relationship $\mathrm{Var}(r) \propto 1/n$ holds approximately in practice: correlation estimates stabilize beyond $N \approx 4000$ samples, and performance variance decreases predictably with sample size (\hyperref[fig:gemma2b_mmlu_progress]{\textcolor{linkblue}{Figure~\ref*{fig:gemma2b_mmlu_progress}}}). \paragraph{Proposition 3 (Average Positive Coefficient as Empirical Heuristic).} During inference, we apply additive steering: $z_{i,j} \to z_{i,j} + c_i$. Following prior work on activation steering~\citep{rimsky-etal-2024-steering, zhao-etal-2025-steering}, we adopt the average positive activation as the steering coefficient: \[ \boxed{c_i = \mathbb{E}[z_{i,j} \mid y>0]}. \] \textbf{Heuristic rationale (not proven):} This choice can be motivated by minimizing L2 deviation in activation space: \[ \mathcal{L}(c_i)=\mathbb{E}_{\text{test}}[(z_{i,j}+c_i-z_{i,j}^*)^2], \] where $z_{i,j}^*$ represents a hypothetical "ideal" activation for correct predictions. Under two strong assumptions---that positive training activations approximate ideal test behavior ($\mathbb{E}_{\text{test}}[z_{i,j}^*] \approx \mathbb{E}[z_{i,j} \mid y>0]$) and test baseline activations are near-zero ($\mathbb{E}_{\text{test}}[z_{i,j}] \approx 0$)---this yields $c_i^* = \mathbb{E}[z_{i,j} \mid y>0]$. \textbf{Critical limitations:} \textbf{(1)} The connection between L2 deviation in activation space and task performance improvement is \emph{assumed}, not proven. We lack theoretical justification for why minimizing activation deviation should maximize task accuracy or minimize cross-entropy loss. \textbf{(2)} The "ideal activation" $z_{i,j}^*$ is unobservable and conceptually circular (we define it as what leads to correct predictions, then use correct predictions to estimate it). 
\textbf{(3)} Alternative formulations exist: contrastive steering (CAA)~\citep{rimsky-etal-2024-steering} uses $c_i = \mathbb{E}[z_{i,j} \mid y>0] - \mathbb{E}[z_{i,j} \mid y\le 0]$, which does not require the zero-baseline assumption but needs paired contrastive data.
\textbf{Empirical justification:} Our approach is best understood as an empirically validated heuristic rather than a principled optimum. The average positive coefficient is simple, requires no hyperparameters, and performs comparably to or better than contrastive methods when applied to generation-time features (\hyperref[tab:gemma_accuracy]{\textcolor{linkblue}{Table~\ref*{tab:gemma_accuracy}}}).
\textbf{Failure modes and mitigation:} If the first assumption fails (positive training activations do not match ideal test behavior), the coefficient may not generalize. If the second assumption fails (the feature is already active at test time), steering may overshoot and degrade performance. When these conditions are partially violated, empirical calibration via scalar adjustment $\alpha c_i^*$ can restore performance, as observed in validation-based pruning (CorrSteer-P). Mean-pooling is also less sensitive to outliers than max-pooling (Section~\ref{method:pooling}), which adds robustness.

\paragraph{Empirical Validation.} These propositions explain observed patterns in our experiments: \textbf{(1)} SimpleQA features with near-zero activation frequency show limited improvement due to insufficient effective samples (Proposition 1, \hyperref[fig:frequency-comparison]{\textcolor{linkblue}{Figure~\ref*{fig:frequency-comparison}}}); \textbf{(2)} GSM8K with only 1,000 samples exhibits high performance variance in CorrSteer-A (std=24.43 vs.\ 0.35 for non-steered), consistent with $\mathrm{Var}(r) \propto 1/N$ (Proposition 2, \hyperref[tab:gemma_accuracy]{\textcolor{linkblue}{Table~\ref*{tab:gemma_accuracy}}}); \textbf{(3)} HarmBench features with near-100\% activation frequency achieve stable performance with minimal samples (108 samples, low variance); \textbf{(4)} Mean-pooling for coefficient calculation outperforms max-pooling in long-generation tasks, as max-pooling produces outlier-sensitive coefficients (Proposition 3, Section~\ref{method:pooling}).

V2

\subsection{Theoretical Foundations}
\label{app:theory}
We provide theoretical motivation for CorrSteer's design choices regarding sample requirements, correlation stability, and coefficient selection. These results explain observed empirical phenomena and guide hyperparameter selection, though formal optimality guarantees remain an open problem.

\paragraph{Proposition 1 (Sparsity-Sample Relationship).} Let $X_j \sim \mathrm{Bernoulli}(s)$ indicate whether feature $i$ activates on sample $j$, where $s$ is the activation frequency (a sparser feature has smaller $s$). To obtain at least $k$ positive activations with confidence $1-\delta$, a sample size of order $N = \Theta(k/s)$ suffices for fixed $\delta$.
\textbf{Justification:} The expected number of activations in $N$ samples is $\mathbb{E}[\sum_{j=1}^N X_j]=Ns$. By Hoeffding's inequality, for $Ns \ge k$,
\[ \Pr\!\left[\sum_{j=1}^N X_j < k\right] \le \exp\!\Big(-\frac{2(Ns-k)^2}{N}\Big) = \exp\!\Big(-2N\big(s-\tfrac{k}{N}\big)^2\Big). \]
Requiring this probability to be at most $\delta$ gives the condition $Ns \ge k + \sqrt{N\log(1/\delta)/2}$, which holds whenever $N \ge 2k/s + \log(1/\delta)/(2s^2)$. The $1/s^2$ term reflects the looseness of Hoeffding's additive bound for small $s$. The multiplicative Chernoff bound $\Pr[\sum_{j=1}^N X_j < Ns/2] \le \exp(-Ns/8)$ is tighter: it shows that $N \ge \max\{2k/s,\, 8\log(1/\delta)/s\}$ suffices, so the \emph{sufficient sample size} is $N = O\big((k+\log(1/\delta))/s\big)$, which is $\Theta(k/s)$ for fixed $\delta$ because $Ns \gtrsim k$ is also needed for the expected count to reach $k$. This scaling formalizes the intuition that a lower activation frequency demands proportionally more total samples.

\paragraph{Proposition 2 (Correlation Estimation Variance).} For Pearson correlation estimated from $n$ active samples (where the feature actually fires), under normality assumptions,
\[ \mathrm{Var}(r) \approx \frac{(1-r^2)^2}{n-3} \approx \frac{(1-r^2)^2}{n}. \]
Since the expected number of active samples is $n \approx sN$,
\[ \boxed{\mathrm{Var}(r) \propto \frac{1}{sN}}, \]
implying that sparser features yield higher variance for fixed $N$.
\textbf{Limitations:} This relies on Fisher's z-transformation approximation, which assumes bivariate normality and large $n$. SAE activations are sparse and non-negative (ReLU), violating normality.
However, the scaling relationship $\mathrm{Var}(r) \propto 1/n$ remains empirically valid for large samples. \paragraph{Proposition 3 (Average Positive Coefficient as Heuristic).} During inference, we apply additive steering: $z_{i,j} \to z_{i,j} + c_i$. If we model the steering objective as minimizing L2 deviation from an ideal activation pattern, \[ \mathcal{L}(c_i)=\mathbb{E}_{\text{test}}[(z_{i,j}+c_i-z_{i,j}^*)^2], \] the optimal coefficient is $c_i^* = \mathbb{E}_{\text{test}}[z_{i,j}^*] - \mathbb{E}_{\text{test}}[z_{i,j}]$. \textbf{Heuristic assumptions:} \textbf{(A1)} \emph{Generalization:} Positive training samples approximate desired test behavior: $\mathbb{E}_{\text{test}}[z_{i,j}^*] \approx \mathbb{E}_{\text{train}}[z_{i,j} \mid y>0]$. This assumes positive training samples generalize to test, which may fail under distribution shift. \textbf{(A2)} \emph{Baseline sparsity:} Test samples have low baseline activation: $\mathbb{E}_{\text{test}}[z_{i,j}] \approx 0$. This holds when: (i) ReLU induces sparsity (most features inactive most of the time), or (ii) test samples lack the task-relevant pattern until steered. This fails when test samples already strongly activate the feature. Under (A1) and (A2), \[ \boxed{c_i^* \approx \mathbb{E}[z_{i,j} \mid y>0]}, \] the average positive activation. \textbf{Why L2 and not cross-entropy?} We steer intermediate activations (residual stream), not final logits. The connection between activation steering and task loss is indirect. L2 deviation serves as a tractable proxy for "deviation from desired activation state." This is a \emph{heuristic}, not a theorem; empirical validation is essential. \textbf{Failure modes and mitigation:} If (A1) fails (train/test mismatch), the coefficient may not generalize. If (A2) fails (feature already active), steering may overshoot, causing performance degradation. 
When these conditions are partially violated, empirical calibration via scalar adjustment $\alpha c_i^*$ can restore performance, as observed in validation-based pruning (CorrSteer-P). Mean-pooling is also less sensitive to outliers than max-pooling (Section~\ref{method:pooling}), which adds robustness.

\paragraph{Empirical Validation.} These results are consistent with several empirical observations: \textbf{(1)} SimpleQA features with near-zero activation frequency show limited improvement due to insufficient effective samples (Proposition 1); \textbf{(2)} GSM8K with only 1,000 samples exhibits high variance in CorrSteer-A performance (Proposition 2, \hyperref[tab:gemma_accuracy]{\textcolor{linkblue}{Table~\ref*{tab:gemma_accuracy}}}); \textbf{(3)} HarmBench features with near-100\% activation frequency achieve stable performance with minimal samples; \textbf{(4)} Mean-pooling for coefficient calculation outperforms max-pooling in long-generation tasks, as max-pooling produces outlier-sensitive coefficients that deviate from the L2-motivated choice (Proposition 3, Section~\ref{method:pooling}).
 
 
 
 
 

V1

\subsection{Theoretical Foundations} \label{app:theory} \paragraph{Theorem 1 (Sparsity–Sample Relationship).} Let $X_j \sim \mathrm{Bernoulli}(s)$ denote activation of feature $i$ on sample $j$. The expected number of activations in $N$ samples is $\mathbb{E}[\sum_{j=1}^N X_j]=Ns$. By Hoeffding's inequality, \[ \Pr\!\left[\sum_{j=1}^N X_j < k\right] \le \exp\!\Big(-\frac{2(Ns-k)^2}{N}\Big) = \exp\!\Big(-2N\big(s-\tfrac{k}{N}\big)^2\Big). \] For confidence $1-\delta$, it suffices that \[ Ns \ge k + \sqrt{\frac{N\log(1/\delta)}{2}} \;\Rightarrow\; \boxed{N = \Theta(k/s)}. \] Hence, low sparsity $s$ requires proportionally more total samples. \paragraph{Theorem 2 (Correlation Estimation Variance).} For Pearson correlation estimated from $n$ effective (active) samples, \[ \mathrm{Var}(r) \approx \frac{(1-r^2)^2}{n}. \] Since $n \approx sN$ by Theorem~1, \[ \boxed{\mathrm{Var}(r) \approx \frac{(1-r^2)^2}{sN}}, \] implying that sparser features yield higher variance for fixed $N$. \paragraph{Theorem 3 (Average Positive Coefficient Optimality).} Consider feature $i$ with activations $z_{i,j}$ on sample $j$. During inference, we apply additive steering: the steered activation becomes $z_{i,j} + c_i$ where $c_i$ is the steering coefficient. Define the L2 loss over test samples with unknown labels: \[ \mathcal{L}(c_i) = \mathbb{E}_{\text{test}}\!\big[(z_{i,j} + c_i - z_{i,j}^*)^2\big], \] where $z_{i,j}^*$ is the ideal activation for correct prediction on sample $j$. Minimizing $\mathcal{L}$ yields \[ \frac{\partial \mathcal{L}}{\partial c_i} = 2\mathbb{E}_{\text{test}}[z_{i,j} + c_i - z_{i,j}^*]=0 \;\Rightarrow\; c_i^* = \mathbb{E}_{\text{test}}[z_{i,j}^*] - \mathbb{E}_{\text{test}}[z_{i,j}]. \] \textbf{Key assumptions:} (1) The ideal activation on positive samples approximates the desired test behavior: $\mathbb{E}_{\text{test}}[z_{i,j}^*] \approx \mathbb{E}[z_{i,j} \mid y>0]$. 
(2) Test samples have low baseline activation: $\mathbb{E}_{\text{test}}[z_{i,j}] \approx 0$ (due to ReLU sparsity or task mismatch). Under these assumptions, \[ \boxed{c_i^* = \mathbb{E}[z_{i,j} \mid y>0] - 0 = \mathbb{E}[z_{i,j} \mid y>0]}, \] the average positive activation, which minimizes expected L2 deviation. \paragraph{Empirical Validation.} These theoretical results directly explain several empirical observations: \textbf{(1)} SimpleQA features with near-zero activation frequency show limited improvement due to insufficient effective samples (Theorem 1); \textbf{(2)} GSM8K with only 1,000 samples exhibits high variance in CorrSteer-A performance (Theorem 2, \hyperref[tab:gemma_accuracy]{\textcolor{linkblue}{Table~\ref*{tab:gemma_accuracy}}}); \textbf{(3)} HarmBench features with near-100\% activation frequency achieve stable performance with minimal samples; \textbf{(4)} Mean-pooling for coefficient calculation outperforms max-pooling in long-generation tasks, as max-pooling produces outlier-sensitive coefficients that violate L2 optimality (Theorem 3, Section~\ref{method:pooling}).
 
 
 
 
