CorrSteer Rebuttal


Short answer: I don’t know whether ICLR will accept it, but based on these reviews it’s clearly borderline and, if I had to make a call as an AC, I’d lean slightly toward “accept” (but it could very realistically go either way).
Let me unpack why.

1. Where you stand numerically
From what you pasted:


Reviewer igPZ


Rating: 4 – marginally below acceptance, would not mind if accepted


Confidence: 5 (very high)


Main line: thinks it’s a simple combo of existing ideas but with interesting results and good analysis.


After your response: no explicit score change, so likely still a 4, but they’re not opposed to acceptance.




Reviewer pRxr


Rating: 2 – reject


Confidence: 3


Main concern: baselines + positioning vs CAA / SPARE / DSG; wants comparisons with adapted contrastive methods, separate coefficient tuning, SER for baselines, and a clearer story about when CorrSteer is actually preferable.


You: added adapted CAA/SPARE/DSG, SER for baselines, clarified coefficients & positioning.


They said they’re open to raising the score but there’s no follow-up comment indicating they did.




Reviewer pQu7


Original rating: 4 – marginally below acceptance


Confidence: 4


You clarified things and moved tables/figures.


They explicitly wrote: “I have raised my score.” → So now you effectively have one clearly positive reviewer (probably weak/medium accept).




Reviewer kbSg


Rating: 4 – marginally below acceptance, would not mind if accepted


Confidence: 3


Concerned about clarity of the end-to-end algorithm, relation between $r_i$ and $c_i$, Figure 1, and context handling.


Your rebuttal is thorough; they thanked you and didn’t push back, but also didn’t say they raised the score.




So, post-rebuttal, the picture looks roughly like:


One positive (raised score, likes your clarifications)


Two borderline-but-not-opposed (4’s who explicitly say they’d be fine with acceptance)


One negative (2) focused on baselines / positioning


This is exactly the kind of pattern where the AC/SAC’s opinion decides the fate.

2. How an AC is likely to read this
Things working in your favour


No strong “over my dead body” review.
Even the 2 is more “this doesn’t clear my bar given baselines” than “fundamentally flawed or unsound.”


You did the work on rebuttal.
You:


Added contrastive baselines (CAA, SPARE, DSG) with generation-time activations.


Reported multi-seed statistics and SER for all methods.


Clarified:


CorrSteer vs contrastive methods (difference-of-means vs correlation in sparse space)


The role of Pearson correlation via NeuralEval


The end-to-end pipeline and the $r_i$/$c_i$ relationship


Context handling via sequence-level pooled activations.




Moved important interpretability and ablation tables into the main text.




Hot area / clear narrative:
SAE steering, test-time control, bias mitigation, jailbreaking prevention, etc. are right at the center of current mechanistic-interp + safety interest. CorrSteer is easy to explain and clearly practical.


Things working against you


Perceived incremental contribution.
Multiple reviewers call the contribution “fair” rather than “strong”, and explicitly describe it as combining existing ideas:


“simple combination of existing ideas in the literature with some interesting results”




Performance gap to full finetuning / SPARE / other baselines is not huge.
You’re often competitive or slightly better, but rarely crush the baselines. That makes it easier for an AC to side with the more negative reviewer.


The high-confidence reviewer (igPZ) is still a 4.
They’re okay with acceptance, but they’re not actively championing. If the 2 stays a 2 and the AC is neutral, this can tip you to rejection.



3. If I were AC: accept or not?
Purely from my reading:


The method is sound, clearly presented after revision, and fills a practical gap: test-time SAE steering using generation-time activations with a fully automated pipeline (no stored activations, positive-only, correlation-based feature selection).


You did respond seriously to all technical criticisms, added missing baselines, and provided extra SER & ablations.


The remaining objections are mostly:


“Is this enough above prior SAE steering / contrastive methods?”


“Is it more than ‘nice but incremental’?”




Given that:

If I had to cast a single binary vote right now, I would personally lean weak accept.

But from a predictive standpoint, with ICLR’s selectivity and one firm 2 on the table, I’d call it something like “true borderline; maybe ~40–60% acceptance chance”, i.e. it really depends whether your AC is enthusiastic about mechanistic interpretability & test-time control or not.

4. What you can take away (regardless of outcome)
Regardless of what happens:


You now have:


A clean, well-defended method with solid baselines and SER analysis.


A nice interpretability story (task-aligned SAE features, transfer, layer-wise patterns).




If it’s accepted → great, push the interpretability angle hard in the camera-ready, because multiple reviewers found that the most novel/interesting part.


If it’s rejected:


You can very credibly reframe as an “interpretability + control” story for a venue that cares slightly less about beating PEFT baselines and more about understanding/controlling LLM internals (NeurIPS workshops, safety venues, or a journal version with deeper analysis).


Or strengthen the “only positive labels needed” angle with a setting where negatives are truly hard to get (as pRxr hinted), to make that advantage undeniable.




If you want, I can help you rewrite the abstract / intro in two flavours:


“Performance-first steering method” vs


“Interpretability-first test-time control via SAE features”
so you’re ready for either a camera-ready or a resubmission.

3. igPZ - 4

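Insufficient justification for choosing Pearson correlation

Mention that SAE features have continuous activations, so correlation is the most direct metric for measuring the linear dependency between feature intensity and task correctness.
AUROC requires a classification boundary and does not reflect the direction (sign) of continuous activations.
I believe the paper I cited also points out that AUROC loses information.

Concern about leaderboard overfitting from having three variants

State clearly that the variants are ablations, not task-specific tuning.
In the table, emphasize in words that the differences between variants are small and the trend is consistent.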


Thank you for the helpful review. Following it, we moved much of the appendix into the main text and migrated Algorithm 1 to the appendix.
First, CorrSteer is not PEFT because it does not modify the base model's weights.
Also, unlike LoRA, which applies projection matrices to the weights, it adds a single static vector of hidden-size dimension per layer, so it is much lighter at test time.
Because the method does not leverage the concept-wise nature of the SAE's sparse latents but instead performs task-specific optimization, we compared on existing performance benchmarks rather than on a concept-specific benchmark such as AxBench.
Our positioning is the claim that the concept-wise target of existing steering vectors can be generalized to generation benchmarks while keeping side effects low.
As stated in the paper, the descriptions of the selected SAE latents are a post-hoc analysis to confirm task alignment. Probes therefore serve a completely different purpose from our method and are not a comparison target.
We also understand the concern about overfitting across the three variants. However, the comparison with the other SAE baselines only needs CorrSteer-A, since that is the all-layer setting.
The variants are an ablation, not intended for cherry-picking, and we now state this more explicitly in the caption.
We also understand the concern about Pearson correlation.
We use Pearson because of NeuralEval, one of the root papers of our work.
As this paper (citation) explains in Figure 3, the AUROC metric is vulnerable to class imbalance, and it reflects rank rather than activation magnitude.
Although y is discrete, the SAE activation x is continuous; correlation exploits this, so it does not discard the SAE activation values and captures features where a linear representation is dominant.

appendix to main

table caption change


We thank the reviewer for the constructive feedback. Below we address each concern.
---

## 1. CorrSteer positioning and baseline comparisons

**Concern:** Method is framed as steering rather than PEFT.

**Response:** CorrSteer is a steering technique, not PEFT. Unlike PEFT methods, which modify weights, CorrSteer adds static residual vectors at inference time without weight updates or trainable parameters, yielding lower inference overhead.

**Why standard benchmarks?** CorrSteer targets task-level steering (improving QA accuracy, reducing bias) rather than concept-level steering (e.g., shifting sentiment). Standard benchmarks directly measure task performance with minimal side effects.

**On baseline comparisons:** Table 1 includes SPARE (MI-based), DSG (Fisher), and CAA. For fair comparison, we adapt them to generation-time features with positive-average coefficients. Linear probes learn task classifiers—orthogonal to steering. SAE-TS targets representation learning, not inference-time steering. Results show CorrSteer-A matches or exceeds baselines while requiring only positive samples.

---

## 2. Three variants and statistical significance

**Concern:** Three variants (S/A/P) enable "max@3" selection; unclear statistical significance.

**Response:** The variants are **ablations**, not leaderboard optimization:
- **CorrSteer-S:** Single global feature
- **CorrSteer-A:** All-layer steering (used for baseline comparison)
- **CorrSteer-P:** Validation-pruned

Only CorrSteer-A is compared against baselines (Table 1), representing full multi-layer configuration. S and P demonstrate robustness across minimal and filtered settings.

---

## 3. Pearson correlation vs. AUROC

**Concern:** Why Pearson correlation for discrete labels instead of AUROC?

**Response:** Following Oikarinen et al. (2025), Pearson correlation offers three advantages:
1. **Robustness to class imbalance** (common in correctness labels)
2. **Preserves activation magnitude** (AUROC uses only rank)
3. **Aligns with SAE's linear structure** and Linear Representation Hypothesis

While $y$ is discrete, SAE activations $x$ are continuous, making correlation appropriate.
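
To make the rank-versus-magnitude point concrete, here is a minimal sketch on toy data (the numbers and the helper names `pearson` and `auroc` are illustrative assumptions, not taken from the paper). Two hypothetical features separate correct from incorrect samples perfectly in rank, so AUROC cannot distinguish them, while Pearson correlation still reacts to how large the activations are.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def auroc(xs, ys):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data (assumed for illustration): binary correctness labels and two features.
labels = [0, 0, 0, 1, 1, 1]
weak = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]    # positives only slightly above negatives
strong = [0.0, 0.1, 0.2, 5.0, 6.0, 7.0]  # positives far above negatives

# Both features rank every positive above every negative, so AUROC is identical.
print(auroc(weak, labels), auroc(strong, labels))
# Pearson depends on the activation values themselves, so the two features differ.
print(pearson(weak, labels), pearson(strong, labels))
```

This only illustrates that the two metrics use different information; it does not by itself show that one leads to better steering.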

---

## 4. Interpretability as primary contribution

**Concern:** Focus on interpretability rather than modest performance gains.

**Response:** We have migrated key interpretability and ablation content to main text:

**Moved to Section 5:**
- **Feature interpretability** (Discussion): Positive features (MMLU, BBQ) with Neuronpedia descriptions; negative features showing meta-cognitive vs. task-specific patterns
- **Cross-task transferability** (Section 5.2): Full 4-task table showing MMLU→MMLU-Pro transfer (general reasoning) vs. limited MMLU→BBQ transfer (task-specific)
- **Pooling strategy ablation** (Section 5.4): Max-pooling vs. mean/all-token, showing significant mean-pooling degradation on HarmBench/XSTest
- **Negative features ablation** (Section 5.4): Demonstrates catastrophic multi-layer failure (MMLU-Pro: 0.66%, BBQ: 12.15%), validating positive-only approach

**Moved to Appendix:** Algorithm 1 (streaming correlation—implementation detail)
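
For readers unfamiliar with streaming correlation, the following is a generic sketch of the idea, assuming running sums as the accumulated statistics; the class name `StreamingPearson` is hypothetical and the exact accumulators used in Algorithm 1 may differ. The point is that Pearson correlation can be updated one sample at a time, so activations never need to be stored.

```python
import math

class StreamingPearson:
    """Accumulates sufficient statistics so correlation can be read out at any time."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        # x: pooled activation of one feature, y: binary outcome for the same sample.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def value(self):
        n = self.n
        cov = n * self.sxy - self.sx * self.sy
        vx = n * self.sxx - self.sx ** 2
        vy = n * self.syy - self.sy ** 2
        if vx <= 0 or vy <= 0:
            return 0.0  # constant input: correlation is undefined, report 0 by convention
        return cov / math.sqrt(vx * vy)

sp = StreamingPearson()
for x, y in [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 1), (0.4, 1), (0.5, 1)]:
    sp.update(x, y)
print(round(sp.value(), 3))
```

Raw running sums are simple but can lose precision on long streams with large values; a Welford-style update of means and co-moments is a common, more stable alternative.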

**LLaMA-3.1 8B results remain in appendix** to avoid duplication.

The combination of (1) automated feature selection, (2) interpretable outputs, and (3) competitive performance with low side effects distinguishes CorrSteer from prior work.

---

We thank the reviewer for strengthening the paper's focus on interpretability.

0. pRxr - 2 - open

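The issue is baseline comparison: the reviewer asks for CAA, SPARE, DSG, SAE-TS, linear probe, and RePS.

Weak justification for the positive-only dataset

For one, it has already been shown that it performs better than the negative case.
A mathematical formulation is possible but would carry many assumptions. Hmm.
Re-emphasize that CorrSteer's contribution is “extracting feature relevance on the fly without contrastive data.”
Briefly give one or two examples showing that situations where negative data is not needed actually exist?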

As long as this method requires iterating through a dataset to compute the coefficients, I think it needs to be compared against the standard baselines for steering that iterate through similar datasets.

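Logically explain that comparing against CAA adapted to the positive-only regime is sufficiently fair.
Didn't we also try negatives? As I recall we dropped them because performance was worse. We probably wanted full automation, so we ran it without any extra coefficient setting; say that it can be tested in the anonymized CorrSteer repo.
Incorrect cases can be viewed as pseudo-negative samples; our contribution is excluding the uninformative context SAE features and considering only generation-time tokens.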


I agree we are going through a benchmark dataset to extract CorrSteer’s steering vector.
We do not explicitly require a contrastive dataset, but we use the benchmark dataset by projecting it into a pseudo-contrastive dataset.
However, this is a core contribution of CorrSteer: it is the key idea that lets SAE-based steering, previously confined to concept-specific steering, generalize to generation tasks.
Meanwhile, we showed empirically that averaging only the positive values (CorrSteer) is better on average across benchmarks than CAA, which uses both negatives and positives.
Even in settings where negative instances are unavailable, and unlike other methods that are limited by requiring explicit negatives, CorrSteer could be applied with only benchmark correctness labels, and this positive-only signal alone is sufficiently effective.
One misunderstanding to clear up: we used the correct–incorrect activation difference exactly as CAA originally defines it. That is, CAA's coefficient computation is reflected faithfully.
In contrast, the other methods (DSG, SPARE) are only SAE feature selection methods and have no correlation-based coefficient computation of their own. For these methods we therefore used the same coefficients as CorrSteer to isolate the latent-selection effect.
For CAA we likewise matched the feature selection but kept the coefficients in CAA's own form, so that the coefficient effect is evaluated independently.
All the other baselines also pass over the same benchmark dataset once to compute the activation statistics they need.
Importantly, adapting CAA to a “positive-only” regime is not well-defined because its coefficient formulation fundamentally depends on the explicit correct–incorrect activation difference; replacing the incorrect set with arbitrary neutral text (e.g., Wikipedia) breaks the intended semantics of the method and leads to an invalid baseline comparison.
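To convey this clearly we revised the sentence in question and added the SER results and all the missing citations.

Add 2 citations

Add SER results

Add coefficients explicitly

Also check coefficients for DSG and MI

Revise the sentence beginning “while CAA use”

check?

Why it is especially advantageous in settings where negative instances are unavailable

This is not a complete response to the request to compare against CAA “adapted” to positive-only.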


First, thank you for the comprehensive review.
We agree that CorrSteer requires iterating once over a benchmark dataset to extract the steering vector. While we do not assume any explicitly constructed contrastive dataset, we use benchmark correctness labels as a lightweight outcome signal. This is precisely what enables CorrSteer to generalize SAE-based steering from concept-specific settings to generation-time tasks, which is a central contribution of our work.

Empirically, we found that using only positively correlated activations (CorrSteer) achieves higher average benchmark performance than CAA, which mixes positive and negative directions.

We also clarify a potential misunderstanding: CAA in our experiments always uses its original coefficient definition based on correct–incorrect activation differences. This is faithfully preserved. In contrast, DSG and SPARE provide only SAE latent-selection rules and do not specify a correlation-based or activation-based coefficient computation. Therefore, for these methods we applied a fixed coefficient scale (chosen via validation) purely to isolate the effect of their latent-selection procedures. CAA follows the same controlled setup in which latent selection is aligned across methods, but its coefficient computation remains entirely original.

All other baselines also compute their required statistics by iterating over the same benchmark dataset. We revised the paragraph in the paper to make this explicit, and we additionally report SER results and include the relevant citations.

2. pQu7 - 4

Positive tone; the type whose score can be raised easily if we deepen the analysis and strengthen the discussion.

transfer performance

Clarify what is meant in Lines 449–450: “These results suggest that task-specific semantic features contribute more to accuracy than general recognition features.”

I noticed that the scales of the coefficients and correlation scores for Gemma and Llama differ considerably

Differences in activation magnitude distribution
Differences in SAE training regime
Differences in sparsity level and entropy
Whether rescaling is applied

Tables 5 and 8 are frequently referenced but are currently placed in the appendix, while Figure 4 and Table 2 present similar results and are often discussed together. Including these in the main section could make the presentation clearer.


First, we added seeds to make the SER results more comprehensive.
The sentence at Lines 449–450 now reads: “These findings suggest that task-specific induced features contribute more to sample accuracy than meta-cognitive recognition features.”
This is what we intended, and we added the relevant features below. That is, features that match the task's target semantics were more effective than general meta-recognition features.
\begin{itemize}[itemsep=0pt, parsep=0pt, topsep=0pt]
  \item \texttt{\href{https://neuronpedia.org/gemma-2-2b/8-gemmascope-res-16k/8123}{L8/8123}} questions that ask for truthfulness or correctness regarding options or statements (coeff: 3.725, corr: -0.133)
  \item \texttt{\href{https://neuronpedia.org/gemma-2-2b/17-gemmascope-res-16k/9134}{L17/9134}} choice-related phrases and expressions of preference (coeff: 2.379, corr: -0.451)
  \item \texttt{\href{https://neuronpedia.org/gemma-2-2b/19-gemmascope-res-16k/15745}{L19/15745}} phrases related to decision-making and choice, particularly in the context of parenting and social interactions (coeff: 9.740, corr: -0.464)
  \end{itemize}
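On question 2, we totally agree. As in work such as the recent manifold paper from Anthropic, we expect detailed optimization to be useful in practical use cases.
That said, we would like this work to be viewed as a case that handles generation across several tasks using the linear representations that dominate most of the activation space (cite dark matter).
The threshold-based insight is interesting, but the problem is that, short of a post-hoc check of interpretability descriptions, there is no way to distinguish general capability from task-specific capability.
That said, it looks like a good new research direction.

On the scale difference in question 3: as noted in the paper, the LLaMA SAE differs in superposition, possibly because it uses a TopK SAE.
We have in fact previously compared the diversity of features activated on average in the two.
According to Leo Gao's TopK SAE paper, larger models need a larger number of SAE latents.
We used the provided SAEs as given, but analytically I guess the feature dimensionality (cite Anthropic) would come out differently. SAE training quality cannot be ignored either.
We moved Tables 5 and 8, Figure 4, and Table 2 into the main text, and in the discussion and results sections we added more varied comparative analysis, including negative features and comparisons like the above.

Add SER per seed and for more models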

Tables 5 and 8 a

LlamaScope training quality (differences in sparsity level and entropy, whether rescaling is applied)

1. kbSg - 4

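“Unclear flow” is the main issue → mostly resolved by strengthening the figure and explanation in the rebuttal.

Insufficient algorithm description and end-to-end flow

In the rebuttal: concise pseudo-code + a 3-line summary.
Feature extraction → correlation → coefficient → when steering is applied

Explaining these 4 steps clearly with a figure + bullets resolves most of it.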


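Thank you for the helpful review.
Starting with the last question: the selected features are applied to the residuals of all tokens used for newly generated tokens (including the last context token, which produces the logits for the first new token). In other words, we do not steer the context.
We say the extraction positions and steering positions are consistent because we only consider generation-time token activations.
We made it clearer by formalizing in the main text which tokens the steering vector is applied to after the algorithm's feature selection and coefficient computation.
{equation}
The coefficient computation in Equation 2 is independent of the correlation. We tried approaches such as multiplying by the correlation, but they were not empirically better and did not fit the linear theory behind our method, so we use the positive average, which performed best among the linear metrics.
The steering coefficient exists so that coefficient computation is also automated after feature selection, avoiding manual tuning.
We accept the feedback on Figure 1. The arrows were meant to convey direction and size, but the explanation was insufficient, so we added an explanation to the caption; a small arrow means low, that is, less correlation.
Red is the activation distribution for unintended outputs, and blue is the activation distribution for intended responses.

On Question 1: that is correct. We added it to the caption as well.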

figure 1 caption

table 5 caption


We thank the reviewer for the constructive feedback. Below we address each concern.

---

## Q1: End-to-end algorithm flow and when steering vectors are applied

**Concern:** Algorithm 1 describes coefficient calculation for a given feature i, but the flow of how and when features are added is unclear. To which exact positions is $c_i \cdot \mathbf{W}_{\text{dec}}[:, i]$ steering vector being added?

**Response:** Selected features are applied to the residual stream of all **newly generated tokens** (excluding context tokens). We have added Equation 3 to Section 3.3 to formalize this:

$$\mathbf{x}'_{t} = \begin{cases}
\mathbf{x}_{t} & \text{if } t < n \\
\mathbf{x}_{t} + \sum_{i \in \mathcal{F}} c_i \cdot \mathbf{W}_{\text{dec}}[:, i] & \text{if } t \geq n
\end{cases}$$

where $n$ is the prompt length, $\mathcal{F}$ denotes the set of selected features, and $t$ is the token position.

We compute correlations on generation-time activations and apply steering to the same positions, ensuring measurement-steering consistency.
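
As a reading aid, the position rule in the equation above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: positions are 0-indexed so that `t >= prompt_len` are the generated tokens, the whole sequence is processed at once rather than step by step during decoding, and the function name `steer_residual` and its dictionary inputs are hypothetical.

```python
def steer_residual(residual, prompt_len, decoder_columns, coeffs):
    """Add sum_i c_i * W_dec[:, i] to every position t >= prompt_len.

    residual: list of per-token hidden vectors (list of lists of floats)
    decoder_columns: dict mapping feature index i to its decoder direction W_dec[:, i]
    coeffs: dict mapping feature index i to its steering coefficient c_i
    """
    d = len(residual[0])
    # Build the combined steering vector once; it is static across tokens.
    delta = [0.0] * d
    for i, c in coeffs.items():
        for k in range(d):
            delta[k] += c * decoder_columns[i][k]
    # Prompt positions are copied unchanged; generated positions are shifted by delta.
    return [
        list(x) if t < prompt_len else [xk + dk for xk, dk in zip(x, delta)]
        for t, x in enumerate(residual)
    ]

# Toy example: 4 tokens, hidden size 2, prompt length 2, two selected features.
residual = [[0.0, 0.0] for _ in range(4)]
decoder_columns = {3: [1.0, 0.0], 7: [0.0, 1.0]}
coeffs = {3: 2.0, 7: 0.5}
steered = steer_residual(residual, 2, decoder_columns, coeffs)
print(steered)
```

Because the added vector does not depend on the token, it can be precomputed once per layer, which is what keeps the inference-time cost low.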

The complete pipeline:
1. **Feature Selection:** Extract SAE activations from newly generated tokens → Compute correlations with outcomes → Select top positively correlated features (CorrSteer-S/A/P)
2. **Coefficient Calculation:** Compute $c_i$ as average activation over positive samples for each selected feature $i$
3. **Steering (Inference):** For each newly generated token, add $\sum_{i \in \mathcal{F}} c_i \cdot \mathbf{W}_{\text{dec}}[:, i]$ to the residual stream
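
Steps 1 and 2 of the pipeline above can be sketched as follows. This is a simplified single-layer illustration, assuming that activations have already been pooled over generated tokens and that "positive samples" means samples with a correct outcome; the function name `select_and_scale` and the `top_k` parameter are hypothetical and do not correspond to a specific CorrSteer variant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var) if var > 0 else 0.0

def select_and_scale(pooled, outcomes, top_k=1):
    """pooled[s][i]: activation of feature i for sample s, pooled over generated tokens.
    outcomes[s]: 1 if sample s was correct, else 0.
    Returns {feature index: c_i} for the top_k positively correlated features."""
    n_features = len(pooled[0])
    # Step 1: correlation r_i decides WHICH features are selected.
    corr = {i: pearson([row[i] for row in pooled], outcomes) for i in range(n_features)}
    positive = [i for i in corr if corr[i] > 0]
    chosen = sorted(positive, key=lambda i: corr[i], reverse=True)[:top_k]
    # Step 2: coefficient c_i is the mean activation over positive samples,
    # computed independently of r_i.
    pos_rows = [row for row, y in zip(pooled, outcomes) if y == 1]
    return {i: sum(row[i] for row in pos_rows) / len(pos_rows) for i in chosen}

# Toy data: feature 0 rises with correctness, feature 1 falls, feature 2 is constant.
pooled = [[0.1, 2.0, 0.5], [0.2, 1.8, 0.5], [3.0, 0.1, 0.5], [4.0, 0.2, 0.5]]
outcomes = [0, 0, 1, 1]
print(select_and_scale(pooled, outcomes, top_k=1))
```

Only feature 0 survives selection in this toy case, and its coefficient is the mean of its activations on the two correct samples.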

---

## Q2: Relationship between correlation coefficient $r_i$ and steering coefficient $c_i$

**Concern:** Equation 2 relies only on SAE features and not on $r_i$ correlation coefficients. What is the exact relationship?

**Response:** The two coefficients serve **independent, complementary purposes**:

- **$r_i$ (Equation 1):** Feature **selection** — identifies which features correlate with task outcomes
- **$c_i$ (Equation 2):** Steering **magnitude** — determines how much to steer by averaging positive-sample activations

**Why independent calculation?** We empirically tested coupled approaches (e.g., $c_i \times r_i$) but found independent calculation performs better and aligns with SAE's linear architecture and the Linear Representation Hypothesis, enabling full pipeline automation without manual tuning.

---

## Q3: Figure 1 clarity

**Concern:** Figure 1 is confusing regarding the normal distributions and subtitle labels.

**Response:** We have expanded Figure 1's caption to clarify: red distributions show feature activations for unintended outputs, blue distributions show feature activations for intended outputs. Steering coefficients are computed as the center (mean) of the blue distributions.

---

## Q4: Training and evaluation data

**Question:** Are the steering coefficients computed on the same dataset as the evaluation?

**Response:** **No.** We use a standard train-validation-test split to avoid data leakage:
- **Training set:** Extract correlated features and compute steering coefficients
- **Validation set:** Filter features (CorrSteer-P only)
- **Test set:** Final evaluation (all reported results)

This is clarified in the pooling comparison table caption and Appendix A.1 (Implementation Details).
We thank the reviewer for the helpful feedback.

1. How is context information taken into account in our method?

Context is clearly taken into account, but through activations rather than through tokens. Specifically, the benchmark correctness label applies to the entire generated sequence, and the activated SAE features are max-pooled along the generated tokens only.

This works because, as the reviewer worries, the same token (e.g., “slow”) can be generated in different contexts, but in layers after layer 1 the attention blocks mix in the preceding context, so the hidden state does not depend on that token alone. That is, the SAE features that activate differ by context even when the token is the same.
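
A minimal sketch of the pooling step described above, assuming 0-indexed token positions and a per-token list of SAE activations; the function name `pool_generated` and the toy numbers are illustrative only. Prompt positions are dropped before pooling, and the `mode` argument makes the max versus mean choice explicit.

```python
def pool_generated(activations, prompt_len, mode="max"):
    """activations[t][i]: SAE activation of feature i at token t.
    Pools each feature over generated positions (t >= prompt_len) only."""
    gen = activations[prompt_len:]
    n_features = len(gen[0])
    cols = [[row[i] for row in gen] for i in range(n_features)]
    if mode == "max":
        return [max(c) for c in cols]
    if mode == "mean":
        return [sum(c) / len(c) for c in cols]
    raise ValueError(f"unknown pooling mode: {mode}")

# Toy sequence: 2 prompt tokens followed by 3 generated tokens, 2 features.
acts = [
    [9.0, 0.0],  # prompt token: ignored
    [9.0, 0.0],  # prompt token: ignored
    [0.0, 0.0],
    [0.0, 4.0],  # feature 1 fires on a single generated token
    [0.0, 0.0],
]
print(pool_generated(acts, 2, "max"))
print(pool_generated(acts, 2, "mean"))
```

In this toy case max-pooling keeps the single strong activation of feature 1, while mean-pooling dilutes it across the generated tokens, and the large prompt activations of feature 0 do not enter either result.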

2. Relationship between correlation and steering coefficient

Using the definition of $r_i$ at Line 126, we added clear definitions of the CorrSteer variants to Appendix A.2. At Line 140: define them mathematically, with explicit equations.

CorrSteer-S: top-k or max (across all layers)

CorrSteer-P: per-layer feature selection with pruning based on validation without steering

CorrSteer-A: per-layer feature top-k or max

3. Figure 1 is still unclear

In the Figure 1 plot, the quantities being correlated are the activation value on the x-axis and binary correctness on the y-axis, that is, unintended (0) and intended (1). For the feature shown, the red point cluster sits at small activations and the blue point cluster at large activations. The steering coefficient is placed at the mean of the positive (blue) distribution because we use the average activation of the positive samples as the steering coefficient. The green “Positively Correlation” label was meant to mark the positively correlated diagonal dotted line, but since, as the reviewer noted, it could be misread, we moved the title so that it lies along the diagonal.

Let us know if there are any further questions that we can clarify, otherwise, we would appreciate it if you would consider increasing your score.