FaithfulSAE Cameraready


<Faithful SAE Paper>

We are going to run an additional experiment for this paper.


Matched-feature case study: self-explanation, best and worst features, Eleuther embedding score.
That is:
Self-explanation pipeline -> top 10 matched pairs
Top activated windows
Eleuther embedding
This is what we are going to do.

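The implementation goes in the feat_match file.
1. During feature matching, find the 10 feature pairs with the highest matching scores.

These pairs should then, with high probability, represent the same feature.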
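
The pair-selection step could be sketched as follows. This is only an illustration under assumptions: the notes do not say how feat_match scores pairs, so the sketch assumes cosine similarity between decoder directions, and `top_matched_pairs`, `dec1`, and `dec2` are hypothetical names. Each SAE-1 feature is paired with its best SAE-2 partner, so the matching is not forced to be one-to-one.

```python
import numpy as np

def top_matched_pairs(dec1: np.ndarray, dec2: np.ndarray, k: int = 10):
    """Return the k highest-scoring (idx_sae1, idx_sae2, similarity) triples.

    dec1, dec2: [n_features, d_model] decoder directions (assumed layout).
    """
    # Normalize rows so that the dot product equals cosine similarity.
    a = dec1 / np.linalg.norm(dec1, axis=1, keepdims=True)
    b = dec2 / np.linalg.norm(dec2, axis=1, keepdims=True)
    sim = a @ b.T  # [n1, n2]

    # Best SAE-2 partner for every SAE-1 feature.
    best_j = sim.argmax(axis=1)
    best_score = sim[np.arange(sim.shape[0]), best_j]

    # Keep the k SAE-1 features whose best match scores highest.
    order = np.argsort(-best_score)[:k]
    return [(int(i), int(best_j[i]), float(best_score[i])) for i in order]
```

The returned indices and similarity can be written directly into the `indices` and `similarity` fields of the record described in these notes.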


Self-explanations are generated via steering, so refer to @cross_dataset_metrics.py.
For now, the feature information will be stored as a list of records like this:
{
  similarity: 0.56,
  eleuthers: [0.6, 0.6],
  self_explanations: list[str],
  indices: [5234, 3523]
}

Build the JSON appropriately along these lines.
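
A minimal sketch of such a record and its JSON serialization, using only the standard library. The class name `MatchedFeature` is hypothetical, and the reading that each list holds one entry per SAE (sae1 first, sae2 second) is an assumption; the notes only give the field names and example values.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MatchedFeature:  # hypothetical name for one matched feature pair
    similarity: float             # matching score of the pair
    eleuthers: list[float]        # embedding score per SAE (assumed order: sae1, sae2)
    self_explanations: list[str]  # generated explanations (assumed one per SAE)
    indices: list[int]            # feature index per SAE (assumed order: sae1, sae2)

records = [
    MatchedFeature(
        similarity=0.56,
        eleuthers=[0.6, 0.6],
        self_explanations=["<explanation from sae1>", "<explanation from sae2>"],
        indices=[5234, 3523],
    )
]

# Serialize the whole list; ensure_ascii=False keeps non-ASCII explanations readable.
payload = json.dumps([asdict(r) for r in records], indent=2, ensure_ascii=False)
```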

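Also add a function to @cross_dataset_metrics.py and call it. It should not run once per feature index. Instead, run it once per SAE (once for sae1 and once for sae2, two runs in total), collect only the activations of the 10 matched feature indices at that layer, and show them as windows: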
top_activated_windows(feature, window=30)
 
The ranking criteria are mean, density, and max: the top 10 windows for each, 30 in total.

{
  mean: List[ActExample]
  density: List
  max ...
}

For density, rank windows by the fraction of tokens in the window whose activation is nonzero, highest first.

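from pydantic import BaseModel

class ActExample(BaseModel):
    tokens: list[int]   # token ids in the window
    texts: list[str]    # decoded token strings
    acts: list[float]   # activation per token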
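
The window ranking described above might look roughly like this. It is a sketch under assumptions: windows are taken as non-overlapping 30-token chunks of a flat token stream (the notes do not say how windows are cut), the signature of `top_activated_windows` is extended with hypothetical arguments, `windows_for_features` is a hypothetical helper, and a plain dataclass stands in for the pydantic model so the block has no third-party dependency beyond NumPy.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ActExample:  # dataclass stand-in for the pydantic model in the notes
    tokens: list[int]
    texts: list[str]
    acts: list[float]

def top_activated_windows(acts, token_ids, token_texts, window=30, top_k=10):
    """Rank fixed-size windows of ONE feature's activations by mean, density, max."""
    n = (len(acts) // window) * window  # drop the ragged tail
    chunks = np.asarray(acts[:n], dtype=float).reshape(-1, window)

    scores = {
        "mean": chunks.mean(axis=1),
        "density": (chunks != 0).mean(axis=1),  # fraction of nonzero tokens
        "max": chunks.max(axis=1),
    }

    def build(w):
        s = slice(w * window, (w + 1) * window)
        return ActExample(
            tokens=[int(t) for t in token_ids[s]],
            texts=[str(t) for t in token_texts[s]],
            acts=chunks[w].tolist(),
        )

    result = {}
    for name, score in scores.items():
        order = np.argsort(-score, kind="stable")[:top_k]  # highest first
        result[name] = [build(int(w)) for w in order]
    return result

def windows_for_features(all_acts, feature_indices, token_ids, token_texts, window=30):
    """all_acts: [n_tokens, n_features], computed in a single pass for one SAE."""
    return {
        int(f): top_activated_windows(all_acts[:, f], token_ids, token_texts, window)
        for f in feature_indices
    }
```

Calling `windows_for_features` once with the sae1 activations and once with the sae2 activations matches the "two runs in total" requirement, since the expensive forward pass that produces `all_acts` happens once per SAE rather than once per feature.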


Likewise, eleuther() is called in the same way, once per SAE.

The Eleuther embedding metric is the following:


## Embedding based Scoring

Compares the [Text **embedding**](https://www.notion.so/Text-embedding-f592251a59af4e5b8a7a70c255546ef7?pvs=21) of the top- and least-activating samples with that of the explanation text, using [Cosine Similarity](https://www.notion.so/Cosine-Similarity-0686a915d71749b6aecbe611fd0f3efd?pvs=21).

It serves as a good metric for evaluating feature performance and improving the linear mismatch between low and high activations.

A.4.7 DISAGREEMENT BETWEEN SCORES
In this section, we will discuss how different types of scores might evaluate explanations differently.
In particular, we will focus on an interpretation with high fuzzing score but low detection score, high
fuzzing score but low simulation score, high simulation score but low embedding score and high
embedding score and low simulation score.
For instance, feature 2 of layer 8 of the 131k latent SAE model trained on the residual stream of
Gemma 2 9b has a fuzzing score of 0.9 but a detection score of only 0.43. The interpretation given
by our pipeline is “Verbs that link a subject to additional information, often in a formal or descriptive tone.” Most of the activations of this latent are on sentences like “This <is needed> because
Magento (...)” or “For instance, this< is> technically correct syntax”, where <> represent active
tokens. In both these examples, the model correctly identifies the highlighted tokens as active during fuzzing, but incorrectly identifies the context as non-active during detection. On the other hand,
detection incorrectly identifies sentences like “pay for things that would prevent larger issues down
the road is better in the long run.” or “Neuroscientist Jack Gallant calls the research a technologic
tour de force and says the ultimate decoder would provide vivid” as active. This explanation is too
vague on the activating context, leading to a low detection score, but specific enough in the types of
tokens that are active, leading to a high fuzz score.

Table A2: Pearson correlation computed over 800 different latent scores

| | Fuzzing | Detection | Simulation | Embedding | Surprisal |
| --- | --- | --- | --- | --- | --- |
| Fuzzing | 1 | 0.74 | 0.74 | 0.42 | 0.32 |
| Detection | | 1 | 0.46 | 0.70 | 0.62 |
| Simulation | | | 1 | 0.30 | 0.14 |
| Embedding | | | | 1 | 0.79 |
| Surprisal | | | | | 1 |

Feature 281 of the same layer has a fuzzing score of 0.97, but a simulation score of only 0.19. Its
interpretation is “URLs or hyperlinks containing query parameters, indicating a request for specific
data or actions on a web page.” The simulation score is lower than the fuzzing score because the
model has to decide which parts of the URL the latent is active on, while the fuzzing scorer either
is shown links that are highlighted which it will say are active or non-links that are highlighted and
are, most likely, not active.
Feature 8 of layer 24 has a simulation score of 0.73, but an embedding score of only 0.43. Its
interpretation is “Prevalence of logical operators and conjunctions in text, including simple addition
and conjunction in various contexts, as well as indicators of contrasting or additional information,
often used for comparison or to provide supplementary details”. While doing simulation, the model
correctly identifies that in the sentence “the cost of the unit itself, plus installation and construction
costs.”, the token “plus” is active but all the others aren’t. The same is true for sentences like “Plus
my potato bread I made on the ANZAC Day” and “Cliff Chiang on reintroducing Orion in Wonder
Woman! Plus interviews”. On the other hand, these sentences don’t have high embedding similarity
with the explanation.
Feature 10 of layer 16 has the opposite situation happening, with an embedding score of 0.74 and
a simulation score of only 0.04. Its interpretation is “Verbs, prepositions, and adverbs that connect
clauses or indicate direction, movement, or progression, often in a sequence of actions or events”.
In sentences like “right, three months <to close down>. There was dead silence when the message
was read. Everybody waited for Mr. Smith to speak. Mr. Gingham” the simulator model incorrectly
identifies “ waited” and the second “ to” as active, and misses the real activation in “to close down”.
The same happens in “This is <going> to <be> one <swinging> party. Especially if the guests survive
Uncle Wilhelm. I love weddings. I love going to them, I love being in”, where the simulator model
identified “love being in” as well as “love going to” as being active. The embedding model correctly
identified that most activating sentences have the mention of motion.
With these examples, we hope to demonstrate that these generated explanations are not perfect,
but that we have an easier understanding of their flaws by looking at which cases different scores
disagree on.

3.3.4 EMBEDDING
Classifying between active and non-active contexts given a certain interpretation can also be seen
as using interpretations of latents as “queries” that should be able to retrieve relevant “documents”,
contexts where the latent is active, between non-relevant “documents”, non-activating contexts. This
way, we take a selection of activating and non-activating contexts that are embedded by an encoding
transformer, and the similarity between the query and the documents is used as a classifier to distinguish between activating and non-activating contexts, and the score is given by the AUROC.
If the encoding model is small enough - we used a 400M parameter model - this technique is the
fastest and opens up the possibility to evaluate a larger fraction of the activation distribution. We
have seen that using a larger embedding model - 7B parameter model - didn’t significantly improve
the scores, see fig A3, although we believe that this approach was under-investigated. Details on the
prompt, on the embedding model and on the way to compute the score in Appendix A.4.4.
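
The retrieval view of embedding scoring can be illustrated with a small standard-library sketch. The embeddings are passed in as plain vectors; which embedding model produces them, and the prompt used, are left out because those details live in the referenced Appendix A.4.4. The function names `cosine`, `auroc`, and `embedding_score` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def embedding_score(query_emb, active_embs, inactive_embs):
    """Use query-document similarity as the classifier score and report AUROC."""
    pos = [cosine(query_emb, d) for d in active_embs]    # activating contexts
    neg = [cosine(query_emb, d) for d in inactive_embs]  # non-activating contexts
    return auroc(pos, neg)
```

Here `query_emb` is the embedding of the explanation text, and a score of 0.5 means the explanation does no better than chance at separating activating from non-activating contexts.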


Self-Explanation: doing this step first is also important.


A clever and cheap method that puts a placeholder X in the prompt and adds a steering vector to the language model so that it generates a self-explanation of the feature.

As the quality of explanations varies depending on the insertion vector scale, we combine **self-similarity** and **entropy** metrics to automatically search for the optimal scale.

Verification shows similar or superior interpretation accuracy compared to the [LLM Neuron explainer](https://www.notion.so/LLM-Neuron-explainer-d089a27e16834eacbe7044f9c0484688?pvs=21) method.

### Limitation

Model inherent bias significantly affects the quality of descriptions.

### Future Works

As expected, for [Single-token feature](https://www.notion.so/Single-token-feature-18bc3c96247d80bab638c74ab17b85ec?pvs=21) (one kind of Activating Tokens), it cannot generate descriptions that explain the token itself. However, this could actually be beneficial as it helps filter out such context-independent tokens.

Successful explanations show a certain threshold of cosine similarity (Self-Similarity) between the final layer residual vector and the original SAE feature, which can be used as a Failure Detection metric for SAE features.
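
The failure-detection idea can be sketched as a simple threshold check on self-similarity. The threshold value of 0.3 is purely a placeholder, not a number from the notes or the referenced post, and the function names are hypothetical.

```python
import math

def self_similarity(final_resid, feature_dir):
    """Cosine similarity between the final-layer residual vector and the SAE feature."""
    dot = sum(a * b for a, b in zip(final_resid, feature_dir))
    norm_r = math.sqrt(sum(a * a for a in final_resid))
    norm_f = math.sqrt(sum(b * b for b in feature_dir))
    return dot / (norm_r * norm_f)

def explanation_failed(final_resid, feature_dir, threshold=0.3):
    """Flag a feature whose self-similarity falls below the (placeholder) threshold."""
    return self_similarity(final_resid, feature_dir) < threshold
```

In practice the threshold would have to be calibrated on features whose explanations are known to be good or bad.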

<SAE Self Explanation LessWrong>


Implement it with reference to this.