<Faithful SAE Paper>

Additional experiment for this paper: a matched-feature case study (self-explanation for the best and worst features, plus the Eleuther embedding score). The pipeline is: self-explanation → top 10 matched pairs → top activated windows → Eleuther embedding.

Implement it in the `feat_match` file.

1. During feature matching, find the 10 feature pairs with the highest scores. These pairs most likely represent the same feature.

Self-explanations are generated by steering, so refer to @cross_dataset_metrics.py. For now, store the feature information as a list of records shaped like `{ similarity: 0.56, eleuthers: [0.6, 0.6], self_explanations: list[str], indices: [5234, 3523] }`, and write them out as JSON in a suitable form.

Also add a function to @cross_dataset_metrics.py and call it from there. Do not run it once per feature index. Run it once per SAE, so twice in total (sae1 and sae2), collect activations only for that layer's 10 matched indices, and present them as windows: `top_activated_windows(feature, window=30)`. Rank the windows by three criteria (mean, density, max) and keep the top 10 for each, 30 in total: `{ mean: List[ActExample], density: List, max: ... }`. Density is the fraction of tokens in the window with a nonzero activation. Each example follows `class ActExample(BaseModel): tokens: [int], texts: [str], acts: [float]`.

Likewise, call `eleuther()` once per SAE. The Eleuther embedding metric is described below.

## Embedding based Scoring

Compares the [Text **embedding**](https://www.notion.so/Text-embedding-f592251a59af4e5b8a7a70c255546ef7?pvs=21) between the top/least activation samples with the explanation text based on [Cosine Similarity](https://www.notion.so/Cosine-Similarity-0686a915d71749b6aecbe611fd0f3efd?pvs=21). It serves as a good metric for evaluating feature performance and improving the linear mismatch between low and high activations.

A.4.7 DISAGREEMENT BETWEEN SCORES

In this section, we will discuss how different types of scores might evaluate explanations differently. In particular, we will focus on interpretations with a high fuzzing score but low detection score, a high fuzzing score but low simulation score, a high simulation score but low embedding score, and a high embedding score but low simulation score.
For instance, feature 2 of layer 8 of the 131k latent SAE model trained on the residual stream of Gemma 2 9b has a fuzzing score of 0.9 but a detection score of only 0.43. The interpretation given by our pipeline is “Verbs that link a subject to additional information, often in a formal or descriptive tone.” Most of the activations of this latent are on sentences like “This <is needed> because Magento (...)” or “For instance, this< is> technically correct syntax”, where <> represent active tokens. In both these examples, the model correctly identifies the highlighted tokens as active during fuzzing, but incorrectly identifies the context as non-active during detection. On the other hand, detection incorrectly identifies sentences like “pay for things that would prevent larger issues down the road is better in the long run.” or “Neuroscientist Jack Gallant calls the research a technologic tour de force and says the ultimate decoder would provide vivid” as active. This explanation is too vague on the activating context, leading to a low detection score, but specific enough in the types of tokens that are active, leading to a high fuzzing score.

Table A2: Pearson correlation computed over 800 different latent scores

| | Fuzzing | Detection | Simulation | Embedding | Surprisal |
|---|---|---|---|---|---|
| Fuzzing | 1 | 0.74 | 0.74 | 0.42 | 0.32 |
| Detection | | 1 | 0.46 | 0.70 | 0.62 |
| Simulation | | | 1 | 0.30 | 0.14 |
| Embedding | | | | 1 | 0.79 |
| Surprisal | | | | | 1 |

Feature 281 of the same layer has a fuzzing score of 0.97, but a simulation score of only 0.19. Its interpretation is “URLs or hyperlinks containing query parameters, indicating a request for specific data or actions on a web page.” The simulation score is lower than the fuzzing score because the model has to decide which parts of the URL the latent is active on, while the fuzzing scorer is shown either links that are highlighted, which it will say are active, or non-links that are highlighted and are, most likely, not active.
Feature 8 of layer 24 has a simulation score of 0.73, but an embedding score of only 0.43. Its interpretation is “Prevalence of logical operators and conjunctions in text, including simple addition and conjunction in various contexts, as well as indicators of contrasting or additional information, often used for comparison or to provide supplementary details”. While doing simulation, the model correctly identifies that in the sentence “the cost of the unit itself, plus installation and construction costs.”, the token “plus” is active but all the others aren’t. The same is true for sentences like “Plus my potato bread I made on the ANZAC Day” and “Cliff Chiang on reintroducing Orion in Wonder Woman! Plus interviews”. On the other hand, these sentences don’t have high embedding similarity with the explanation.

Feature 10 of layer 16 shows the opposite situation, with an embedding score of 0.74 and a simulation score of only 0.04. Its interpretation is “Verbs, prepositions, and adverbs that connect clauses or indicate direction, movement, or progression, often in a sequence of actions or events”. In sentences like “right, three months <to close down>. There was dead silence when the message was read. Everybody waited for Mr. Smith to speak. Mr. Gingham” the simulator model incorrectly identifies “ waited” and the second “ to” as active, and misses the real activation in “to close down”. The same happens in “This is <going> to <be> one <swinging> party. Especially if the guests survive Uncle Wilhelm. I love weddings. I love going to them, I love being in”, where the simulator model identified “love being in” as well as “love going to” as being active. The embedding model correctly identified that most activating sentences mention motion.
With these examples, we hope to demonstrate that these generated explanations are not perfect, but that looking at the cases where different scores disagree makes their flaws easier to understand.

3.3.4 EMBEDDING

Classifying between active and non-active contexts given a certain interpretation can also be seen as using interpretations of latents as “queries” that should be able to retrieve relevant “documents” (contexts where the latent is active) from among non-relevant “documents” (non-activating contexts). We take a selection of activating and non-activating contexts, embed them with an encoding transformer, and use the similarity between the query and each document as a classifier to distinguish activating from non-activating contexts; the score is given by the AUROC. If the encoding model is small enough - we used a 400M parameter model - this technique is the fastest and opens up the possibility to evaluate a larger fraction of the activation distribution. We have seen that using a larger embedding model - 7B parameter model - didn’t significantly improve the scores, see fig A3, although we believe that this approach was under-investigated. Details on the prompt, on the embedding model and on the way to compute the score are in Appendix A.4.4.

Self-Explanation

Running this step first is also important. It is a brilliant and cheap method that puts a placeholder X in the prompt and adds a steering vector to the language model to generate a self-explanation. As the quality of explanations varies depending on the insertion vector scale, we combine **self-similarity** and **entropy** metrics to automatically search for the optimal scale. Verification shows similar or superior interpretation accuracy compared to the [LLM Neuron explainer](https://www.notion.so/LLM-Neuron-explainer-d089a27e16834eacbe7044f9c0484688?pvs=21) method.

### Limitation

The model's inherent bias significantly affects the quality of descriptions.
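For the `eleuther()` call in the brief, the embedding score of Section 3.3.4 above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the vectors stand in for outputs of an embedding model, and the names `cosine`, `auroc`, and `embedding_score` are hypothetical. The explanation embedding acts as the query, each context embedding as a document, and the AUROC measures how well cosine similarity separates activating from non-activating contexts.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def embedding_score(
    explanation_vec: Sequence[float],
    context_vecs: List[Sequence[float]],
    is_active: Sequence[bool],
) -> float:
    """Score one explanation: similarity to each context is the classifier output."""
    sims = [cosine(explanation_vec, ctx) for ctx in context_vecs]
    return auroc(sims, is_active)


# Toy 2-d "embeddings" (assumed values, for illustration only).
query = [1.0, 0.0]
contexts = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [0.1, 1.0]]
score = embedding_score(query, contexts, [True, True, False, False])
```

In the experiment this would run once per SAE, giving one score per matched feature, which is what the `eleuthers` pair in the JSON record would hold.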
### Future Works

As expected, for a [Single-token feature](https://www.notion.so/Single-token-feature-18bc3c96247d80bab638c74ab17b85ec?pvs=21) (one kind of Activating Tokens), the method cannot generate a description that explains the token itself. However, this could actually be beneficial, as it helps filter out such context-independent tokens. Successful explanations reach a certain threshold of cosine similarity (Self-Similarity) between the final layer residual vector and the original SAE feature, which can be used as a Failure Detection metric for SAE features.

<SAE Self Explanation LessWrong> Use this as the reference for the implementation.
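The data shapes and window ranking from the brief at the top of this note can be sketched as below. This is a sketch under stated assumptions, not the final `cross_dataset_metrics.py` code: it uses stdlib dataclasses as a stand-in for the pydantic `BaseModel` in the brief, it splits the token stream into non-overlapping windows, and `MatchedPair` plus the extra `tokens`, `texts`, `acts`, and `k` parameters are hypothetical additions to the `top_activated_windows(feature, window=30)` signature.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Sequence


@dataclass
class ActExample:
    """One context window; stdlib stand-in for the pydantic model in the brief."""
    tokens: List[int]
    texts: List[str]
    acts: List[float]


@dataclass
class MatchedPair:
    """One matched feature pair (hypothetical name), mirroring the JSON record."""
    similarity: float
    eleuthers: List[float]
    self_explanations: List[str]
    indices: List[int]


def window_stats(acts: Sequence[float]) -> Dict[str, float]:
    """Mean, density (share of nonzero tokens), and max of one window."""
    n = len(acts)
    return {
        "mean": sum(acts) / n,
        "density": sum(1 for a in acts if a != 0) / n,
        "max": max(acts),
    }


def top_activated_windows(
    tokens: Sequence[int],
    texts: Sequence[str],
    acts: Sequence[float],
    window: int = 30,
    k: int = 10,
) -> Dict[str, List[ActExample]]:
    """Cut one feature's activations into windows and keep the top k per criterion."""
    examples = []
    for start in range(0, len(acts), window):
        end = start + window
        examples.append(
            ActExample(list(tokens[start:end]), list(texts[start:end]), list(acts[start:end]))
        )
    ranked = {}
    for criterion in ("mean", "density", "max"):
        ranked[criterion] = sorted(
            examples,
            key=lambda ex, c=criterion: window_stats(ex.acts)[c],
            reverse=True,
        )[:k]
    return ranked


# Toy run: 12 tokens, window of 3, top 2 per criterion (assumed values).
toy_acts = [0.0, 0.0, 9.0, 1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0]
toy_tokens = list(range(12))
toy_texts = [f"t{i}" for i in toy_tokens]
result = top_activated_windows(toy_tokens, toy_texts, toy_acts, window=3, k=2)

# Example values taken from the record in the brief.
pair = MatchedPair(0.56, [0.6, 0.6], ["", ""], [5234, 3523])
record_json = json.dumps([asdict(pair)])
```

With `window=30` and `k=10` this yields the 30 examples per feature that the brief asks for. In the real pipeline the activations for all 10 matched indices would be collected in a single pass per SAE and then ranked per feature.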
FaithfulSAE Cameraready
Date: 2025 Jul 1 0:0 → 2025 Jul 2 0:0
Created by: Seonglae Cho
Created time: 2025 Jul 1 22:49
Last edited by: Seonglae Cho
Last edited time: 2025 Jul 1 22:51