Write vector
A vector written to the residual stream by a node (e.g., an attention head or MLP layer)
Not just a convenient post-hoc description; in some fundamental sense, a write vector is composed of features
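A minimal sketch of this picture, assuming features are linear directions in the residual stream (dimensions, feature directions, and activations below are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16                                  # residual stream width (hypothetical)
features = rng.normal(size=(4, d_model))      # hypothetical feature directions

# A node's write vector as a sparse combination of feature directions:
activations = np.array([0.0, 2.5, 0.0, 1.0])  # most features inactive
write_vector = activations @ features

# Each node (attention head, MLP layer) adds its write vector into the
# residual stream, which accumulates contributions across the layer stack.
residual_stream = np.zeros(d_model)
residual_stream += write_vector
```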
Feature Learning Notion
AI Feature Metrics
Safety relevant feature
Circuits Updates - July 2023
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research that we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
https://transformer-circuits.pub/2023/july-update/index.html#safety-features
Removing features had a greater impact on the model than amplifying them, suggesting that a feature's influence may saturate at high activations
Circuits Updates - April 2024
https://transformer-circuits.pub/2024/april-update/index.html#ablation-exps
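A toy sketch of this kind of intervention (not the experiment from the update itself), assuming a feature is a linear direction whose component in a write vector can be rescaled:

```python
import numpy as np

def intervene(write_vector, feature_dir, scale):
    """Rescale a write vector's component along one feature direction.

    scale = 0.0 ablates the feature; scale > 1.0 amplifies it.
    """
    unit = feature_dir / np.linalg.norm(feature_dir)
    component = (write_vector @ unit) * unit
    return write_vector - component + scale * component

rng = np.random.default_rng(0)
v = rng.normal(size=16)                 # a write vector (hypothetical)
f = rng.normal(size=16)                 # a feature direction (hypothetical)

ablated = intervene(v, f, scale=0.0)    # remove the feature entirely
amplified = intervene(v, f, scale=4.0)  # push it far above its usual range

# The cited finding: ablation changed model behavior more than amplification,
# consistent with a feature's influence saturating at high activations.
```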
Relational composition
How neural networks combine feature vectors to represent complex relationships. Neural nets can use vector addition for unordered bundles of features (addition is commutative, so it cannot encode order), vector differences for grammatical relationships, outer products to compose complex structures and interactions, and positional encodings for referencing items by ID.
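A toy illustration of these four mechanisms; the random vectors below are purely a sketch, standing in for learned feature directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
red, apple = rng.normal(size=d), rng.normal(size=d)

# Vector addition: a commutative bundle of features (order-insensitive).
bundle = red + apple

# Vector difference: an offset encoding a relation, as in word-analogy
# arithmetic (king - man + woman ≈ queen).
relation = red - apple

# Outer product: binds one vector to another so composites keep their
# structure instead of collapsing into an unordered bag.
binding = np.outer(red, apple)          # shape (d, d)

# Positional encodings: tag each filler with a position so it can later
# be referenced by its ID/position rather than by its content.
positions = np.stack([np.sin((k + 1) * np.arange(d)) for k in range(2)])
tagged = np.stack([red, apple]) + positions   # position-tagged fillers
```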
Metric
Explainability — holisticai documentation
Despite the remarkable recent evolution in prediction performance by artificial intelligence (AI) models, they are often deemed “black boxes”, i.e., models whose prediction mechanisms cannot be understood simply from their parameters. Explainability in machine learning refers to the ability to understand and articulate how models arrive at their predictions. This is crucial for promoting transparency, trust, and accountability in AI systems. It helps in verifying model behavior, refining models, debugging unexpected behavior, and communicating model decisions to stakeholders.
https://holisticai.readthedocs.io/en/latest/getting_started/explainability/index.html

Seonglae Cho