AI Feature

Creator
Seonglae Cho
Created
2024 Apr 6 13:13
Edited
2025 Dec 27 23:56

Write vector

A feature is not just a convenient post-hoc description; in some fundamental sense, models are composed of features. Concretely, a feature is a vector written to the residual stream by a node.
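A minimal numpy sketch of this framing, under toy assumptions: the residual stream is a single vector, a "node" is anything that adds a scaled feature direction to it, and the feature can be read back with a dot product. The dimensions and vectors are illustrative, not from any real model.

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)

# Hypothetical feature direction (unit vector) that a node writes to the stream
feature = rng.normal(size=d_model)
feature /= np.linalg.norm(feature)

residual = np.zeros(d_model)                # residual stream before the node
activation = 3.0                            # how strongly the node fires
residual = residual + activation * feature  # the node writes its feature vector

# Read the feature's activation back off the stream with a dot product
strength = residual @ feature
print(round(strength, 2))  # 3.0
```

Because later nodes read the stream the same way (via dot products with their own directions), writing a feature vector is how one component communicates a feature to the rest of the model.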
Feature Learning Notion
 
 
AI Feature Metrics
 
 

Safety-relevant features

Circuits Updates - July 2023
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
Removing features had a greater impact on the model than amplifying them, suggesting that a feature's influence may saturate at high activations.
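The two interventions can be sketched as rescaling an activation's component along a feature direction: scale 0 ablates the feature, scale > 1 amplifies it. This is a toy illustration of the intervention itself, not a reproduction of the reported experiment; the vectors are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

# Hypothetical unit feature direction
feature = rng.normal(size=d_model)
feature /= np.linalg.norm(feature)

# Activation containing the feature plus unrelated structure
other = rng.normal(size=d_model)
act = 2.0 * feature + other

def edit_feature(x, f, scale):
    """Rescale the component of x along unit direction f by `scale`."""
    coef = x @ f
    return x + (scale - 1.0) * coef * f

ablated = edit_feature(act, feature, 0.0)    # remove the feature entirely
amplified = edit_feature(act, feature, 3.0)  # triple its activation

print(ablated @ feature, amplified @ feature)
```

Ablation zeroes the feature's readout while leaving the orthogonal remainder untouched, which is why its downstream effect can differ sharply from amplification even though both are linear edits.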
Circuits Updates - April 2024

Relational composition

How neural networks combine feature vectors to represent complex relationships: vector addition for ordered relationships, vector differences for grammatical relationships, outer products to compose complex structures and interactions, and positional encodings for ID referencing.
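The four composition mechanisms listed above can be sketched in a few lines of numpy. The vectors here are random stand-ins, and the rank-1 unbinding step is an illustrative assumption, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
a, b = rng.normal(size=d), rng.normal(size=d)

# 1. Vector addition: an unordered combination of two features
combined = a + b

# 2. Vector difference: a relational direction (e.g. plural - singular);
#    applying the relation to `a` recovers `b`
relation = b - a
assert np.allclose(a + relation, b)

# 3. Outer product: a rank-1 binding of two features into a matrix;
#    contracting with b (normalized by ||b||^2) recovers a
binding = np.outer(a, b)
recovered = binding @ b / (b @ b)
assert np.allclose(recovered, a)

# 4. Positional encodings as IDs: the same feature tagged with
#    different position vectors refers to different instances
pos0, pos1 = rng.normal(size=d), rng.normal(size=d)
tagged = [a + pos0, a + pos1]
```

The outer-product case is the interesting one: unlike addition, it preserves which feature filled which role, at the cost of a matrix-sized (rather than vector-sized) representation.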
Metric
Explainability — holisticai documentation
Despite the remarkable recent evolution in prediction performance by artificial intelligence (AI) models, they are often deemed "black boxes", i.e., models whose prediction mechanisms cannot be understood simply from their parameters. Explainability in machine learning refers to the ability to understand and articulate how models arrive at their predictions. This is crucial for promoting transparency, trust, and accountability in AI systems. It helps in verifying model behavior, refining models, debugging unexpected behavior, and communicating model decisions to stakeholders.
 
 
