Write vector
A vector written to the residual stream by a node (e.g., an attention head or MLP layer)
Not just a convenient post-hoc description; in some fundamental sense, a write vector is composed of features
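A minimal sketch of this picture, assuming features are linear directions in the residual stream (dimensions, feature directions, and activations below are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16                                  # residual stream width (hypothetical)
features = rng.normal(size=(4, d_model))      # hypothetical feature directions

# A node's write vector as a sparse combination of feature directions:
activations = np.array([0.0, 2.5, 0.0, 1.0])  # most features inactive
write_vector = activations @ features

# Each node (attention head, MLP layer) adds its write vector into the
# residual stream, which accumulates contributions across the layer stack.
residual_stream = np.zeros(d_model)
residual_stream += write_vector
```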
Feature Learning Notion
AI Feature Metrics
Safety relevant feature
Circuits Updates - July 2023
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research that we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
https://transformer-circuits.pub/2023/july-update/index.html#safety-features
Removing features had a greater impact on the model than amplifying them, suggesting that a feature's influence may saturate at high activations
Circuits Updates - April 2024
https://transformer-circuits.pub/2024/april-update/index.html#ablation-exps
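A toy sketch of this kind of intervention (not the experiment from the update itself), assuming a feature is a linear direction whose component in a write vector can be rescaled:

```python
import numpy as np

def intervene(write_vector, feature_dir, scale):
    """Rescale a write vector's component along one feature direction.

    scale = 0.0 ablates the feature; scale > 1.0 amplifies it.
    """
    unit = feature_dir / np.linalg.norm(feature_dir)
    component = (write_vector @ unit) * unit
    return write_vector - component + scale * component

rng = np.random.default_rng(0)
v = rng.normal(size=16)                 # a write vector (hypothetical)
f = rng.normal(size=16)                 # a feature direction (hypothetical)

ablated = intervene(v, f, scale=0.0)    # remove the feature entirely
amplified = intervene(v, f, scale=4.0)  # push it far above its usual range

# The cited finding: ablation changed model behavior more than amplification,
# consistent with a feature's influence saturating at high activations.
```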
Relational composition
How neural networks combine feature vectors to represent complex relationships. Neural nets can use vector addition for unordered bundles of features (addition is commutative, so it cannot encode order), vector differences for grammatical relationships, outer products to compose complex structures and interactions, and positional encodings for referencing items by ID.
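A toy illustration of these four mechanisms; the random vectors below are purely a sketch, standing in for learned feature directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
red, apple = rng.normal(size=d), rng.normal(size=d)

# Vector addition: a commutative bundle of features (order-insensitive).
bundle = red + apple

# Vector difference: an offset encoding a relation, as in word-analogy
# arithmetic (king - man + woman ≈ queen).
relation = red - apple

# Outer product: binds one vector to another so composites keep their
# structure instead of collapsing into an unordered bag.
binding = np.outer(red, apple)          # shape (d, d)

# Positional encodings: tag each filler with a position so it can later
# be referenced by its ID/position rather than by its content.
positions = np.stack([np.sin((k + 1) * np.arange(d)) for k in range(2)])
tagged = np.stack([red, apple]) + positions   # position-tagged fillers
```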
Metric
Explainability — holisticai documentation
Despite the remarkable recent evolution in prediction performance by artificial intelligence (AI) models, they are often deemed “black boxes”, i.e., models whose prediction mechanisms cannot be understood simply from their parameters. Explainability in machine learning refers to the ability to understand and articulate how models arrive at their predictions. This is crucial for promoting transparency, trust, and accountability in AI systems. It helps in verifying model behavior, refining models, debugging unexpected behavior, and communicating model decisions to stakeholders.
https://holisticai.readthedocs.io/en/latest/getting_started/explainability/index.html

Seonglae Cho