Explainable AI

Creator

Creator

Seonglae Cho

Created

Created

2023 Mar 4 8:38

Editor

Editor

Seonglae Cho

Edited

Edited

2024 Oct 24 23:9

Refs

Refs

Reversing Transformer

XAI

Explainability

Within the black box AI model, attempting to understand the decision-making process

Explainable AI notion

Interpretable AI

AI Class Activation

Indirect object identification

Neural Turing Machines

Neural
Turing Machine (2014 Google)

https://arxiv.org/pdf/1410.5401.pdf

Circuit, Superposition, Universality (2020 OpenAI)

Thread: Circuits

What can we learn if we invest heavily in reverse engineering a single neural network?

https://distill.pub/2020/circuits/

Thread: Circuits

Residual Stream (2021 Anthropic)

A Mathematical Framework for Transformer Circuits

Transformer language models are an emerging technology that is gaining increasingly broad real-world use, for example in systems like GPT-3 , LaMDA , Codex , Meena , Gopher , and similar models. However, as these models scale, their open-endedness and high capacity creates an increasing scope for unexpected and sometimes harmful behaviors. Even years after a large model is trained, both creators and users routinely discover model capabilities – including problematic behaviors – they were previously unaware of.

A Mathematical Framework for Transformer Circuits

https://transformer-circuits.pub/2021/framework/index.html

Circuit analysis & Grokking (2022 OpenAI)

https://arxiv.org/pdf/2201.02177.pdf

https://arxiv.org/pdf/2207.13243.pdf

https://arxiv.org/pdf/2211.00593.pdf

Neuron Analysis & FSM (2023 Anthropic)

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.Browse A/1 Features →Browse All Features →

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

https://transformer-circuits.pub/2023/monosemantic-features/index.html

AutoInterpretation Finds Sparse Coding Beats Alternatives — LessWrong

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort …

AutoInterpretation Finds Sparse Coding Beats Alternatives — LessWrong

https://www.lesswrong.com/posts/ursraZGcpfMjCXtnn/autointerpretation-finds-sparse-coding-beats-alternatives

AutoInterpretation Finds Sparse Coding Beats Alternatives — LessWrong

Wiki

Explainable artificial intelligence

Explainable AI (XAI), often overlapping with Interpretable AI, or Explainable Machine Learning (XML), either refers to an artificial intelligence (AI) system over which it is possible for humans to retain intellectual oversight, or refers to the methods to achieve this.[1][2] The main focus is usually on the reasoning behind the decisions or predictions made by the AI[3] which are made more understandable and transparent.[4] XAI counters the "black box" tendency of machine learning, where even the AI's designers cannot explain why it arrived at a specific decision.[5][6]

Explainable artificial intelligence

https://en.wikipedia.org/wiki/Explainable_artificial_intelligence

Strategies

Toy Models of Superposition

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?

Toy Models of Superposition

https://transformer-circuits.pub/2022/toy_model/index.html#strategic

Backlinks

Reasoning Model Abstractive Summarization FeatUp Attention Mechanism

Recommendations

//////