AI Neural Circuit history

Creator

Seonglae Cho

Created

2024 Oct 14 11:19

Editor

Seonglae Cho

Edited

2025 Feb 4 17:13

Refs

An Introduction to Circuits (OpenAI 2020) Chris Olah

OpenAI claims

AI Circuit, as restated from the paper above

Universality Hypothesis

A Mathematical Framework for Transformer Circuits (Anthropic 2021) Nelson Elhage

Toward Transparent AI (2022 July) Tilman Rauker

Interpretability in the Wild (2022 Nov) Kevin Wang

Identify all previous names in the sentence (Mary, John, John).

Remove all names that are duplicated (in the example above: John).

Output the remaining name (Mary).
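
The three steps above can be sketched in ordinary Python. This is only an illustration of the human-readable algorithm, not a description of how GPT-2 small computes it internally. The function name `indirect_object` and the assumption that the names have already been extracted from the sentence are hypothetical, introduced for this example.

```python
from collections import Counter


def indirect_object(previous_names):
    """Return the one name that is not duplicated among the previous names.

    `previous_names` is assumed to be the list of names already found in
    the sentence, in order (hypothetical input format for this sketch).
    """
    # Step 1: identify all previous names and count their occurrences.
    counts = Counter(previous_names)

    # Step 2: remove every name that is duplicated.
    remaining = [name for name in previous_names if counts[name] == 1]

    # Step 3: output the remaining name; the task assumes exactly one is left.
    if len(remaining) != 1:
        raise ValueError("expected exactly one non-duplicated name")
    return remaining[0]


if __name__ == "__main__":
    print(indirect_object(["Mary", "John", "John"]))
```

With the names from the example (Mary, John, John), the duplicated name John is removed and Mary is returned.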

GPT-2 circuit analysis

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small — AI Alignment Forum

To learn more about this work, check out the paper. We assume general familiarity with transformer circuits. …

https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object

https://arxiv.org/pdf/2211.00593.pdf

Anthropic

Transformer Circuits Thread

Can we reverse engineer transformer language models into human-understandable computer programs? Inspired by the Distill Circuits Thread, we're going to try.

https://transformer-circuits.pub/

OpenAI

Thread: Circuits

What can we learn if we invest heavily in reverse engineering a single neural network?

https://distill.pub/2020/circuits/

One-layer skip trigram

One-layer transformers aren’t equivalent to a set of skip-trigrams — LessWrong

(thanks to Tao Lin and Ryan Greenblatt for pointing this out, and to Arthur Conmy, Jenny Nitishinskaya, Thomas Huck, Neel Nanda, and Lawrence Chan, B…

https://www.lesswrong.com/posts/b5HNYh9ne5vEkX5ag/one-layer-transformers-aren-t-equivalent-to-a-set-of-skip
