Claude 3 Sonnet

Creator

Created

2024 Mar 5 13:23

Editor

Edited

2024 Dec 1 15:22

Refs

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model. (For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper. It is the finetuned model, not the base pretrained model, although our method also works on the base model.)

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#assessing-tour-influence

Feature Browser

https://transformer-circuits.pub/2024/scaling-monosemanticity/features/index.html

Backlinks

Monosemanticity
