Claude 3 Sonnet

Creator: Seonglae Cho
Created: 2024 Mar 5 13:23
Editor: Seonglae Cho
Edited: 2024 Dec 1 15:22
Refs
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model.
Note: For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper. It is the finetuned model, not the base pretrained model (although our method also works on the base model).
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#assessing-tour-influence

Feature Browser
https://transformer-circuits.pub/2024/scaling-monosemanticity/features/index.html
 
 

Backlinks

Monosemanticity

Copyright Seonglae Cho