There has been almost no research on which dataset should be used to train an SAE.

- The more important point is that feature inspection describes the model, and the main cause is that there are many ways to combine feature bases to explain an LLM representation.

Only consider registering if you are interested:
- Harryn Oh harryn.oh.21@ucl.ac.uk
- Donghyun Lee donghyun.lee.21@ucl.ac.uk
- Luis Eduardo Rodrigues Vieira luis.vieira.21@ucl.ac.uk
- Andrew Bermingham andrew.bermingham.24@ucl.ac.uk
- Ziad El Sayed ziad.sayed.24@ucl.ac.uk
- fake feature
- downstream - recon acc
An SAE simply finds some arbitrary linear combination; it is not mathematically interpretable. Because it searches over many possible bases, the result is not consistent. Features may stay flexible if no optimized hyperparameters can be found that recover the true feature set. Correlation among features might make coefficients uninterpretable. L1 regularization might pick a random feature from a correlated group. Explaining the model ≠ explaining the data. Model inspection only provides information about the model, and the model might not accurately reflect the data. That means "interpretability" is unreliable. In conclusion, this could be used to criticize SAEs; or rather, language itself does not have a fixed feature set and can be expressed through many combinations.
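The point about L1 regularization and correlated features can be illustrated with a toy case. This is a minimal sketch, not an SAE: it assumes two identical input features and a hand-written L1-penalized squared-error objective (`l1_objective` and `lam` are illustrative names), and it shows that the objective cannot tell apart several different coefficient assignments.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.stack([x, x], axis=1)   # two perfectly correlated (identical) features
y = 2.0 * x                    # the target only depends on the shared direction


def l1_objective(w, lam=0.1):
    """Mean squared error plus an L1 penalty on the coefficients."""
    residual = X @ w - y
    return float(np.mean(residual ** 2) + lam * np.abs(w).sum())


# Three different "explanations" of the same data.
candidates = {
    "all weight on feature 0": np.array([2.0, 0.0]),
    "all weight on feature 1": np.array([0.0, 2.0]),
    "split evenly": np.array([1.0, 1.0]),
}
objectives = {name: l1_objective(w) for name, w in candidates.items()}
for name, value in objectives.items():
    print(f"{name}: {value:.4f}")
```

All three candidates reach the same objective value, so which one an optimizer returns depends on initialization and the optimization path rather than on the data.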
The sparsity loss randomly suppresses features → this is the main reason the feature matching ratio is sensitive to seed differences. We mistakenly guessed that this was due to the dataset's complexity. Explaining the model ≠ explaining the dataset: an SAE is sensitive to its training dataset, but that does not necessarily mean that the SAE reflects the dataset. This is the biggest wrong assumption we have made. The more important point is that feature inspection describes the model, and the main cause is that there are many ways to combine feature bases to explain an LLM representation. SAEs have very low reproducibility.
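One common way to measure a feature matching ratio between two seeds is to compare decoder directions by cosine similarity. The sketch below assumes that definition, a hypothetical threshold of 0.7, and random stand-in matrices instead of trained decoders; the notes do not specify the exact metric.

```python
import numpy as np


def feature_matching_ratio(dec_a, dec_b, threshold=0.7):
    """Fraction of features in dec_a whose best cosine match in dec_b exceeds threshold.

    dec_a, dec_b: (n_features, d_model) decoder weight matrices from two seeds.
    The threshold value is an assumption, not taken from the notes.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    cos = a @ b.T                      # pairwise cosine similarities
    best = cos.max(axis=1)             # best match in dec_b for each feature of dec_a
    return float((best > threshold).mean())


rng = np.random.default_rng(0)
dec_seed_a = rng.normal(size=(128, 64))
dec_seed_b = dec_seed_a[rng.permutation(128)]   # same features, different order
dec_unrelated = rng.normal(size=(128, 64))      # unrelated dictionary

same = feature_matching_ratio(dec_seed_a, dec_seed_b)
unrelated = feature_matching_ratio(dec_seed_a, dec_unrelated)
print(same, unrelated)
```

Taking the maximum per row lets several features match the same partner; a one-to-one assignment (for example Hungarian matching) would be a stricter variant.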
Key Challenges
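- Show that the learned features are sufficiently interpretable and useful (e.g. via an LLM explainer)
We also need to show that the synthetic dataset is diverse enough to cover the model's capabilities, and that the learned features are sufficiently interpretable and useful.
- fake feature
- We need to rebut the objection: "Isn't it robust to the seed simply because the dataset diversity is narrow?"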
Rules
- We should probably run everything in bfloat16
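Solutions
The CE difference is the most important metric for claiming a "faithful" SAE.
As in the toy model paper, couldn't we fix a mathematically efficient allocation of feature directions in advance (i.e. freeze the decoder) and then check only interpretability when comparing?
Assuming the meaningful features are the ones that matter, check further what fraction of all features they make up, and whether using only those features gives more meaningful results.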
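The CE difference mentioned above is usually computed as the next-token cross-entropy with the SAE reconstruction spliced into the forward pass, minus the clean cross-entropy. The sketch below assumes that definition and uses random logits as stand-ins for real model outputs; `clean_logits` and `spliced_logits` are hypothetical names.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy. logits: (n_tokens, vocab), targets: (n_tokens,) int ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def ce_difference(clean_logits, spliced_logits, targets):
    """CE with the SAE reconstruction spliced in, minus the clean CE."""
    return cross_entropy(spliced_logits, targets) - cross_entropy(clean_logits, targets)


rng = np.random.default_rng(0)
targets = rng.integers(0, 50, size=100)
clean_logits = rng.normal(size=(100, 50))
clean_logits[np.arange(100), targets] += 5.0   # stand-in for a confident model
# Stand-in for the logits obtained after replacing activations by the SAE output.
spliced_logits = clean_logits + rng.normal(scale=1.0, size=clean_logits.shape)

delta = ce_difference(clean_logits, spliced_logits, targets)
print(delta)
```

A value near zero means the reconstruction preserves the model's predictions; in a real run the spliced logits would come from hooking the SAE output into the residual stream.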
Report this week
- sensitivity to hyperparameters → we should learn from training
- the scaled model is under linear, which means it is still hard to converge and takes time to get activations
- diversity really matters, but the good signal is that top-p and temperature somehow worked
- But this is still an early stage of training: 1e-7 / 1e-8 and 1-9 (for GPT-2, 4e8 is convergence)
- Dead neuron convergence matters more
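The top-p and temperature settings in the report above control how diverse the sampled synthetic text is. As a minimal sketch of the standard technique (not the generation code used for these runs), nucleus sampling with temperature over a single logit vector looks like this; `sample_top_p` is an illustrative name.

```python
import numpy as np


def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id using temperature scaling followed by nucleus (top-p) filtering."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # token ids by descending probability
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))


rng = np.random.default_rng(0)
peaked = np.array([10.0, 0.0, 0.0, 0.0])
peaked_samples = [sample_top_p(peaked, 1.0, 0.9, rng) for _ in range(50)]
flat_samples = [sample_top_p(peaked, 100.0, 1.0, rng) for _ in range(50)]
print(set(peaked_samples), set(flat_samples))
```

Higher temperature flattens the distribution and a larger `top_p` keeps more of the tail, so both push the generated dataset toward more diversity.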
Experiments
- model size
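- dataset - combination of pretraining and synthetic data
- compare by reducing top-k in a TopK SAE trained on synthetic data
- we expect an existing SAE to still work well on the synthetic dataset even with top-k reduced
- however, since features are sparsely activated online, most of the features actually used are likely meaningful
This is supported by linear representation theory.
Since there are no fake features, we expect less non-linearity and better-aligned Eleuther embeddings, so we expect it to be higher.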
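The top-k reduction comparison listed under Experiments could be sketched as follows. This assumes a TopK SAE that keeps the k largest pre-activations per token and applies a ReLU afterwards, with random untrained weights and a tied decoder purely for illustration; none of these choices are confirmed by the notes.

```python
import numpy as np


def topk_encode(x, w_enc, b_enc, k):
    """Keep only the k largest pre-activations per token; zero out the rest."""
    pre = x @ w_enc + b_enc
    idx = np.argpartition(-pre, k - 1, axis=1)[:, :k]   # indices of the k largest
    rows = np.arange(pre.shape[0])[:, None]
    acts = np.zeros_like(pre)
    acts[rows, idx] = np.maximum(pre[rows, idx], 0.0)   # ReLU on the kept entries
    return acts


def recon_mse(x, w_enc, b_enc, w_dec, k):
    """Mean squared reconstruction error at a given k."""
    return float(np.mean((topk_encode(x, w_enc, b_enc, k) @ w_dec - x) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 32))                     # stand-in activations
w_enc = rng.normal(size=(32, 128)) / np.sqrt(32)  # untrained, illustrative weights
b_enc = np.zeros(128)
w_dec = w_enc.T                                   # tied decoder, an assumption

acts_16 = topk_encode(x, w_enc, b_enc, 16)
acts_4 = topk_encode(x, w_enc, b_enc, 4)
for k in (16, 8, 4):
    print(k, recon_mse(x, w_enc, b_enc, w_dec, k))
```

Because the features kept at a smaller k are a subset of those kept at a larger k, the same trained SAE can be evaluated at several k values without retraining.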
Write and share readme
Exp 8
Architecture agnostic
Exp 9
using same architecture (pythia)
Every pythia is trained on the pile
- 12b - synthetic dataset - diverse superset
- 6b - synthetic dataset - …
- 2b - synthetic dataset - …
- 1b - synthetic dataset
Assumption: a bigger model encloses a smaller model's capabilities
- We expect that if we train SAEs on a bigger model's synthetic data, seed robustness will be lower
- Vice versa
- Detail: we need a consistent dictionary size or feature level
Models
- GPT2
- Llama 3.1, 3.2
- Pythia
- Gemma 2
- (Mistral)

- k16 - 768, 768 * 16
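Reading the line above as top-k = 16 with a 768-wide residual stream and a 16x expansion is an assumption, but it is consistent with the GPT-2 run names in the results (`..._12288_topk_16_...`). The arithmetic:

```python
d_model = 768            # GPT-2 small residual stream width
expansion_factor = 16    # assumed meaning of "768 * 16"
top_k = 16               # assumed meaning of "k16"

dict_size = d_model * expansion_factor
active_fraction = top_k / dict_size

print(dict_size)         # 12288
print(active_fraction)   # fraction of dictionary features active per token
```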
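If it weren't for Paul, the atmosphere would have been really good; it is a shame to lose all of the good teammates. Any remark at all hits a trigger button, which has created a communication culture where nobody can speak freely, and there is a strong persecution complex.
Because everything is taken so sensitively, even I can't speak freely online, which makes me harsher offline and unable to keep the atmosphere pleasant. I already dislike the situation, so there is nothing to be done. In Western European and Anglophone culture there seems to be a norm that you shouldn't hate anything, but I simply dislike it.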
April 23rd
Mention first in the abstract that SAEs are the most scalable unsupervised way of training interpretable features
Implementation / dataset: two differences, SAE choice and dataset section
8 pages except limit
Faithful SAE downstream Proving
Faithful SAE Fake Feature
- Take the top correlated features one at a time
Run everything at the same time as feat_match
Experiment design: compare the number of Fake Features using random noise
- Data preparation
- In-distribution samples: randomly draw 1,000 sentences from domains the model was trained on (e.g. Wikipedia, WebText)
- OOD noise samples: generate 1,000 "random token sequences" that the model has never seen
- Feature extraction
- For both the Standard SAE and the Faithful SAE, compute the token-level SAE activation $f_t \in \mathbb{R}^m$ for each sentence → sequence-level aggregation $F = \sum_t f_t$
- Binarization: $F_{bin}[i] = \mathbf{1}_{F[i] > \tau}$ (e.g. $\tau = 1$)
- Fake Feature definition
- Fake Feature: a feature that fires on more than $p\%$ (e.g. $p = 5\%$) of the OOD noise sentences
- That is, $\text{FakeFeatures} = \{\, i : \frac{1}{1000} \sum_{j=1}^{1000} F_{bin}^{(\text{noise}_j)}[i] > 0.05 \,\}$
- Metric
- $\#\text{FakeFeatures}_{\text{Standard}}$ vs. $\#\text{FakeFeatures}_{\text{Faithful}}$
- Hypothesis: the Faithful SAE has far fewer Fake Features
- Statistical test
- Use bootstrapping to obtain the distribution of feature counts and check the p-value between the two groups
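The design above can be sketched end to end on toy data. The code assumes SAE activations are already available as an array of shape `(n_sentences, n_tokens, n_features)`; the helper names, the toy activations, and the particular bootstrap p-value (the share of resamples where the Faithful count is not below the Standard count) are illustrative choices, not the experiment's actual code.

```python
import numpy as np


def binarize(token_acts, tau=1.0):
    """Sum token-level activations per sentence (F) and threshold them (F_bin)."""
    return token_acts.sum(axis=1) > tau


def fake_feature_mask(f_bin, p=0.05):
    """A feature is 'fake' if it fires on more than a fraction p of noise sentences."""
    return f_bin.mean(axis=0) > p


def bootstrap_fake_counts(f_bin, n_boot=500, p=0.05, seed=0):
    """Resample noise sentences with replacement and count fake features each time."""
    rng = np.random.default_rng(seed)
    n = f_bin.shape[0]
    counts = np.empty(n_boot, dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        counts[b] = fake_feature_mask(f_bin[idx], p).sum()
    return counts


# Toy activations on 100 noise sentences, 8 tokens, 20 features.
standard_acts = np.zeros((100, 8, 20))
standard_acts[:, :, 3] = 0.5      # fires on every noise sentence (sum 4.0 > tau)
standard_acts[:, :, 5] = 0.5      # fires on every noise sentence
standard_acts[:2, :, 7] = 0.5     # fires on 2% of sentences, below p
faithful_acts = np.zeros((100, 8, 20))
faithful_acts[:, :, 3] = 0.5

standard_fake = np.flatnonzero(fake_feature_mask(binarize(standard_acts))).tolist()
faithful_fake = np.flatnonzero(fake_feature_mask(binarize(faithful_acts))).tolist()

counts_standard = bootstrap_fake_counts(binarize(standard_acts), seed=0)
counts_faithful = bootstrap_fake_counts(binarize(faithful_acts), seed=1)
# One-sided bootstrap p-value for "Faithful has fewer fake features than Standard".
p_value = float((counts_faithful >= counts_standard).mean())
print(standard_fake, faithful_fake, p_value)
```

In a real run the two arrays would hold activations of the Standard and Faithful SAE on the same 1,000 random-token sentences.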
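Why is this a "Fake Feature" experiment?
- Random noise is input that the model can never interpret "reasonably"
- Features that nevertheless fire frequently in the SAE are "phantom (fantasy) features", i.e. not real internal concepts of the model
- Because the Faithful SAE is trained only on self-generated data, without relying on OOD data, it should have fewer such Fake Features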
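The random-token noise inputs used in this experiment could be generated by sampling token ids uniformly from the vocabulary. This is a sketch under assumptions: the sequence length of 32 is arbitrary, and 50257 is the GPT-2 vocabulary size, which would need to be changed for the other models.

```python
import numpy as np


def random_token_sequences(n_sequences, seq_len, vocab_size, seed=0):
    """Uniformly random token ids, shape (n_sequences, seq_len)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, vocab_size, size=(n_sequences, seq_len))


noise_ids = random_token_sequences(1000, 32, vocab_size=50257)
print(noise_ids.shape, noise_ids.min(), noise_ids.max())
```

Uniform sampling may include special tokens; whether to exclude them is a design choice the notes leave open.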
Faithful SAE Downstream Provings
Model
Training
F1
Classification
Name
Yelp results look odd for GPT and Pythia.
Legacy results
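The result dumps in this section are JSON-like but contain trailing commas, so `json.loads` rejects them as written. A minimal sketch for loading them and computing accuracy drops relative to the baseline follows; it assumes trailing commas are the only deviation from valid JSON, and `run_a` is a placeholder key standing in for the long run names. The three numbers are the GPT-2 faithful SST-2 values from the dump.

```python
import json
import re

raw = """{ "run_a": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312,
"sae_acc": 0.6381880733944953, "recon_acc": 0.5819954128440368, }, } }"""


def load_lenient(text):
    """Parse JSON-like text after removing trailing commas before } or ]."""
    return json.loads(re.sub(r",\s*([}\]])", r"\1", text))


def accuracy_drops(results):
    """Return (run, task, baseline - sae, baseline - recon) for every entry."""
    rows = []
    for run, tasks in results.items():
        for task, m in tasks.items():
            rows.append((run, task,
                         m["baseline_acc"] - m["sae_acc"],
                         m["baseline_acc"] - m["recon_acc"]))
    return rows


rows = accuracy_drops(load_lenient(raw))
for run, task, sae_drop, recon_drop in rows:
    print(f"{run} {task}: sae drop {sae_drop:.4f}, recon drop {recon_drop:.4f}")
```

The regex would also alter a comma followed by a closing brace inside a string value, which does not occur in these dumps.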
gpt2
{ "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_faithful-gpt2-small_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_faithful-gpt2-small_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6381880733944953, "recon_acc": 0.5819954128440368, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6318821315090541, "recon_f1": 0.5054111627039339, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6936720997123682, "recon_acc": 0.6845637583892618, "baseline_f1": 0.6088007739109902, "sae_f1": 0.42136475138804325, "recon_f1": 0.5230834754391598, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.7680263157894738, "recon_acc": 0.750328947368421, "baseline_f1": 0.858948305546186, "sae_f1": 0.7686617676668895, "recon_f1": 0.7505566472182452, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8151184210526317, "recon_acc": 0.7999736842105263, "baseline_f1": 0.8872207883759649, "sae_f1": 0.8149999994807476, "recon_f1": 0.7992663404596958, } }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_fineweb_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_fineweb_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6313073394495412, "recon_acc": 0.5768348623853211, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6190008478699911, "recon_f1": 0.49530534745128274, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6912751677852349, "recon_acc": 0.6812080536912752, "baseline_f1": 0.6088007739109902, "sae_f1": 0.4161159819399749, "recon_f1": 0.5238862923900236, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.7672368421052631, "recon_acc": 0.7452631578947368, "baseline_f1": 0.858948305546186, "sae_f1": 0.7677191066746017, "recon_f1": 0.7456838796788295, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8185, "recon_acc": 0.8016184210526316, 
"baseline_f1": 0.8872207883759649, "sae_f1": 0.8184214690963898, "recon_f1": 0.8004130165513998, } }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_pile-uncopyrighted_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_pile-uncopyrighted_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6869266055045872, "recon_acc": 0.6009174311926606, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6829182968443854, "recon_f1": 0.5417190294774679, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6907957813998082, "recon_acc": 0.6845637583892618, "baseline_f1": 0.6088007739109902, "sae_f1": 0.4115262391100577, "recon_f1": 0.5363058957046498, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.77, "recon_acc": 0.7582894736842105, "baseline_f1": 0.858948305546186, "sae_f1": 0.7708296359063063, "recon_f1": 0.7590131643202838, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8099736842105263, "recon_acc": 0.7971578947368421, "baseline_f1": 0.8872207883759649, "sae_f1": 0.8098772521989184, "recon_f1": 0.7963460956997179, } } }
gemma 2b
{ "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_faithful-gemma2-2b_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_faithful-gemma2-2b_1024_9764": { "stanfordnlp/sst2": { "baseline_acc": 0.7012614678899083, "sae_acc": 0.6639908256880733, "recon_acc": 0.6548165137614679, "baseline_f1": 0.6932816931945159, "sae_f1": 0.6555364833705732, "recon_f1": 0.6463415491665134, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7023010546500479, "sae_acc": 0.7018216682646212, "recon_acc": 0.6879194630872483, "baseline_f1": 0.5791130763716196, "sae_f1": 0.5426634423577605, "recon_f1": 0.5362575660160247, }, "ag_news": { "baseline_acc": 0.7902631578947368, "sae_acc": 0.7454605263157894, "recon_acc": 0.7459868421052631, "baseline_f1": 0.7930319932458574, "sae_f1": 0.7471748214266285, "recon_f1": 0.7490910079276412, }, "yelp_polarity": { "baseline_acc": 0.6705, "sae_acc": 0.6401315789473685, "recon_acc": 0.6371842105263158, "baseline_f1": 0.6433888923712144, "sae_f1": 0.6092801899430627, "recon_f1": 0.6062209211703413, } }, "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_fineweb_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_fineweb_1024_9764": { "stanfordnlp/sst2": { "baseline_acc": 0.7012614678899083, "sae_acc": 0.6754587155963303, "recon_acc": 0.6559633027522935, "baseline_f1": 0.6932816931945159, "sae_f1": 0.6689383392077671, "recon_f1": 0.6451221139029649, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7023010546500479, "sae_acc": 0.6936720997123682, "recon_acc": 0.6821668264621285, "baseline_f1": 0.5791130763716196, "sae_f1": 0.5553361529543475, "recon_f1": 0.5383967336884994, }, "ag_news": { "baseline_acc": 0.7902631578947368, "sae_acc": 0.7619736842105262, "recon_acc": 0.7624342105263158, "baseline_f1": 0.7930319932458574, "sae_f1": 0.7646235556915335, "recon_f1": 0.7659606745601961, } } }
llama 3b
{ "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_faithful-llama3.2-3b_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_faithful-llama3.2-3b_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7763761467889908, "recon_acc": 0.8405963302752294, "baseline_f1": 0.9104974881110428, "sae_f1": 0.7756670460950286, "recon_f1": 0.8405587989220263, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7425695110258869, "recon_acc": 0.7708533077660594, "baseline_f1": 0.7152162539211186, "sae_f1": 0.6014255571865801, "recon_f1": 0.6896693761389634, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.8459210526315789, "recon_acc": 0.8769736842105263, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8452459442447388, "recon_f1": 0.8769955683104758, } }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_fineweb_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_fineweb_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7620412844036697, "recon_acc": 0.8342889908256881, "baseline_f1": 0.9104974881110428, "sae_f1": 0.7602430376398335, "recon_f1": 0.8340347190661116, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7387344199424737, "recon_acc": 0.7641418983700863, "baseline_f1": 0.7152162539211186, "sae_f1": 0.5946387101656074, "recon_f1": 0.6826465916857021, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.8479605263157894, "recon_acc": 0.8742105263157894, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8475250080410112, "recon_f1": 0.8741664410440275, } }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_pile-uncopyrighted_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_pile-uncopyrighted_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7729357798165137, "recon_acc": 0.8297018348623852, 
"baseline_f1": 0.9104974881110428, "sae_f1": 0.7720794006212012, "recon_f1": 0.8295297429409766, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7416107382550337, "recon_acc": 0.7646212847555129, "baseline_f1": 0.7152162539211186, "sae_f1": 0.6028704698560002, "recon_f1": 0.6822661503487519, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.84875, "recon_acc": 0.8773026315789474, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8482719220887277, "recon_f1": 0.8772401349425806, } } }
llama8b
pythia 1.4
{ "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_faithful-pythia1.4b_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_faithful-pythia1.4b_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.625, "recon_acc": 0.6628440366972477, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5982676358780978, "recon_f1": 0.6569087986667486, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7176414189837008, "recon_acc": 0.7205177372962608, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5682400889034706, "recon_f1": 0.6250376214968614, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8449342105263158, "recon_acc": 0.8734868421052632, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8447155328402953, "recon_f1": 0.8735220331382518, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.5837155963302751, "recon_acc": 0.6399082568807339, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5252706925463307, "recon_f1": 0.6127023738032431, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7099712368168745, "recon_acc": 0.7157238734419942, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5558493449222908, "recon_f1": 0.6189774843153626, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8411184210526316, "recon_acc": 0.8690789473684211, "baseline_f1": 0.9067620442309687, "sae_f1": 0.840753324361916, "recon_f1": 0.869264166479926, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.6083715596330275, "recon_acc": 0.6634174311926606, "baseline_f1": 
0.8263198150728646, "sae_f1": 0.5637099029658558, "recon_f1": 0.6408800808584558, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7142857142857143, "recon_acc": 0.713326941514861, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5691509178971268, "recon_f1": 0.630212930327982, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8497368421052631, "recon_acc": 0.868421052631579, "baseline_f1": 0.9067620442309687, "sae_f1": 0.849209278774073, "recon_f1": 0.8685165182350671, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_FLAN_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_FLAN_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.591743119266055, "recon_acc": 0.6347477064220184, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5330501889425321, "recon_f1": 0.5987975177361987, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7037392138063279, "recon_acc": 0.7056567593480345, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5364699942307098, "recon_f1": 0.5907758660516629, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8282236842105264, "recon_acc": 0.855, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8277841057173874, "recon_f1": 0.8551377247582921, }, "yelp_polarity": { "baseline_acc": 0.9378552631578947, "sae_acc": 0.8199342105263158, "recon_acc": 0.8684342105263159, "baseline_f1": 0.9378443435018236, "sae_f1": 0.8167165640358525, "recon_f1": 0.8682904464532779, } }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_merged_uncensored_alpaca_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_merged_uncensored_alpaca_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.5974770642201834, "recon_acc": 0.6198394495412844, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5383769226221773, "recon_f1": 0.580481774864037, }, "nyu-mll/glue/cola": { "baseline_acc": 
0.74784276126558, "sae_acc": 0.7070949185043145, "recon_acc": 0.7214765100671141, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5463782289363686, "recon_f1": 0.6233436897375952, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.835, "recon_acc": 0.867171052631579, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8343879746772225, "recon_f1": 0.8673697344263738, }, "yelp_polarity": { "baseline_acc": 0.9378552631578947, "sae_acc": 0.8594868421052632, "recon_acc": 0.8900921052631579, "baseline_f1": 0.9378443435018236, "sae_f1": 0.8584010102007948, "recon_f1": 0.8900115172343989, } }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_open-instruct-uncensored-alpaca_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_open-instruct-uncensored-alpaca_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.6037844036697249, "recon_acc": 0.6376146788990826, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5554244093038755, "recon_f1": 0.611534593194968, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7080536912751678, "recon_acc": 0.7166826462128475, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5517751045102514, "recon_f1": 0.6279544326917945, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8393421052631579, "recon_acc": 0.8647368421052631, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8386828043318748, "recon_f1": 0.8650205433625378, }, "yelp_polarity": { "baseline_acc": 0.9378552631578947, "sae_acc": 0.8686447368421053, "recon_acc": 0.8956447368421052, "baseline_f1": 0.9378443435018236, "sae_f1": 0.8677390498989366, "recon_f1": 0.8955579157849463, } } }
llama 1b
{ "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_faithful-llama3.2-1b_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_faithful-llama3.2-1b_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.6771788990825688, "recon_acc": 0.7069954128440367, "baseline_f1": 0.8001009384586332, "sae_f1": 0.6604980539245782, "recon_f1": 0.7044136847971996, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7056567593480345, "recon_acc": 0.7377756471716204, "baseline_f1": 0.6381572073006643, "sae_f1": 0.46315353559740324, "recon_f1": 0.6049876680314521, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7901315789473684, "recon_acc": 0.8191447368421052, "baseline_f1": 0.860867910090608, "sae_f1": 0.7887296126660821, "recon_f1": 0.8189521248408049, } }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.661697247706422, "recon_acc": 0.7247706422018348, "baseline_f1": 0.8001009384586332, "sae_f1": 0.6395535445833574, "recon_f1": 0.7227740144313253, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7075743048897412, "recon_acc": 0.736816874400767, "baseline_f1": 0.6381572073006643, "sae_f1": 0.4728249632110051, "recon_f1": 0.6085546284839893, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7933552631578947, "recon_acc": 0.8174342105263158, "baseline_f1": 0.860867910090608, "sae_f1": 0.792079768259315, "recon_f1": 0.8172451481958556, }, }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.6680045871559632, "recon_acc": 0.7247706422018348, 
"baseline_f1": 0.8001009384586332, "sae_f1": 0.6514118480752982, "recon_f1": 0.7235045413291168, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7070949185043145, "recon_acc": 0.7339405560882071, "baseline_f1": 0.6381572073006643, "sae_f1": 0.4701210319231057, "recon_f1": 0.5992029535466976, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7904605263157896, "recon_acc": 0.8163157894736842, "baseline_f1": 0.860867910090608, "sae_f1": 0.789293579698522, "recon_f1": 0.8164310093089504, } } }
pythia 2.8
{ "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_faithful-pythia1.4b_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_faithful-pythia1.4b_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6485091743119267, "recon_acc": 0.6674311926605505, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6394634076507092, "recon_f1": 0.6507309664018397, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5023969319271333, "recon_acc": 0.62464046021093, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3266744710043679, "recon_f1": 0.45134193393801403, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5873026315789474, "recon_acc": 0.594078947368421, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5863143504752567, "recon_f1": 0.5964254104882054, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8822105263157896, "recon_acc": 0.8896578947368421, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8821992852293431, "recon_f1": 0.8894904792657169, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_fineweb_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_fineweb_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6353211009174311, "recon_acc": 0.7138761467889909, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6303691839253754, "recon_f1": 0.7061184907387809, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.6212847555129435, "baseline_f1": 0.3725260677388479, "sae_f1": 0.32386999539827843, "recon_f1": 0.4578079209216317, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.6030921052631579, "recon_acc": 0.5969736842105263, "baseline_f1": 0.8468502569930477, "sae_f1": 0.6009567444105985, "recon_f1": 0.6001021008319489, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8942368421052631, "recon_acc": 0.9004736842105263, "baseline_f1": 
0.9379692987935441, "sae_f1": 0.894207685428926, "recon_f1": 0.9004008374365153, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_FLAN_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_FLAN_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.5802752293577982, "recon_acc": 0.6496559633027523, "baseline_f1": 0.8818021917383296, "sae_f1": 0.5662361526556061, "recon_f1": 0.638586294582631, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.5517737296260786, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3243620390042899, "recon_f1": 0.4103185016814908, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5373684210526316, "recon_acc": 0.5425, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5356346866398047, "recon_f1": 0.5498615517659742, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8440263157894736, "recon_acc": 0.8671842105263159, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8439853359439617, "recon_f1": 0.8671279564955111, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pythia-2.8b_synthetic_180k_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pythia-2.8b_synthetic_180k_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6559633027522935, "recon_acc": 0.6892201834862386, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6487296756412979, "recon_f1": 0.6835865602014071, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.5800575263662512, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3248497386693457, "recon_f1": 0.43478813762656554, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5732894736842105, "recon_acc": 0.59875, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5731572109943672, "recon_f1": 0.6027152942034848, }, "yelp_polarity": { "baseline_acc": 0.938, 
"sae_acc": 0.8780263157894737, "recon_acc": 0.8898421052631579, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8778787915295191, "recon_f1": 0.8897811277723608, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pile-uncopyrighted_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pile-uncopyrighted_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.669151376146789, "recon_acc": 0.661697247706422, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6633291361786984, "recon_f1": 0.6430236708749482, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5023969319271333, "recon_acc": 0.6160115052732502, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3271537002080098, "recon_f1": 0.4583511769225481, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.6008552631578947, "recon_acc": 0.6094078947368421, "baseline_f1": 0.8468502569930477, "sae_f1": 0.6005771593515399, "recon_f1": 0.6079249032315457, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8875131578947368, "recon_acc": 0.892421052631579, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8874637731864572, "recon_f1": 0.8923647758257278, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_merged_uncensored_alpaca_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_merged_uncensored_alpaca_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6227064220183487, "recon_acc": 0.6525229357798166, "baseline_f1": 0.8818021917383296, "sae_f1": 0.5909858500786827, "recon_f1": 0.6211549979390429, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5019175455417066, "recon_acc": 0.5407478427612655, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3259054141650223, "recon_f1": 0.40014348262347194, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5878289473684211, "recon_acc": 0.5875657894736842, "baseline_f1": 
0.8468502569930477, "sae_f1": 0.59091706439711, "recon_f1": 0.5927898475190146, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8920921052631579, "recon_acc": 0.8955394736842105, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8920743805631737, "recon_f1": 0.8954434496310095, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_open-instruct-uncensored-alpaca_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_open-instruct-uncensored-alpaca_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6668577981651376, "recon_acc": 0.6811926605504588, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6587972733855976, "recon_f1": 0.6715270435207281, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5004793863854267, "recon_acc": 0.552732502396932, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3250463740674675, "recon_f1": 0.4103940652795335, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5691447368421052, "recon_acc": 0.5763815789473684, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5720069778443257, "recon_f1": 0.5804061236468246, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8923026315789474, "recon_acc": 0.8966315789473684, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8923009591141421, "recon_f1": 0.8965514180464571, } } }
llama 3b
{ "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_faithful-llama3.2-3b_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_faithful-llama3.2-3b_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7763761467889908, "recon_acc": 0.8405963302752294, "baseline_f1": 0.9104974881110428, "sae_f1": 0.7756670460950286, "recon_f1": 0.8405587989220263, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7425695110258869, "recon_acc": 0.7708533077660594, "baseline_f1": 0.7152162539211186, "sae_f1": 0.6014255571865801, "recon_f1": 0.6896693761389634, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.8459210526315789, "recon_acc": 0.8769736842105263, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8452459442447388, "recon_f1": 0.8769955683104758, } }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_fineweb_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_fineweb_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7620412844036697, "recon_acc": 0.8342889908256881, "baseline_f1": 0.9104974881110428, "sae_f1": 0.7602430376398335, "recon_f1": 0.8340347190661116, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7387344199424737, "recon_acc": 0.7641418983700863, "baseline_f1": 0.7152162539211186, "sae_f1": 0.5946387101656074, "recon_f1": 0.6826465916857021, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.8479605263157894, "recon_acc": 0.8742105263157894, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8475250080410112, "recon_f1": 0.8741664410440275, } }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_pile-uncopyrighted_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_pile-uncopyrighted_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7729357798165137, "recon_acc": 0.8297018348623852, 
"baseline_f1": 0.9104974881110428, "sae_f1": 0.7720794006212012, "recon_f1": 0.8295297429409766, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7416107382550337, "recon_acc": 0.7646212847555129, "baseline_f1": 0.7152162539211186, "sae_f1": 0.6028704698560002, "recon_f1": 0.6822661503487519, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.84875, "recon_acc": 0.8773026315789474, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8482719220887277, "recon_f1": 0.8772401349425806, } } }
pythia 1.4
{ "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_faithful-pythia1.4b_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_faithful-pythia1.4b_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.625, "recon_acc": 0.6628440366972477, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5982676358780978, "recon_f1": 0.6569087986667486, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7176414189837008, "recon_acc": 0.7205177372962608, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5682400889034706, "recon_f1": 0.6250376214968614, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8449342105263158, "recon_acc": 0.8734868421052632, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8447155328402953, "recon_f1": 0.8735220331382518, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.5837155963302751, "recon_acc": 0.6399082568807339, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5252706925463307, "recon_f1": 0.6127023738032431, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7099712368168745, "recon_acc": 0.7157238734419942, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5558493449222908, "recon_f1": 0.6189774843153626, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8411184210526316, "recon_acc": 0.8690789473684211, "baseline_f1": 0.9067620442309687, "sae_f1": 0.840753324361916, "recon_f1": 0.869264166479926, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.6083715596330275, "recon_acc": 0.6634174311926606, "baseline_f1": 
0.8263198150728646, "sae_f1": 0.5637099029658558, "recon_f1": 0.6408800808584558, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7142857142857143, "recon_acc": 0.713326941514861, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5691509178971268, "recon_f1": 0.630212930327982, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8497368421052631, "recon_acc": 0.868421052631579, "baseline_f1": 0.9067620442309687, "sae_f1": 0.849209278774073, "recon_f1": 0.8685165182350671, }, } }
llama 1b
{ "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_faithful-llama3.2-1b_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_faithful-llama3.2-1b_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.6771788990825688, "recon_acc": 0.7069954128440367, "baseline_f1": 0.8001009384586332, "sae_f1": 0.6604980539245782, "recon_f1": 0.7044136847971996, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7056567593480345, "recon_acc": 0.7377756471716204, "baseline_f1": 0.6381572073006643, "sae_f1": 0.46315353559740324, "recon_f1": 0.6049876680314521, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7901315789473684, "recon_acc": 0.8191447368421052, "baseline_f1": 0.860867910090608, "sae_f1": 0.7887296126660821, "recon_f1": 0.8189521248408049, } }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.661697247706422, "recon_acc": 0.7247706422018348, "baseline_f1": 0.8001009384586332, "sae_f1": 0.6395535445833574, "recon_f1": 0.7227740144313253, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7075743048897412, "recon_acc": 0.736816874400767, "baseline_f1": 0.6381572073006643, "sae_f1": 0.4728249632110051, "recon_f1": 0.6085546284839893, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7933552631578947, "recon_acc": 0.8174342105263158, "baseline_f1": 0.860867910090608, "sae_f1": 0.792079768259315, "recon_f1": 0.8172451481958556, }, }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.6680045871559632, "recon_acc": 0.7247706422018348, 
"baseline_f1": 0.8001009384586332, "sae_f1": 0.6514118480752982, "recon_f1": 0.7235045413291168, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7070949185043145, "recon_acc": 0.7339405560882071, "baseline_f1": 0.6381572073006643, "sae_f1": 0.4701210319231057, "recon_f1": 0.5992029535466976, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7904605263157896, "recon_acc": 0.8163157894736842, "baseline_f1": 0.860867910090608, "sae_f1": 0.789293579698522, "recon_f1": 0.8164310093089504, } } }
pythia 2.8
{ "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pythia-2.8b_synthetic_180k_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pythia-2.8b_synthetic_180k_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6559633027522935, "recon_acc": 0.6892201834862386, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6487296756412979, "recon_f1": 0.6835865602014071, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.5800575263662512, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3248497386693457, "recon_f1": 0.43478813762656554, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5732894736842105, "recon_acc": 0.59875, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5731572109943672, "recon_f1": 0.6027152942034848, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8780263157894737, "recon_acc": 0.8898421052631579, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8778787915295191, "recon_f1": 0.8897811277723608, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_fineweb_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_fineweb_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6353211009174311, "recon_acc": 0.7138761467889909, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6303691839253754, "recon_f1": 0.7061184907387809, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.6212847555129435, "baseline_f1": 0.3725260677388479, "sae_f1": 0.32386999539827843, "recon_f1": 0.4578079209216317, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.6030921052631579, "recon_acc": 0.5969736842105263, "baseline_f1": 0.8468502569930477, "sae_f1": 0.6009567444105985, "recon_f1": 0.6001021008319489, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8942368421052631, "recon_acc": 0.9004736842105263, 
"baseline_f1": 0.9379692987935441, "sae_f1": 0.894207685428926, "recon_f1": 0.9004008374365153, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pile-uncopyrighted_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pile-uncopyrighted_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.669151376146789, "recon_acc": 0.661697247706422, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6633291361786984, "recon_f1": 0.6430236708749482, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5023969319271333, "recon_acc": 0.6160115052732502, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3271537002080098, "recon_f1": 0.4583511769225481, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.6008552631578947, "recon_acc": 0.6094078947368421, "baseline_f1": 0.8468502569930477, "sae_f1": 0.6005771593515399, "recon_f1": 0.6079249032315457, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8875131578947368, "recon_acc": 0.892421052631579, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8874637731864572, "recon_f1": 0.8923647758257278, } } }
❯ python analyze_data.py

Average sae_acc and sae_f1 across models and tasks:
  faithful: sae_acc=0.7042, sae_f1=0.6347
  fineweb: sae_acc=0.7009, sae_f1=0.6298
  pile-uncopyrighted: sae_acc=0.7056, sae_f1=0.6342

Pairwise comparisons:
  faithful vs fineweb:
    Average acc diff: 0.0034 (10/15 wins)
    Average f1 diff: 0.0049 (9/15 wins)
  faithful vs pile-uncopyrighted:
    Average acc diff: -0.0014 (7/15 wins)
    Average f1 diff: 0.0005 (6/15 wins)
  fineweb vs faithful:
    Average acc diff: -0.0034 (5/15 wins)
    Average f1 diff: -0.0049 (6/15 wins)
  fineweb vs pile-uncopyrighted:
    Average acc diff: -0.0047 (7/15 wins)
    Average f1 diff: -0.0044 (7/15 wins)
  pile-uncopyrighted vs faithful:
    Average acc diff: 0.0014 (7/15 wins)
    Average f1 diff: -0.0005 (9/15 wins)
  pile-uncopyrighted vs fineweb:
    Average acc diff: 0.0047 (8/15 wins)
    Average f1 diff: 0.0044 (8/15 wins)

Analysis by model:
  GPT2:
    faithful: sae_acc=0.7000, sae_f1=0.6073
    fineweb: sae_acc=0.6966, sae_f1=0.6009
    pile: sae_acc=0.7159, sae_f1=0.6218
  Llama-3B:
    faithful: sae_acc=0.7883, sae_f1=0.7408
    fineweb: sae_acc=0.7829, sae_f1=0.7341
    pile: sae_acc=0.7878, sae_f1=0.7411
  Llama-1B:
    faithful: sae_acc=0.7243, sae_f1=0.6375
    fineweb: sae_acc=0.7209, sae_f1=0.6348
    pile: sae_acc=0.7219, sae_f1=0.6369
  Pythia-1.4B:
    faithful: sae_acc=0.7292, sae_f1=0.6704
    fineweb: sae_acc=0.7241, sae_f1=0.6607
    pile: sae_acc=0.7116, sae_f1=0.6406
  Pythia-2.8B:
    faithful: sae_acc=0.5794, sae_f1=0.5175
    fineweb: sae_acc=0.5798, sae_f1=0.5184
    pile: sae_acc=0.5908, sae_f1=0.5304
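A minimal sketch of how per-dataset comparisons like the ones printed above could be computed from the per-task results. This is not the actual analyze_data.py, whose contents are not shown in these notes. The helper names (`dataset_of`, `collect`, `pairwise`) and the dataset tags matched inside the run keys are assumptions for illustration. The sample values are the Llama-3.2-1B sst2 and cola `sae_acc` numbers from the dump above, with each key shortened to its seed-42 half.

```python
from collections import defaultdict
from itertools import permutations

# Dataset tags assumed to appear in the run keys (hypothetical matching rule).
DATASETS = ("faithful", "fineweb", "pile-uncopyrighted")


def dataset_of(run_key):
    """Return the first known dataset tag found in a run key, else None."""
    for name in DATASETS:
        if f"_{name}" in run_key:
            return name
    return None


def collect(results, metric):
    """Build dataset -> {(model, task): metric value} from the nested results."""
    table = defaultdict(dict)
    for run_key, tasks in results.items():
        name = dataset_of(run_key)
        if name is None:
            continue  # skip runs trained on datasets outside DATASETS
        model = run_key.split("_blocks.")[0]
        for task, scores in tasks.items():
            table[name][(model, task)] = scores[metric]
    return table


def pairwise(table):
    """For each ordered dataset pair: (mean diff, wins, cells compared)."""
    out = {}
    for a, b in permutations(table, 2):
        shared = sorted(set(table[a]) & set(table[b]))
        if not shared:
            continue
        diffs = [table[a][cell] - table[b][cell] for cell in shared]
        wins = sum(d > 0 for d in diffs)  # ties count as a win for neither side
        out[(a, b)] = (sum(diffs) / len(diffs), wins, len(shared))
    return out


# Sample input: sae_acc values copied from the Llama-3.2-1B results above.
PREFIX = "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_"
results = {
    PREFIX + "faithful-llama3.2-1b_512_195311": {
        "stanfordnlp/sst2": {"sae_acc": 0.6771788990825688},
        "nyu-mll/glue/cola": {"sae_acc": 0.7056567593480345},
    },
    PREFIX + "fineweb_512_195311": {
        "stanfordnlp/sst2": {"sae_acc": 0.661697247706422},
        "nyu-mll/glue/cola": {"sae_acc": 0.7075743048897412},
    },
    PREFIX + "pile-uncopyrighted_512_195311": {
        "stanfordnlp/sst2": {"sae_acc": 0.6680045871559632},
        "nyu-mll/glue/cola": {"sae_acc": 0.7070949185043145},
    },
}

table = collect(results, "sae_acc")
summary = pairwise(table)
for (a, b), (mean_diff, wins, n) in summary.items():
    print(f"{a} vs {b}: Average acc diff: {mean_diff:.4f} ({wins}/{n} wins)")
```

Counting only strict wins means the two directions of a pair need not sum to the total, which would be consistent with the 7/15 and 7/15 split reported for faithful vs pile-uncopyrighted accuracy.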
Fake feature
gemma 2b
{ "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_faithful-gemma2-2b_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_faithful-gemma2-2b_1024_9764": { "avg_fake_feature_ratio": 0.006564670138888889, "avg_fake_feature_count1": 121.0, "fake_feature_count1": 120, "fake_feature_count2": 122, "fake_feature_ratio1": 0.006510416666666667, "fake_feature_ratio2": 0.006618923611111111, "d_sae1": 18432, "d_sae2": 18432 }, "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_fineweb_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_fineweb_1024_9764": { "avg_fake_feature_ratio": 0.007161458333333333, "avg_fake_feature_count1": 132.0, "fake_feature_count1": 127, "fake_feature_count2": 137, "fake_feature_ratio1": 0.006890190972222222, "fake_feature_ratio2": 0.007432725694444444, "d_sae1": 18432, "d_sae2": 18432 }, "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_pile-uncopyrighted_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_pile-uncopyrighted_1024_9764": { "avg_fake_feature_ratio": 0.006673177083333333, "avg_fake_feature_count1": 123.0, "fake_feature_count1": 118, "fake_feature_count2": 128, "fake_feature_ratio1": 0.006401909722222222, "fake_feature_ratio2": 0.006944444444444444, "d_sae1": 18432, "d_sae2": 18432 } }
gpt2
{ "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_faithful-gpt2-small_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_faithful-gpt2-small_128_24413": { "avg_fake_feature_ratio": 0.003011067708333333, "avg_fake_feature_count1": 37.0, "fake_feature_count1": 36, "fake_feature_count2": 38, "fake_feature_ratio1": 0.0029296875, "fake_feature_ratio2": 0.0030924479166666665, "d_sae1": 12288, "d_sae2": 12288 }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_fineweb_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_fineweb_128_24413": { "avg_fake_feature_ratio": 0.002726236979166667, "avg_fake_feature_count1": 33.5, "fake_feature_count1": 34, "fake_feature_count2": 33, "fake_feature_ratio1": 0.0027669270833333335, "fake_feature_ratio2": 0.002685546875, "d_sae1": 12288, "d_sae2": 12288 }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_pile-uncopyrighted_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_pile-uncopyrighted_128_24413": { "avg_fake_feature_ratio": 0.002644856770833333, "avg_fake_feature_count1": 32.5, "fake_feature_count1": 33, "fake_feature_count2": 32, "fake_feature_ratio1": 0.002685546875, "fake_feature_ratio2": 0.0026041666666666665, "d_sae1": 12288, "d_sae2": 12288 } }
llama1
{ "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_faithful-llama3.2-1b_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_faithful-llama3.2-1b_512_195311": { "avg_fake_feature_ratio": 0.007882254464285714, "avg_fake_feature_count1": 113.0, "fake_feature_count1": 120, "fake_feature_count2": 106, "fake_feature_ratio1": 0.008370535714285714, "fake_feature_ratio2": 0.007393973214285714, "d_sae1": 14336, "d_sae2": 14336 }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_195311": { "avg_fake_feature_ratio": 0.007045200892857143, "avg_fake_feature_count1": 101.0, "fake_feature_count1": 101, "fake_feature_count2": 101, "fake_feature_ratio1": 0.007045200892857143, "fake_feature_ratio2": 0.007045200892857143, "d_sae1": 14336, "d_sae2": 14336 }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_195311": { "avg_fake_feature_ratio": 0.007463727678571428, "avg_fake_feature_count1": 107.0, "fake_feature_count1": 109, "fake_feature_count2": 105, "fake_feature_ratio1": 0.007603236607142857, "fake_feature_ratio2": 0.00732421875, "d_sae1": 14336, "d_sae2": 14336 } }
llama3
{ "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_faithful-llama3.2-3b_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_faithful-llama3.2-3b_512_195311": { "avg_fake_feature_ratio": 0.14591471354166669, "avg_fake_feature_count1": 2689.5, "fake_feature_count1": 2652, "fake_feature_count2": 2727, "fake_feature_ratio1": 0.14388020833333334, "fake_feature_ratio2": 0.14794921875, "d_sae1": 18432, "d_sae2": 18432 }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_fineweb_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_fineweb_512_195311": { "avg_fake_feature_ratio": 0.13460286458333331, "avg_fake_feature_count1": 2481.0, "fake_feature_count1": 2536, "fake_feature_count2": 2426, "fake_feature_ratio1": 0.13758680555555555, "fake_feature_ratio2": 0.1316189236111111, "d_sae1": 18432, "d_sae2": 18432 }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_pile-uncopyrighted_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_pile-uncopyrighted_512_195311": { "avg_fake_feature_ratio": 0.14176432291666669, "avg_fake_feature_count1": 2613.0, "fake_feature_count1": 2599, "fake_feature_count2": 2627, "fake_feature_ratio1": 0.14100477430555555, "fake_feature_ratio2": 0.1425238715277778, "d_sae1": 18432, "d_sae2": 18432 } }
llama8
{ "Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_42_faithful-llama3.1-8b_512_292967-Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_49_faithful-llama3.1-8b_512_292967": { "avg_fake_feature_ratio": 0.036041259765625, "avg_fake_feature_count1": 590.5, "fake_feature_count1": 605, "fake_feature_count2": 576, "fake_feature_ratio1": 0.03692626953125, "fake_feature_ratio2": 0.03515625, "d_sae1": 16384, "d_sae2": 16384 }, "Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_42_fineweb_512_292967-Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_49_fineweb_512_292967": { "avg_fake_feature_ratio": 0.03955078125, "avg_fake_feature_count1": 648.0, "fake_feature_count1": 643, "fake_feature_count2": 653, "fake_feature_ratio1": 0.03924560546875, "fake_feature_ratio2": 0.03985595703125, "d_sae1": 16384, "d_sae2": 16384 }, "Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_42_pile-uncopyrighted_512_292967-Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_49_pile-uncopyrighted_512_292967": { "avg_fake_feature_ratio": 0.040130615234375, "avg_fake_feature_count1": 657.5, "fake_feature_count1": 663, "fake_feature_count2": 652, "fake_feature_ratio1": 0.04046630859375, "fake_feature_ratio2": 0.039794921875, "d_sae1": 16384, "d_sae2": 16384 } }
pythia1
{ "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_faithful-pythia1.4b_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_faithful-pythia1.4b_512_140185": { "avg_fake_feature_ratio": 0.07913643973214285, "avg_fake_feature_count1": 1134.5, "fake_feature_count1": 1059, "fake_feature_count2": 1210, "fake_feature_ratio1": 0.07386997767857142, "fake_feature_ratio2": 0.08440290178571429, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_140185": { "avg_fake_feature_ratio": 0.07662527901785715, "avg_fake_feature_count1": 1098.5, "fake_feature_count1": 1101, "fake_feature_count2": 1096, "fake_feature_ratio1": 0.07679966517857142, "fake_feature_ratio2": 0.07645089285714286, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_140185": { "avg_fake_feature_ratio": 0.07847377232142858, "avg_fake_feature_count1": 1125.0, "fake_feature_count1": 1139, "fake_feature_count2": 1111, "fake_feature_ratio1": 0.07945033482142858, "fake_feature_ratio2": 0.07749720982142858, "d_sae1": 14336, "d_sae2": 14336 } }
pythia2
{ "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_fineweb_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_fineweb_512_156793": { "avg_fake_feature_ratio": 0.005566406249999999, "avg_fake_feature_count1": 85.5, "fake_feature_count1": 87, "fake_feature_count2": 84, "fake_feature_ratio1": 0.0056640625, "fake_feature_ratio2": 0.00546875, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pythia-2.8b_synthetic_180k_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pythia-2.8b_synthetic_180k_512_156793": { "avg_fake_feature_ratio": 0.00634765625, "avg_fake_feature_count1": 97.5, "fake_feature_count1": 96, "fake_feature_count2": 99, "fake_feature_ratio1": 0.00625, "fake_feature_ratio2": 0.0064453125, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pile-uncopyrighted_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pile-uncopyrighted_512_156793": { "avg_fake_feature_ratio": 0.006087239583333333, "avg_fake_feature_count1": 93.5, "fake_feature_count1": 94, "fake_feature_count2": 93, "fake_feature_ratio1": 0.006119791666666667, "fake_feature_ratio2": 0.0060546875, "d_sae1": 15360, "d_sae2": 15360 } }
pythia1
{ "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_faithful-pythia1.4b_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_faithful-pythia1.4b_512_140185": { "avg_fake_feature_ratio": 0.07913643973214285, "avg_fake_feature_count1": 1134.5, "fake_feature_count1": 1059, "fake_feature_count2": 1210, "fake_feature_ratio1": 0.07386997767857142, "fake_feature_ratio2": 0.08440290178571429, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_140185": { "avg_fake_feature_ratio": 0.07662527901785715, "avg_fake_feature_count1": 1098.5, "fake_feature_count1": 1101, "fake_feature_count2": 1096, "fake_feature_ratio1": 0.07679966517857142, "fake_feature_ratio2": 0.07645089285714286, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_FLAN_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_FLAN_512_140185": { "avg_fake_feature_ratio": 0.07373046875, "avg_fake_feature_count1": 1057.0, "fake_feature_count1": 1057, "fake_feature_count2": 1057, "fake_feature_ratio1": 0.07373046875, "fake_feature_ratio2": 0.07373046875, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pythia-2.8b_synthetic_180k_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pythia-2.8b_synthetic_180k_512_140185": { "avg_fake_feature_ratio": 0.08269391741071427, "avg_fake_feature_count1": 1185.5, "fake_feature_count1": 1172, "fake_feature_count2": 1199, "fake_feature_ratio1": 0.08175223214285714, "fake_feature_ratio2": 0.08363560267857142, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_140185": { "avg_fake_feature_ratio": 0.07847377232142858, 
"avg_fake_feature_count1": 1125.0, "fake_feature_count1": 1139, "fake_feature_count2": 1111, "fake_feature_ratio1": 0.07945033482142858, "fake_feature_ratio2": 0.07749720982142858, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_merged_uncensored_alpaca_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_merged_uncensored_alpaca_512_140185": { "avg_fake_feature_ratio": 0.08231026785714285, "avg_fake_feature_count1": 1180.0, "fake_feature_count1": 1086, "fake_feature_count2": 1274, "fake_feature_ratio1": 0.07575334821428571, "fake_feature_ratio2": 0.0888671875, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_open-instruct-uncensored-alpaca_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_open-instruct-uncensored-alpaca_512_140185": { "avg_fake_feature_ratio": 0.07718331473214285, "avg_fake_feature_count1": 1106.5, "fake_feature_count1": 1148, "fake_feature_count2": 1065, "fake_feature_ratio1": 0.080078125, "fake_feature_ratio2": 0.07428850446428571, "d_sae1": 14336, "d_sae2": 14336 } }
pythia2
{ "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_faithful-pythia1.4b_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_faithful-pythia1.4b_512_156793": { "avg_fake_feature_ratio": 0.005859375, "avg_fake_feature_count1": 90.0, "fake_feature_count1": 90, "fake_feature_count2": 90, "fake_feature_ratio1": 0.005859375, "fake_feature_ratio2": 0.005859375, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_fineweb_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_fineweb_512_156793": { "avg_fake_feature_ratio": 0.005566406249999999, "avg_fake_feature_count1": 85.5, "fake_feature_count1": 87, "fake_feature_count2": 84, "fake_feature_ratio1": 0.0056640625, "fake_feature_ratio2": 0.00546875, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_FLAN_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_FLAN_512_156793": { "avg_fake_feature_ratio": 0.0052734375, "avg_fake_feature_count1": 81.0, "fake_feature_count1": 80, "fake_feature_count2": 82, "fake_feature_ratio1": 0.005208333333333333, "fake_feature_ratio2": 0.005338541666666667, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pythia-2.8b_synthetic_180k_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pythia-2.8b_synthetic_180k_512_156793": { "avg_fake_feature_ratio": 0.00634765625, "avg_fake_feature_count1": 97.5, "fake_feature_count1": 96, "fake_feature_count2": 99, "fake_feature_ratio1": 0.00625, "fake_feature_ratio2": 0.0064453125, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pile-uncopyrighted_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pile-uncopyrighted_512_156793": { "avg_fake_feature_ratio": 0.006087239583333333, "avg_fake_feature_count1": 93.5, "fake_feature_count1": 94, "fake_feature_count2": 
93, "fake_feature_ratio1": 0.006119791666666667, "fake_feature_ratio2": 0.0060546875, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_merged_uncensored_alpaca_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_merged_uncensored_alpaca_512_156793": { "avg_fake_feature_ratio": 0.005696614583333334, "avg_fake_feature_count1": 87.5, "fake_feature_count1": 86, "fake_feature_count2": 89, "fake_feature_ratio1": 0.005598958333333333, "fake_feature_ratio2": 0.0057942708333333336, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_open-instruct-uncensored-alpaca_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_open-instruct-uncensored-alpaca_512_156793": { "avg_fake_feature_ratio": 0.005729166666666667, "avg_fake_feature_count1": 88.0, "fake_feature_count1": 90, "fake_feature_count2": 86, "fake_feature_ratio1": 0.005859375, "fake_feature_ratio2": 0.005598958333333333, "d_sae1": 15360, "d_sae2": 15360 } }
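The derived fields in the fake-feature dumps above are consistent with simple arithmetic on the two per-seed counts: each ratio is the count divided by the dictionary size `d_sae`, and the `avg_*` fields are the mean over the two seeds. A minimal sketch of that arithmetic follows. The function name `fake_feature_summary` is hypothetical, and how a feature gets flagged as fake in the first place is not covered here.

```python
def fake_feature_summary(count1, count2, d_sae1, d_sae2):
    """Recompute the derived fields from the two per-seed fake-feature counts."""
    ratio1 = count1 / d_sae1
    ratio2 = count2 / d_sae2
    return {
        "fake_feature_ratio1": ratio1,
        "fake_feature_ratio2": ratio2,
        "avg_fake_feature_ratio": (ratio1 + ratio2) / 2,
        # Field name kept as it appears in the dumps; the value is the mean count.
        "avg_fake_feature_count1": (count1 + count2) / 2,
    }


# Counts from the last pythia-2.8b entry above (open-instruct-uncensored-alpaca).
summary_fake = fake_feature_summary(90, 86, 15360, 15360)
print(summary_fake)
```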
- gpt2
- llama1
- pythia 1.4
- llama1
- pythia2.8
- llama3
- llama8
below 4~5 percent
https://chatgpt.com/c/67e1f87e-8740-8007-a3a6-85d14f8e2c16

Seonglae Cho