Faithful SAE Project

Creator: Seonglae Cho
Created: 2025 Mar 6 0:49
Edited: 2025 Jul 11 1:14

There has been almost no research on which dataset should be used to train SAEs.

  • More importantly, feature inspection is about the model, and the main cause is that there are many ways to combine feature bases to explain an LLM representation.
You only need to consider registering if you are interested.
  • fake feature
  • downstream - recon acc
An SAE simply finds some linear combination; it is not mathematically interpretable. Because it searches among many possible bases, the result is inconsistent. Features may be flexible if no optimized hyperparameters can be found that recover the true feature set. Correlation among features can make coefficients uninterpretable, and L1 regularization may pick an arbitrary feature from a correlated group. Explaining the model ≠ explaining the data: model inspection only provides information about the model, and the model might not accurately reflect the data. That means "interpretability" is unreliable. In conclusion, this could be used to criticize SAEs, or rather, language itself does not have a fixed feature set and can be expressed through many combinations.
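The point about L1 regularization and correlated features can be illustrated with a toy case. This is a minimal sketch, assuming an ordinary lasso objective (half mean squared error plus an L1 penalty) rather than the SAE training loss itself; the function name `lasso_objective` and all numbers are hypothetical. With two perfectly correlated features, every way of splitting the weight between them gives the same objective, so the objective alone cannot say which feature is the "true" one.

```python
import numpy as np

def lasso_objective(X, y, w, lam):
    """0.5 * mean squared error + lam * L1 norm of the weights."""
    residual = y - X @ w
    return 0.5 * np.mean(residual ** 2) + lam * np.abs(w).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.stack([x, x], axis=1)   # two perfectly correlated (identical) features
y = 2.0 * x

# Three weight vectors with the same total weight on the correlated group.
candidates = {
    "only_first": [1.5, 0.0],
    "only_second": [0.0, 1.5],
    "split": [0.75, 0.75],
}
objectives = {
    name: lasso_objective(X, y, np.array(w), lam=0.1)
    for name, w in candidates.items()
}
for name, value in objectives.items():
    print(f"{name}: {value:.6f}")
```

All three candidates produce the same prediction and the same L1 norm, so which one an optimizer lands on depends on initialization and noise rather than on the data.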
The sparsity loss suppresses features at random → this is the main reason the feature matching ratio is sensitive to seed differences. We mistakenly attributed this to the dataset's complexity. Explaining the model ≠ explaining the dataset: an SAE is sensitive to its training dataset, but that does not necessarily mean the SAE reflects the dataset. This is the biggest wrong assumption we have made. More importantly, feature inspection is about the model, and the main cause is that there are many ways to combine feature bases to explain an LLM representation. SAEs have very low reproducibility.
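One way to make "feature matching ratio" concrete is sketched below. This is an assumption about the metric, not the project's actual implementation: a feature from one SAE counts as matched if its decoder direction has cosine similarity above a threshold with some decoder direction of the other SAE. The function name `feature_matching_ratio`, the threshold 0.7, and the random matrices standing in for decoders are all hypothetical.

```python
import numpy as np

def feature_matching_ratio(dec_a, dec_b, threshold=0.7):
    """Fraction of rows in dec_a whose best cosine match in dec_b is >= threshold.

    dec_a, dec_b: arrays of shape (n_features, d_model), one decoder direction per row.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    cos = a @ b.T                      # pairwise cosine similarities
    best = cos.max(axis=1)             # best match for each feature of A
    return float((best >= threshold).mean())

rng = np.random.default_rng(0)
dec_seed_a = rng.normal(size=(64, 64))
dec_same_basis = dec_seed_a[rng.permutation(64)]   # same directions, shuffled order
dec_other_basis = rng.normal(size=(64, 64))        # unrelated random directions

ratio_same = feature_matching_ratio(dec_seed_a, dec_same_basis)
ratio_other = feature_matching_ratio(dec_seed_a, dec_other_basis)
print(ratio_same, ratio_other)
```

The ratio is invariant to feature ordering, so a low value between two seeds points to genuinely different bases rather than a permutation of the same one.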

Key Challenges

  • Show that the learned features are sufficiently interpretable and useful, e.g. via an LLM explainer
    • We also need to show that the synthetic dataset is diverse enough to cover the model's capability, and that the learned features are sufficiently interpretable and useful
  • fake feature
  • We must rebut the objection "Isn't it robust to seed simply because the dataset diversity is narrow?"

Rules

  • We should probably proceed with bfloat

Solutions

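CE difference is the most important metric for claiming a "faithful" SAE.
As in the toy model paper, could we fix a mathematically efficient allocation of feature directions (i.e. freeze the decoder) and then compare by checking interpretability only?
Assuming that some features are more meaningful than others, examine further what fraction they make up and whether using only those features is more meaningful.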
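The CE difference mentioned above can be sketched as follows, under the assumption that it means the increase in next-token cross-entropy when the SAE reconstruction is spliced into the model in place of the original activation. The functions `cross_entropy` and `ce_difference` and the toy logits are hypothetical; obtaining the spliced logits from a real model is outside this sketch.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy. logits: (n_tokens, vocab), targets: (n_tokens,) int ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def ce_difference(clean_logits, spliced_logits, targets):
    """CE with SAE reconstruction spliced in, minus CE of the unmodified model."""
    return cross_entropy(spliced_logits, targets) - cross_entropy(clean_logits, targets)

# Toy logits: the clean model prefers the correct token, the spliced one is uniform.
targets = np.array([0, 1])
clean_logits = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
spliced_logits = np.zeros((2, 3))

delta = ce_difference(clean_logits, spliced_logits, targets)
print(f"CE difference: {delta:.4f}")
```

A value near zero would mean the reconstruction preserves the model's predictive behavior; larger values mean more of it is lost.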

Report this week

  • Sensitivity to hyperparameters → we should learn this from training
  • The scaled model is under linear, which means it is still hard to converge and takes time to get activations
  • Diversity really matters, but the good signal is that top-p and temperature somehow worked
  • But this is still an early stage of training: 1e-7 / 1e-8 and 1-9 (GPT-2, 4e8 is convergence)
  • Dead neuron convergence matters more

Experiments

  • model size
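  • dataset - combination of pretraining and synthetic data
  • Take a TopK SAE trained on synthetic data, reduce top-k, and compare
    • We expect that an existing SAE will still work well on the synthetic dataset even with a reduced top-k
      • However, since features are sparsely activated online, most of the ones actually used are likely to be meaningful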
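Reducing top-k at inference time, as proposed in the experiment above, amounts to keeping fewer of the largest pre-activations per token. The following is a minimal sketch in NumPy; `topk_activation` is a hypothetical helper, and it omits details that a real TopK SAE may include (for example a ReLU or bias terms).

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest entries in each row and zero out the rest.

    pre_acts: array of shape (n_tokens, n_features).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= pre_acts.shape[1]:
        return pre_acts.copy()
    # Indices of the k largest entries per row (unordered within the top k).
    idx = np.argpartition(pre_acts, -k, axis=1)[:, -k:]
    out = np.zeros_like(pre_acts)
    np.put_along_axis(out, idx, np.take_along_axis(pre_acts, idx, axis=1), axis=1)
    return out

pre_acts = np.array([[0.1, 2.0, 0.5, 1.5]])
acts_k2 = topk_activation(pre_acts, 2)   # keeps 2.0 and 1.5
acts_k1 = topk_activation(pre_acts, 1)   # keeps only 2.0
print(acts_k2, acts_k1)
```

Barring ties, the features kept at a smaller k are a subset of those kept at a larger k, so the comparison isolates the effect of dropping the weakest active features.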

Linear representation theory provides the grounding for this.

Since there are no fake features, there is less non-linearity, and we expect the Eleuther embedding to be better aligned and therefore to score higher.
Write and share readme

Exp 8

Architecture agnostic

Exp 9

Using the same architecture (Pythia)
Every Pythia model is trained on the Pile
  • 12b - synthetic dataset - diverse superset
  • 6b - synthetic dataset - …
  • 2b - synthetic dataset - …
  • 1b - synthetic dataset
Assumption: a bigger model encloses a smaller model's capability
  • We expect that if we train SAEs on a bigger model's synthetic data, seed robustness will be lower
  • Vice versa
  • Detail: we need consistent dictionary size or feature level
 

Models

  • GPT2
  • Llama 3.1, 3.2
  • Pythia
  • Gemma 2
  • (Mistral)
  • k16 - 768, 768 * 16
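If it weren't for Paul, the atmosphere would have been really good; it is a shame to lose all the good teammates. Anything anyone says sets him off, which creates a communication culture where nobody can speak freely, and the sense of persecution is severe.
Because things are taken so sensitively, even I cannot speak freely online, so I end up being rougher offline and cannot keep the mood good. I have already come to dislike it, so it cannot be helped. In Western Europe and the English-speaking world there is a sense that you are not supposed to hate anything, but I simply dislike it.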

April 23rd

Mention first in the abstract that SAEs are the most scalable unsupervised way of training interpretable features
Implementation: dataset, two differences, SAE choice, and dataset section
8 pages except limit
 
 

Faithful SAE downstream Proving

Faithful SAE Fake Feature

  • Take the top correlated features one at a time
Run everything at the same time as feat_match

Experiment design: comparing the number of Fake Features using random noise

  1. Data preparation
      • In-distribution samples: draw 1,000 random sentences from domains the model was trained on (e.g. Wikipedia, WebText)
      • OOD noise samples: generate 1,000 "random token sequence" sentences that the model has never seen
  2. Feature extraction
      • For both the Standard SAE and the Faithful SAE, take the token-level SAE activation $f_t \in \mathbb{R}^m$ for each sentence → sequence-level aggregation $F = \sum_t f_t$
      • Binarization: $F_{bin}[i] = \mathbf{1}_{F[i] > \tau}$ (e.g. $\tau = 1$)
  3. Fake Feature definition
      • Fake Feature: a feature that fires on more than $p\%$ of the OOD noise sentences (e.g. $p = 5\%$)
      • That is, $\text{FakeFeatures} = \{\, i : \frac{1}{1000} \sum_{j=1}^{1000} F_{bin}^{(noise_j)}[i] > 0.05 \,\}$
  4. Metric
      • $\#\text{FakeFeatures}_{\text{Standard}}$ vs. $\#\text{FakeFeatures}_{\text{Faithful}}$
      • Hypothesis: the Faithful SAE has far fewer Fake Features
  5. Statistical test
      • Use bootstrapping to obtain the distribution of feature counts and check the p-value between the two groups
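The five steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the project's pipeline: the SAE activations are taken as given arrays (running the model and the SAE is out of scope), all function names are hypothetical, the bootstrap resamples noise sentences with replacement, and the toy arrays at the bottom are made-up stand-ins, not experimental results.

```python
import numpy as np

def random_token_sequences(n_sentences, seq_len, vocab_size, seed=0):
    """OOD noise: token ids drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, vocab_size, size=(n_sentences, seq_len))

def binarize_sequences(token_acts, tau=1.0):
    """token_acts: (n_sentences, n_tokens, m) token-level SAE activations f_t."""
    seq_acts = token_acts.sum(axis=1)      # F = sum_t f_t
    return seq_acts > tau                  # F_bin[i] = 1[F[i] > tau]

def fake_feature_mask(noise_bin, p=0.05):
    """Features firing on more than a fraction p of the noise sentences."""
    return noise_bin.mean(axis=0) > p

def bootstrap_count_diff(bin_standard, bin_faithful, p=0.05, n_boot=1000, seed=0):
    """Fake-feature count difference (standard - faithful) per bootstrap resample."""
    rng = np.random.default_rng(seed)
    n = bin_standard.shape[0]
    diffs = np.empty(n_boot, dtype=np.int64)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample sentences with replacement
        diffs[b] = (fake_feature_mask(bin_standard[idx], p).sum()
                    - fake_feature_mask(bin_faithful[idx], p).sum())
    return diffs

# Toy activations standing in for real SAE outputs (NOT experimental results).
n_sentences, n_tokens, m = 40, 2, 3
acts_standard = np.zeros((n_sentences, n_tokens, m))
acts_standard[:, :, 0] = 1.0     # feature 0 fires on every noise sentence
acts_standard[0, :, 1] = 1.0     # feature 1 fires on one sentence only (2.5%)
acts_faithful = np.zeros((n_sentences, n_tokens, m))

bin_standard = binarize_sequences(acts_standard)
bin_faithful = binarize_sequences(acts_faithful)
n_fake_standard = int(fake_feature_mask(bin_standard).sum())
n_fake_faithful = int(fake_feature_mask(bin_faithful).sum())

diffs = bootstrap_count_diff(bin_standard, bin_faithful, n_boot=200)
p_value = float((diffs <= 0).mean())   # one-sided: share of resamples with no gap
noise_tokens = random_token_sequences(4, 8, vocab_size=50257)
print(n_fake_standard, n_fake_faithful, p_value)
```

The one-sided bootstrap fraction used here is only one possible reading of "check the p-value between the two groups"; a permutation test over SAE labels would be a reasonable alternative.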

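Why is this a "Fake Feature" experiment?

  • Random noise is input that the model can never interpret "reasonably"
  • Features that nevertheless fire frequently in the SAE are "phantom (fantasy) features", i.e. they are not real internal concepts of the model
  • Because the Faithful SAE is trained only on self-generated data, without relying on OOD data, it should have fewer such Fake Features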
       
       
Faithful SAE Downstream Provings

Model    Training  F1     Classification
GPT2     Faithful  77.29  SST2
GPT2     Faithful  74.78  CoLA
GPT2     Faithful  84.51  AGNews
GPT2     Faithful  85.66  Yelp
GPT2     Fine      83.95  SST2
GPT2     Fine      72.91  CoLA
GPT2     Fine      85.57  AGNews
GPT2     Fine      94.7   Yelp
GPT2     Pile      68.29  SST2
GPT2     Pile      41.15  CoLA
GPT2     Pile      77.08  AGNews
GPT2     Pile      80.99  Yelp
Llama1   Faithful  66.5   SST2
Llama1   Faithful  46.32  CoLA
Llama1   Faithful  78.87  AGNews
Llama1   Faithful  87.05  Yelp
Llama1   Fine      63.96  SST2
Llama1   Fine      47.28  CoLA
Llama1   Fine      79.21  AGNews
Llama1   Fine      87.84  Yelp
Llama1   Pile      65.14  SST2
Llama1   Pile      47.01  CoLA
Llama1   Pile      78.93  AGNews
Llama1   Pile      87.37  Yelp
Pythia1  -         -      Yelp
Gemma2   -         77.29  AGNews
Gemma2   -         74.78  AGNews
Gemma2   -         -      AGNews  (10 unfilled rows)
Pythia2  -         -      -
Llama3   -         -      -
Llama8   -         -      -
Yelp looks odd for GPT and Pythia.
       
       

      Legacy results

      gpt2
      { "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_faithful-gpt2-small_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_faithful-gpt2-small_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6381880733944953, "recon_acc": 0.5819954128440368, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6318821315090541, "recon_f1": 0.5054111627039339, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6936720997123682, "recon_acc": 0.6845637583892618, "baseline_f1": 0.6088007739109902, "sae_f1": 0.42136475138804325, "recon_f1": 0.5230834754391598, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.7680263157894738, "recon_acc": 0.750328947368421, "baseline_f1": 0.858948305546186, "sae_f1": 0.7686617676668895, "recon_f1": 0.7505566472182452, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8151184210526317, "recon_acc": 0.7999736842105263, "baseline_f1": 0.8872207883759649, "sae_f1": 0.8149999994807476, "recon_f1": 0.7992663404596958, } }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_fineweb_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_fineweb_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6313073394495412, "recon_acc": 0.5768348623853211, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6190008478699911, "recon_f1": 0.49530534745128274, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6912751677852349, "recon_acc": 0.6812080536912752, "baseline_f1": 0.6088007739109902, "sae_f1": 0.4161159819399749, "recon_f1": 0.5238862923900236, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.7672368421052631, "recon_acc": 0.7452631578947368, "baseline_f1": 0.858948305546186, "sae_f1": 0.7677191066746017, "recon_f1": 0.7456838796788295, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8185, "recon_acc": 0.8016184210526316, 
"baseline_f1": 0.8872207883759649, "sae_f1": 0.8184214690963898, "recon_f1": 0.8004130165513998, } }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_pile-uncopyrighted_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_pile-uncopyrighted_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6869266055045872, "recon_acc": 0.6009174311926606, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6829182968443854, "recon_f1": 0.5417190294774679, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6907957813998082, "recon_acc": 0.6845637583892618, "baseline_f1": 0.6088007739109902, "sae_f1": 0.4115262391100577, "recon_f1": 0.5363058957046498, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.77, "recon_acc": 0.7582894736842105, "baseline_f1": 0.858948305546186, "sae_f1": 0.7708296359063063, "recon_f1": 0.7590131643202838, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8099736842105263, "recon_acc": 0.7971578947368421, "baseline_f1": 0.8872207883759649, "sae_f1": 0.8098772521989184, "recon_f1": 0.7963460956997179, } } }
      gemma 2b
      { "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_faithful-gemma2-2b_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_faithful-gemma2-2b_1024_9764": { "stanfordnlp/sst2": { "baseline_acc": 0.7012614678899083, "sae_acc": 0.6639908256880733, "recon_acc": 0.6548165137614679, "baseline_f1": 0.6932816931945159, "sae_f1": 0.6555364833705732, "recon_f1": 0.6463415491665134, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7023010546500479, "sae_acc": 0.7018216682646212, "recon_acc": 0.6879194630872483, "baseline_f1": 0.5791130763716196, "sae_f1": 0.5426634423577605, "recon_f1": 0.5362575660160247, }, "ag_news": { "baseline_acc": 0.7902631578947368, "sae_acc": 0.7454605263157894, "recon_acc": 0.7459868421052631, "baseline_f1": 0.7930319932458574, "sae_f1": 0.7471748214266285, "recon_f1": 0.7490910079276412, }, "yelp_polarity": { "baseline_acc": 0.6705, "sae_acc": 0.6401315789473685, "recon_acc": 0.6371842105263158, "baseline_f1": 0.6433888923712144, "sae_f1": 0.6092801899430627, "recon_f1": 0.6062209211703413, } }, "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_fineweb_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_fineweb_1024_9764": { "stanfordnlp/sst2": { "baseline_acc": 0.7012614678899083, "sae_acc": 0.6754587155963303, "recon_acc": 0.6559633027522935, "baseline_f1": 0.6932816931945159, "sae_f1": 0.6689383392077671, "recon_f1": 0.6451221139029649, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7023010546500479, "sae_acc": 0.6936720997123682, "recon_acc": 0.6821668264621285, "baseline_f1": 0.5791130763716196, "sae_f1": 0.5553361529543475, "recon_f1": 0.5383967336884994, }, "ag_news": { "baseline_acc": 0.7902631578947368, "sae_acc": 0.7619736842105262, "recon_acc": 0.7624342105263158, "baseline_f1": 0.7930319932458574, "sae_f1": 0.7646235556915335, "recon_f1": 0.7659606745601961, } } }
      llama 3b
      { "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_faithful-llama3.2-3b_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_faithful-llama3.2-3b_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7763761467889908, "recon_acc": 0.8405963302752294, "baseline_f1": 0.9104974881110428, "sae_f1": 0.7756670460950286, "recon_f1": 0.8405587989220263, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7425695110258869, "recon_acc": 0.7708533077660594, "baseline_f1": 0.7152162539211186, "sae_f1": 0.6014255571865801, "recon_f1": 0.6896693761389634, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.8459210526315789, "recon_acc": 0.8769736842105263, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8452459442447388, "recon_f1": 0.8769955683104758, } }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_fineweb_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_fineweb_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7620412844036697, "recon_acc": 0.8342889908256881, "baseline_f1": 0.9104974881110428, "sae_f1": 0.7602430376398335, "recon_f1": 0.8340347190661116, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7387344199424737, "recon_acc": 0.7641418983700863, "baseline_f1": 0.7152162539211186, "sae_f1": 0.5946387101656074, "recon_f1": 0.6826465916857021, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.8479605263157894, "recon_acc": 0.8742105263157894, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8475250080410112, "recon_f1": 0.8741664410440275, } }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_pile-uncopyrighted_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_pile-uncopyrighted_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7729357798165137, "recon_acc": 0.8297018348623852, 
"baseline_f1": 0.9104974881110428, "sae_f1": 0.7720794006212012, "recon_f1": 0.8295297429409766, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7416107382550337, "recon_acc": 0.7646212847555129, "baseline_f1": 0.7152162539211186, "sae_f1": 0.6028704698560002, "recon_f1": 0.6822661503487519, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.84875, "recon_acc": 0.8773026315789474, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8482719220887277, "recon_f1": 0.8772401349425806, } } }
      llama8b
      pythia 1.4
      { "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_faithful-pythia1.4b_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_faithful-pythia1.4b_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.625, "recon_acc": 0.6628440366972477, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5982676358780978, "recon_f1": 0.6569087986667486, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7176414189837008, "recon_acc": 0.7205177372962608, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5682400889034706, "recon_f1": 0.6250376214968614, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8449342105263158, "recon_acc": 0.8734868421052632, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8447155328402953, "recon_f1": 0.8735220331382518, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.5837155963302751, "recon_acc": 0.6399082568807339, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5252706925463307, "recon_f1": 0.6127023738032431, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7099712368168745, "recon_acc": 0.7157238734419942, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5558493449222908, "recon_f1": 0.6189774843153626, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8411184210526316, "recon_acc": 0.8690789473684211, "baseline_f1": 0.9067620442309687, "sae_f1": 0.840753324361916, "recon_f1": 0.869264166479926, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.6083715596330275, "recon_acc": 0.6634174311926606, "baseline_f1": 
0.8263198150728646, "sae_f1": 0.5637099029658558, "recon_f1": 0.6408800808584558, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7142857142857143, "recon_acc": 0.713326941514861, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5691509178971268, "recon_f1": 0.630212930327982, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8497368421052631, "recon_acc": 0.868421052631579, "baseline_f1": 0.9067620442309687, "sae_f1": 0.849209278774073, "recon_f1": 0.8685165182350671, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_FLAN_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_FLAN_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.591743119266055, "recon_acc": 0.6347477064220184, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5330501889425321, "recon_f1": 0.5987975177361987, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7037392138063279, "recon_acc": 0.7056567593480345, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5364699942307098, "recon_f1": 0.5907758660516629, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8282236842105264, "recon_acc": 0.855, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8277841057173874, "recon_f1": 0.8551377247582921, }, "yelp_polarity": { "baseline_acc": 0.9378552631578947, "sae_acc": 0.8199342105263158, "recon_acc": 0.8684342105263159, "baseline_f1": 0.9378443435018236, "sae_f1": 0.8167165640358525, "recon_f1": 0.8682904464532779, } }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_merged_uncensored_alpaca_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_merged_uncensored_alpaca_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.5974770642201834, "recon_acc": 0.6198394495412844, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5383769226221773, "recon_f1": 0.580481774864037, }, "nyu-mll/glue/cola": { "baseline_acc": 
0.74784276126558, "sae_acc": 0.7070949185043145, "recon_acc": 0.7214765100671141, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5463782289363686, "recon_f1": 0.6233436897375952, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.835, "recon_acc": 0.867171052631579, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8343879746772225, "recon_f1": 0.8673697344263738, }, "yelp_polarity": { "baseline_acc": 0.9378552631578947, "sae_acc": 0.8594868421052632, "recon_acc": 0.8900921052631579, "baseline_f1": 0.9378443435018236, "sae_f1": 0.8584010102007948, "recon_f1": 0.8900115172343989, } }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_open-instruct-uncensored-alpaca_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_open-instruct-uncensored-alpaca_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.6037844036697249, "recon_acc": 0.6376146788990826, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5554244093038755, "recon_f1": 0.611534593194968, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7080536912751678, "recon_acc": 0.7166826462128475, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5517751045102514, "recon_f1": 0.6279544326917945, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8393421052631579, "recon_acc": 0.8647368421052631, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8386828043318748, "recon_f1": 0.8650205433625378, }, "yelp_polarity": { "baseline_acc": 0.9378552631578947, "sae_acc": 0.8686447368421053, "recon_acc": 0.8956447368421052, "baseline_f1": 0.9378443435018236, "sae_f1": 0.8677390498989366, "recon_f1": 0.8955579157849463, } } }
      llama 1b
      { "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_faithful-llama3.2-1b_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_faithful-llama3.2-1b_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.6771788990825688, "recon_acc": 0.7069954128440367, "baseline_f1": 0.8001009384586332, "sae_f1": 0.6604980539245782, "recon_f1": 0.7044136847971996, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7056567593480345, "recon_acc": 0.7377756471716204, "baseline_f1": 0.6381572073006643, "sae_f1": 0.46315353559740324, "recon_f1": 0.6049876680314521, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7901315789473684, "recon_acc": 0.8191447368421052, "baseline_f1": 0.860867910090608, "sae_f1": 0.7887296126660821, "recon_f1": 0.8189521248408049, } }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.661697247706422, "recon_acc": 0.7247706422018348, "baseline_f1": 0.8001009384586332, "sae_f1": 0.6395535445833574, "recon_f1": 0.7227740144313253, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7075743048897412, "recon_acc": 0.736816874400767, "baseline_f1": 0.6381572073006643, "sae_f1": 0.4728249632110051, "recon_f1": 0.6085546284839893, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7933552631578947, "recon_acc": 0.8174342105263158, "baseline_f1": 0.860867910090608, "sae_f1": 0.792079768259315, "recon_f1": 0.8172451481958556, }, }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.6680045871559632, "recon_acc": 0.7247706422018348, 
"baseline_f1": 0.8001009384586332, "sae_f1": 0.6514118480752982, "recon_f1": 0.7235045413291168, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7070949185043145, "recon_acc": 0.7339405560882071, "baseline_f1": 0.6381572073006643, "sae_f1": 0.4701210319231057, "recon_f1": 0.5992029535466976, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7904605263157896, "recon_acc": 0.8163157894736842, "baseline_f1": 0.860867910090608, "sae_f1": 0.789293579698522, "recon_f1": 0.8164310093089504, } } }
      pythia 2.8
      { "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_faithful-pythia1.4b_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_faithful-pythia1.4b_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6485091743119267, "recon_acc": 0.6674311926605505, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6394634076507092, "recon_f1": 0.6507309664018397, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5023969319271333, "recon_acc": 0.62464046021093, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3266744710043679, "recon_f1": 0.45134193393801403, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5873026315789474, "recon_acc": 0.594078947368421, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5863143504752567, "recon_f1": 0.5964254104882054, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8822105263157896, "recon_acc": 0.8896578947368421, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8821992852293431, "recon_f1": 0.8894904792657169, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_fineweb_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_fineweb_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6353211009174311, "recon_acc": 0.7138761467889909, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6303691839253754, "recon_f1": 0.7061184907387809, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.6212847555129435, "baseline_f1": 0.3725260677388479, "sae_f1": 0.32386999539827843, "recon_f1": 0.4578079209216317, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.6030921052631579, "recon_acc": 0.5969736842105263, "baseline_f1": 0.8468502569930477, "sae_f1": 0.6009567444105985, "recon_f1": 0.6001021008319489, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8942368421052631, "recon_acc": 0.9004736842105263, 
"baseline_f1": 0.9379692987935441, "sae_f1": 0.894207685428926, "recon_f1": 0.9004008374365153, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_FLAN_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_FLAN_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.5802752293577982, "recon_acc": 0.6496559633027523, "baseline_f1": 0.8818021917383296, "sae_f1": 0.5662361526556061, "recon_f1": 0.638586294582631, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.5517737296260786, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3243620390042899, "recon_f1": 0.4103185016814908, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5373684210526316, "recon_acc": 0.5425, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5356346866398047, "recon_f1": 0.5498615517659742, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8440263157894736, "recon_acc": 0.8671842105263159, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8439853359439617, "recon_f1": 0.8671279564955111, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pythia-2.8b_synthetic_180k_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pythia-2.8b_synthetic_180k_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6559633027522935, "recon_acc": 0.6892201834862386, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6487296756412979, "recon_f1": 0.6835865602014071, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.5800575263662512, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3248497386693457, "recon_f1": 0.43478813762656554, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5732894736842105, "recon_acc": 0.59875, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5731572109943672, "recon_f1": 0.6027152942034848, }, "yelp_polarity": { "baseline_acc": 
0.938, "sae_acc": 0.8780263157894737, "recon_acc": 0.8898421052631579, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8778787915295191, "recon_f1": 0.8897811277723608, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pile-uncopyrighted_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pile-uncopyrighted_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.669151376146789, "recon_acc": 0.661697247706422, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6633291361786984, "recon_f1": 0.6430236708749482, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5023969319271333, "recon_acc": 0.6160115052732502, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3271537002080098, "recon_f1": 0.4583511769225481, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.6008552631578947, "recon_acc": 0.6094078947368421, "baseline_f1": 0.8468502569930477, "sae_f1": 0.6005771593515399, "recon_f1": 0.6079249032315457, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8875131578947368, "recon_acc": 0.892421052631579, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8874637731864572, "recon_f1": 0.8923647758257278, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_merged_uncensored_alpaca_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_merged_uncensored_alpaca_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6227064220183487, "recon_acc": 0.6525229357798166, "baseline_f1": 0.8818021917383296, "sae_f1": 0.5909858500786827, "recon_f1": 0.6211549979390429, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5019175455417066, "recon_acc": 0.5407478427612655, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3259054141650223, "recon_f1": 0.40014348262347194, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5878289473684211, "recon_acc": 0.5875657894736842, "baseline_f1": 
0.8468502569930477, "sae_f1": 0.59091706439711, "recon_f1": 0.5927898475190146, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8920921052631579, "recon_acc": 0.8955394736842105, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8920743805631737, "recon_f1": 0.8954434496310095, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_open-instruct-uncensored-alpaca_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_open-instruct-uncensored-alpaca_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6668577981651376, "recon_acc": 0.6811926605504588, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6587972733855976, "recon_f1": 0.6715270435207281, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5004793863854267, "recon_acc": 0.552732502396932, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3250463740674675, "recon_f1": 0.4103940652795335, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5691447368421052, "recon_acc": 0.5763815789473684, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5720069778443257, "recon_f1": 0.5804061236468246, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8923026315789474, "recon_acc": 0.8966315789473684, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8923009591141421, "recon_f1": 0.8965514180464571, } } }
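The dumps in this section contain trailing commas, so strict JSON parsers reject them, while Python's `ast.literal_eval` accepts them as dict literals. The sketch below parses a dump and reports how much of the baseline F1 is retained by the SAE and reconstruction probes. The helper names `is_strict_json` and `f1_retention` and the shortened run key `run_a` are hypothetical; the three F1 values are copied from the first GPT-2 SST-2 entry above.

```python
import ast
import json

def is_strict_json(text):
    """True if the text parses as strict JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def f1_retention(dump_text):
    """Return {run: {dataset: {"sae": sae_f1/baseline_f1, "recon": recon_f1/baseline_f1}}}."""
    results = ast.literal_eval(dump_text)   # tolerates trailing commas
    table = {}
    for run_name, datasets in results.items():
        table[run_name] = {
            dataset: {
                "sae": scores["sae_f1"] / scores["baseline_f1"],
                "recon": scores["recon_f1"] / scores["baseline_f1"],
            }
            for dataset, scores in datasets.items()
        }
    return table

# "run_a" is a placeholder key; the values come from the first gpt2 SST-2 entry.
sample_dump = '''{ "run_a": { "stanfordnlp/sst2": {
    "baseline_f1": 0.7436603001327117,
    "sae_f1": 0.6318821315090541,
    "recon_f1": 0.5054111627039339, }, } }'''

retention = f1_retention(sample_dump)
print(is_strict_json(sample_dump), retention)
```

Reporting ratios against the baseline makes runs on different models comparable even though their baseline F1 scores differ.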
       
       
       
      gpt2
      { "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_faithful-gpt2-small_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_faithful-gpt2-small_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6381880733944953, "recon_acc": 0.5819954128440368, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6318821315090541, "recon_f1": 0.5054111627039339, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6936720997123682, "recon_acc": 0.6845637583892618, "baseline_f1": 0.6088007739109902, "sae_f1": 0.42136475138804325, "recon_f1": 0.5230834754391598, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.7680263157894738, "recon_acc": 0.750328947368421, "baseline_f1": 0.858948305546186, "sae_f1": 0.7686617676668895, "recon_f1": 0.7505566472182452, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8151184210526317, "recon_acc": 0.7999736842105263, "baseline_f1": 0.8872207883759649, "sae_f1": 0.8149999994807476, "recon_f1": 0.7992663404596958, } }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_fineweb_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_fineweb_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6313073394495412, "recon_acc": 0.5768348623853211, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6190008478699911, "recon_f1": 0.49530534745128274, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6912751677852349, "recon_acc": 0.6812080536912752, "baseline_f1": 0.6088007739109902, "sae_f1": 0.4161159819399749, "recon_f1": 0.5238862923900236, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.7672368421052631, "recon_acc": 0.7452631578947368, "baseline_f1": 0.858948305546186, "sae_f1": 0.7677191066746017, "recon_f1": 0.7456838796788295, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8185, "recon_acc": 0.8016184210526316, 
"baseline_f1": 0.8872207883759649, "sae_f1": 0.8184214690963898, "recon_f1": 0.8004130165513998, } }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_pile-uncopyrighted_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_pile-uncopyrighted_128_24413": { "stanfordnlp/sst2": { "baseline_acc": 0.7603211009174312, "sae_acc": 0.6869266055045872, "recon_acc": 0.6009174311926606, "baseline_f1": 0.7436603001327117, "sae_f1": 0.6829182968443854, "recon_f1": 0.5417190294774679, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7080536912751678, "sae_acc": 0.6907957813998082, "recon_acc": 0.6845637583892618, "baseline_f1": 0.6088007739109902, "sae_f1": 0.4115262391100577, "recon_f1": 0.5363058957046498, }, "ag_news": { "baseline_acc": 0.8586184210526315, "sae_acc": 0.77, "recon_acc": 0.7582894736842105, "baseline_f1": 0.858948305546186, "sae_f1": 0.7708296359063063, "recon_f1": 0.7590131643202838, }, "yelp_polarity": { "baseline_acc": 0.8873815789473685, "sae_acc": 0.8099736842105263, "recon_acc": 0.7971578947368421, "baseline_f1": 0.8872207883759649, "sae_f1": 0.8098772521989184, "recon_f1": 0.7963460956997179, } } }
      gemma 2b
      { "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_faithful-gemma2-2b_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_faithful-gemma2-2b_1024_9764": { "stanfordnlp/sst2": { "baseline_acc": 0.7012614678899083, "sae_acc": 0.6639908256880733, "recon_acc": 0.6548165137614679, "baseline_f1": 0.6932816931945159, "sae_f1": 0.6555364833705732, "recon_f1": 0.6463415491665134, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7023010546500479, "sae_acc": 0.7018216682646212, "recon_acc": 0.6879194630872483, "baseline_f1": 0.5791130763716196, "sae_f1": 0.5426634423577605, "recon_f1": 0.5362575660160247, }, "ag_news": { "baseline_acc": 0.7902631578947368, "sae_acc": 0.7454605263157894, "recon_acc": 0.7459868421052631, "baseline_f1": 0.7930319932458574, "sae_f1": 0.7471748214266285, "recon_f1": 0.7490910079276412, }, "yelp_polarity": { "baseline_acc": 0.6705, "sae_acc": 0.6401315789473685, "recon_acc": 0.6371842105263158, "baseline_f1": 0.6433888923712144, "sae_f1": 0.6092801899430627, "recon_f1": 0.6062209211703413, } }, "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_fineweb_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_fineweb_1024_9764": { "stanfordnlp/sst2": { "baseline_acc": 0.7012614678899083, "sae_acc": 0.6754587155963303, "recon_acc": 0.6559633027522935, "baseline_f1": 0.6932816931945159, "sae_f1": 0.6689383392077671, "recon_f1": 0.6451221139029649, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7023010546500479, "sae_acc": 0.6936720997123682, "recon_acc": 0.6821668264621285, "baseline_f1": 0.5791130763716196, "sae_f1": 0.5553361529543475, "recon_f1": 0.5383967336884994, }, "ag_news": { "baseline_acc": 0.7902631578947368, "sae_acc": 0.7619736842105262, "recon_acc": 0.7624342105263158, "baseline_f1": 0.7930319932458574, "sae_f1": 0.7646235556915335, "recon_f1": 0.7659606745601961, } } }
      llama 3b
      { "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_faithful-llama3.2-3b_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_faithful-llama3.2-3b_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7763761467889908, "recon_acc": 0.8405963302752294, "baseline_f1": 0.9104974881110428, "sae_f1": 0.7756670460950286, "recon_f1": 0.8405587989220263, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7425695110258869, "recon_acc": 0.7708533077660594, "baseline_f1": 0.7152162539211186, "sae_f1": 0.6014255571865801, "recon_f1": 0.6896693761389634, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.8459210526315789, "recon_acc": 0.8769736842105263, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8452459442447388, "recon_f1": 0.8769955683104758, } }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_fineweb_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_fineweb_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7620412844036697, "recon_acc": 0.8342889908256881, "baseline_f1": 0.9104974881110428, "sae_f1": 0.7602430376398335, "recon_f1": 0.8340347190661116, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7387344199424737, "recon_acc": 0.7641418983700863, "baseline_f1": 0.7152162539211186, "sae_f1": 0.5946387101656074, "recon_f1": 0.6826465916857021, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.8479605263157894, "recon_acc": 0.8742105263157894, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8475250080410112, "recon_f1": 0.8741664410440275, } }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_pile-uncopyrighted_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_pile-uncopyrighted_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.9105504587155964, "sae_acc": 0.7729357798165137, "recon_acc": 0.8297018348623852, 
"baseline_f1": 0.9104974881110428, "sae_f1": 0.7720794006212012, "recon_f1": 0.8295297429409766, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7833173537871525, "sae_acc": 0.7416107382550337, "recon_acc": 0.7646212847555129, "baseline_f1": 0.7152162539211186, "sae_f1": 0.6028704698560002, "recon_f1": 0.6822661503487519, }, "ag_news": { "baseline_acc": 0.9086842105263158, "sae_acc": 0.84875, "recon_acc": 0.8773026315789474, "baseline_f1": 0.9086109430640693, "sae_f1": 0.8482719220887277, "recon_f1": 0.8772401349425806, } } }
      pythia 1.4
      { "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_faithful-pythia1.4b_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_faithful-pythia1.4b_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.625, "recon_acc": 0.6628440366972477, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5982676358780978, "recon_f1": 0.6569087986667486, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7176414189837008, "recon_acc": 0.7205177372962608, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5682400889034706, "recon_f1": 0.6250376214968614, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8449342105263158, "recon_acc": 0.8734868421052632, "baseline_f1": 0.9067620442309687, "sae_f1": 0.8447155328402953, "recon_f1": 0.8735220331382518, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.5837155963302751, "recon_acc": 0.6399082568807339, "baseline_f1": 0.8263198150728646, "sae_f1": 0.5252706925463307, "recon_f1": 0.6127023738032431, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7099712368168745, "recon_acc": 0.7157238734419942, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5558493449222908, "recon_f1": 0.6189774843153626, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8411184210526316, "recon_acc": 0.8690789473684211, "baseline_f1": 0.9067620442309687, "sae_f1": 0.840753324361916, "recon_f1": 0.869264166479926, }, }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_140185": { "stanfordnlp/sst2": { "baseline_acc": 0.8279816513761468, "sae_acc": 0.6083715596330275, "recon_acc": 0.6634174311926606, "baseline_f1": 
0.8263198150728646, "sae_f1": 0.5637099029658558, "recon_f1": 0.6408800808584558, }, "nyu-mll/glue/cola": { "baseline_acc": 0.74784276126558, "sae_acc": 0.7142857142857143, "recon_acc": 0.713326941514861, "baseline_f1": 0.6584783651460901, "sae_f1": 0.5691509178971268, "recon_f1": 0.630212930327982, }, "ag_news": { "baseline_acc": 0.9068421052631579, "sae_acc": 0.8497368421052631, "recon_acc": 0.868421052631579, "baseline_f1": 0.9067620442309687, "sae_f1": 0.849209278774073, "recon_f1": 0.8685165182350671, }, } }
      llama 1b
      { "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_faithful-llama3.2-1b_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_faithful-llama3.2-1b_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.6771788990825688, "recon_acc": 0.7069954128440367, "baseline_f1": 0.8001009384586332, "sae_f1": 0.6604980539245782, "recon_f1": 0.7044136847971996, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7056567593480345, "recon_acc": 0.7377756471716204, "baseline_f1": 0.6381572073006643, "sae_f1": 0.46315353559740324, "recon_f1": 0.6049876680314521, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7901315789473684, "recon_acc": 0.8191447368421052, "baseline_f1": 0.860867910090608, "sae_f1": 0.7887296126660821, "recon_f1": 0.8189521248408049, } }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.661697247706422, "recon_acc": 0.7247706422018348, "baseline_f1": 0.8001009384586332, "sae_f1": 0.6395535445833574, "recon_f1": 0.7227740144313253, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7075743048897412, "recon_acc": 0.736816874400767, "baseline_f1": 0.6381572073006643, "sae_f1": 0.4728249632110051, "recon_f1": 0.6085546284839893, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7933552631578947, "recon_acc": 0.8174342105263158, "baseline_f1": 0.860867910090608, "sae_f1": 0.792079768259315, "recon_f1": 0.8172451481958556, }, }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_195311": { "stanfordnlp/sst2": { "baseline_acc": 0.8010321100917431, "sae_acc": 0.6680045871559632, "recon_acc": 0.7247706422018348, 
"baseline_f1": 0.8001009384586332, "sae_f1": 0.6514118480752982, "recon_f1": 0.7235045413291168, }, "nyu-mll/glue/cola": { "baseline_acc": 0.7526366251198466, "sae_acc": 0.7070949185043145, "recon_acc": 0.7339405560882071, "baseline_f1": 0.6381572073006643, "sae_f1": 0.4701210319231057, "recon_f1": 0.5992029535466976, }, "ag_news": { "baseline_acc": 0.8612500000000001, "sae_acc": 0.7904605263157896, "recon_acc": 0.8163157894736842, "baseline_f1": 0.860867910090608, "sae_f1": 0.789293579698522, "recon_f1": 0.8164310093089504, } } }
      pythia 2.8
      { "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pythia-2.8b_synthetic_180k_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pythia-2.8b_synthetic_180k_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6559633027522935, "recon_acc": 0.6892201834862386, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6487296756412979, "recon_f1": 0.6835865602014071, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.5800575263662512, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3248497386693457, "recon_f1": 0.43478813762656554, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.5732894736842105, "recon_acc": 0.59875, "baseline_f1": 0.8468502569930477, "sae_f1": 0.5731572109943672, "recon_f1": 0.6027152942034848, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8780263157894737, "recon_acc": 0.8898421052631579, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8778787915295191, "recon_f1": 0.8897811277723608, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_fineweb_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_fineweb_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.6353211009174311, "recon_acc": 0.7138761467889909, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6303691839253754, "recon_f1": 0.7061184907387809, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5009587727708533, "recon_acc": 0.6212847555129435, "baseline_f1": 0.3725260677388479, "sae_f1": 0.32386999539827843, "recon_f1": 0.4578079209216317, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.6030921052631579, "recon_acc": 0.5969736842105263, "baseline_f1": 0.8468502569930477, "sae_f1": 0.6009567444105985, "recon_f1": 0.6001021008319489, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8942368421052631, "recon_acc": 0.9004736842105263, 
"baseline_f1": 0.9379692987935441, "sae_f1": 0.894207685428926, "recon_f1": 0.9004008374365153, } }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pile-uncopyrighted_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pile-uncopyrighted_512_156793": { "stanfordnlp/sst2": { "baseline_acc": 0.8818807339449541, "sae_acc": 0.669151376146789, "recon_acc": 0.661697247706422, "baseline_f1": 0.8818021917383296, "sae_f1": 0.6633291361786984, "recon_f1": 0.6430236708749482, }, "nyu-mll/glue/cola": { "baseline_acc": 0.5239693192713327, "sae_acc": 0.5023969319271333, "recon_acc": 0.6160115052732502, "baseline_f1": 0.3725260677388479, "sae_f1": 0.3271537002080098, "recon_f1": 0.4583511769225481, }, "ag_news": { "baseline_acc": 0.8474999999999999, "sae_acc": 0.6008552631578947, "recon_acc": 0.6094078947368421, "baseline_f1": 0.8468502569930477, "sae_f1": 0.6005771593515399, "recon_f1": 0.6079249032315457, }, "yelp_polarity": { "baseline_acc": 0.938, "sae_acc": 0.8875131578947368, "recon_acc": 0.892421052631579, "baseline_f1": 0.9379692987935441, "sae_f1": 0.8874637731864572, "recon_f1": 0.8923647758257278, } } }
       
      ❯ python analyze_data.py
      Average sae_acc and sae_f1 across models and tasks:
        faithful: sae_acc=0.7042, sae_f1=0.6347
        fineweb: sae_acc=0.7009, sae_f1=0.6298
        pile-uncopyrighted: sae_acc=0.7056, sae_f1=0.6342
      Pairwise comparisons:
        faithful vs fineweb:
          Average acc diff: 0.0034 (10/15 wins)
          Average f1 diff: 0.0049 (9/15 wins)
        faithful vs pile-uncopyrighted:
          Average acc diff: -0.0014 (7/15 wins)
          Average f1 diff: 0.0005 (6/15 wins)
        fineweb vs faithful:
          Average acc diff: -0.0034 (5/15 wins)
          Average f1 diff: -0.0049 (6/15 wins)
        fineweb vs pile-uncopyrighted:
          Average acc diff: -0.0047 (7/15 wins)
          Average f1 diff: -0.0044 (7/15 wins)
        pile-uncopyrighted vs faithful:
          Average acc diff: 0.0014 (7/15 wins)
          Average f1 diff: -0.0005 (9/15 wins)
        pile-uncopyrighted vs fineweb:
          Average acc diff: 0.0047 (8/15 wins)
          Average f1 diff: 0.0044 (8/15 wins)
      Analysis by model:
        GPT2:
          faithful: sae_acc=0.7000, sae_f1=0.6073
          fineweb: sae_acc=0.6966, sae_f1=0.6009
          pile: sae_acc=0.7159, sae_f1=0.6218
        Llama-3B:
          faithful: sae_acc=0.7883, sae_f1=0.7408
          fineweb: sae_acc=0.7829, sae_f1=0.7341
          pile: sae_acc=0.7878, sae_f1=0.7411
        Llama-1B:
          faithful: sae_acc=0.7243, sae_f1=0.6375
          fineweb: sae_acc=0.7209, sae_f1=0.6348
          pile: sae_acc=0.7219, sae_f1=0.6369
        Pythia-1.4B:
          faithful: sae_acc=0.7292, sae_f1=0.6704
          fineweb: sae_acc=0.7241, sae_f1=0.6607
          pile: sae_acc=0.7116, sae_f1=0.6406
        Pythia-2.8B:
          faithful: sae_acc=0.5794, sae_f1=0.5175
          fineweb: sae_acc=0.5798, sae_f1=0.5184
          pile: sae_acc=0.5908, sae_f1=0.5304
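The pairwise summary above can be reproduced in spirit with a short script. This is only a minimal sketch: the source of `analyze_data.py` is not shown here, so the nesting (task -> dataset -> score), the `pairwise` helper, and the rule that a "win" means a strictly higher score are all assumptions. The numbers are the gpt2-small `sae_acc` values for two tasks from the results above, so the win counts here are out of 2 rather than 15.

```python
from itertools import permutations
from statistics import mean

# sae_acc values copied from the gpt2-small results above.
# The task -> dataset -> score layout is a hypothetical choice for this sketch.
sae_acc = {
    "stanfordnlp/sst2": {
        "faithful": 0.6381880733944953,
        "fineweb": 0.6313073394495412,
        "pile-uncopyrighted": 0.6869266055045872,
    },
    "ag_news": {
        "faithful": 0.7680263157894738,
        "fineweb": 0.7672368421052631,
        "pile-uncopyrighted": 0.77,
    },
}


def pairwise(scores, a, b):
    """Return (mean of a - b, number of tasks where a beats b, task count)."""
    diffs = [by_dataset[a] - by_dataset[b] for by_dataset in scores.values()]
    wins = sum(d > 0 for d in diffs)  # assumed rule: strictly higher score wins
    return mean(diffs), wins, len(diffs)


datasets = ["faithful", "fineweb", "pile-uncopyrighted"]
summary = {
    (a, b): pairwise(sae_acc, a, b) for a, b in permutations(datasets, 2)
}

for (a, b), (diff, wins, total) in summary.items():
    print(f"{a} vs {b}: Average acc diff: {diff:.4f} ({wins}/{total} wins)")
```

Because ties count as a win for neither side under this rule, the win counts of "a vs b" and "b vs a" do not have to sum to the total, which would be consistent with the 7/15 and 7/15 reported for faithful vs pile-uncopyrighted.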
       
       
       

      Fake feature

      gemma 2b
      { "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_faithful-gemma2-2b_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_faithful-gemma2-2b_1024_9764": { "avg_fake_feature_ratio": 0.006564670138888889, "avg_fake_feature_count1": 121.0, "fake_feature_count1": 120, "fake_feature_count2": 122, "fake_feature_ratio1": 0.006510416666666667, "fake_feature_ratio2": 0.006618923611111111, "d_sae1": 18432, "d_sae2": 18432 }, "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_fineweb_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_fineweb_1024_9764": { "avg_fake_feature_ratio": 0.007161458333333333, "avg_fake_feature_count1": 132.0, "fake_feature_count1": 127, "fake_feature_count2": 137, "fake_feature_ratio1": 0.006890190972222222, "fake_feature_ratio2": 0.007432725694444444, "d_sae1": 18432, "d_sae2": 18432 }, "gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_42_pile-uncopyrighted_1024_9764-gemma-2-2b_blocks.20.hook_resid_pre_18432_topk_64_0.0003_49_pile-uncopyrighted_1024_9764": { "avg_fake_feature_ratio": 0.006673177083333333, "avg_fake_feature_count1": 123.0, "fake_feature_count1": 118, "fake_feature_count2": 128, "fake_feature_ratio1": 0.006401909722222222, "fake_feature_ratio2": 0.006944444444444444, "d_sae1": 18432, "d_sae2": 18432 } }
      gpt2
      { "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_faithful-gpt2-small_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_faithful-gpt2-small_128_24413": { "avg_fake_feature_ratio": 0.003011067708333333, "avg_fake_feature_count1": 37.0, "fake_feature_count1": 36, "fake_feature_count2": 38, "fake_feature_ratio1": 0.0029296875, "fake_feature_ratio2": 0.0030924479166666665, "d_sae1": 12288, "d_sae2": 12288 }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_fineweb_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_fineweb_128_24413": { "avg_fake_feature_ratio": 0.002726236979166667, "avg_fake_feature_count1": 33.5, "fake_feature_count1": 34, "fake_feature_count2": 33, "fake_feature_ratio1": 0.0027669270833333335, "fake_feature_ratio2": 0.002685546875, "d_sae1": 12288, "d_sae2": 12288 }, "gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_42_pile-uncopyrighted_128_24413-gpt2-small_blocks.8.hook_resid_pre_12288_topk_16_0.0003_49_pile-uncopyrighted_128_24413": { "avg_fake_feature_ratio": 0.002644856770833333, "avg_fake_feature_count1": 32.5, "fake_feature_count1": 33, "fake_feature_count2": 32, "fake_feature_ratio1": 0.002685546875, "fake_feature_ratio2": 0.0026041666666666665, "d_sae1": 12288, "d_sae2": 12288 } }
      llama1
      { "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_faithful-llama3.2-1b_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_faithful-llama3.2-1b_512_195311": { "avg_fake_feature_ratio": 0.007882254464285714, "avg_fake_feature_count1": 113.0, "fake_feature_count1": 120, "fake_feature_count2": 106, "fake_feature_ratio1": 0.008370535714285714, "fake_feature_ratio2": 0.007393973214285714, "d_sae1": 14336, "d_sae2": 14336 }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_195311": { "avg_fake_feature_ratio": 0.007045200892857143, "avg_fake_feature_count1": 101.0, "fake_feature_count1": 101, "fake_feature_count2": 101, "fake_feature_ratio1": 0.007045200892857143, "fake_feature_ratio2": 0.007045200892857143, "d_sae1": 14336, "d_sae2": 14336 }, "Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_195311-Llama-3.2-1B_blocks.12.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_195311": { "avg_fake_feature_ratio": 0.007463727678571428, "avg_fake_feature_count1": 107.0, "fake_feature_count1": 109, "fake_feature_count2": 105, "fake_feature_ratio1": 0.007603236607142857, "fake_feature_ratio2": 0.00732421875, "d_sae1": 14336, "d_sae2": 14336 } }
      llama3
      { "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_faithful-llama3.2-3b_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_faithful-llama3.2-3b_512_195311": { "avg_fake_feature_ratio": 0.14591471354166669, "avg_fake_feature_count1": 2689.5, "fake_feature_count1": 2652, "fake_feature_count2": 2727, "fake_feature_ratio1": 0.14388020833333334, "fake_feature_ratio2": 0.14794921875, "d_sae1": 18432, "d_sae2": 18432 }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_fineweb_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_fineweb_512_195311": { "avg_fake_feature_ratio": 0.13460286458333331, "avg_fake_feature_count1": 2481.0, "fake_feature_count1": 2536, "fake_feature_count2": 2426, "fake_feature_ratio1": 0.13758680555555555, "fake_feature_ratio2": 0.1316189236111111, "d_sae1": 18432, "d_sae2": 18432 }, "Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_42_pile-uncopyrighted_512_195311-Llama-3.2-3B_blocks.21.hook_resid_pre_18432_topk_64_0.0001_49_pile-uncopyrighted_512_195311": { "avg_fake_feature_ratio": 0.14176432291666669, "avg_fake_feature_count1": 2613.0, "fake_feature_count1": 2599, "fake_feature_count2": 2627, "fake_feature_ratio1": 0.14100477430555555, "fake_feature_ratio2": 0.1425238715277778, "d_sae1": 18432, "d_sae2": 18432 } }
      llama8
      { "Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_42_faithful-llama3.1-8b_512_292967-Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_49_faithful-llama3.1-8b_512_292967": { "avg_fake_feature_ratio": 0.036041259765625, "avg_fake_feature_count1": 590.5, "fake_feature_count1": 605, "fake_feature_count2": 576, "fake_feature_ratio1": 0.03692626953125, "fake_feature_ratio2": 0.03515625, "d_sae1": 16384, "d_sae2": 16384 }, "Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_42_fineweb_512_292967-Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_49_fineweb_512_292967": { "avg_fake_feature_ratio": 0.03955078125, "avg_fake_feature_count1": 648.0, "fake_feature_count1": 643, "fake_feature_count2": 653, "fake_feature_ratio1": 0.03924560546875, "fake_feature_ratio2": 0.03985595703125, "d_sae1": 16384, "d_sae2": 16384 }, "Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_42_pile-uncopyrighted_512_292967-Llama-3.1-8B_blocks.24.hook_resid_pre_16384_topk_80_6e-05_49_pile-uncopyrighted_512_292967": { "avg_fake_feature_ratio": 0.040130615234375, "avg_fake_feature_count1": 657.5, "fake_feature_count1": 663, "fake_feature_count2": 652, "fake_feature_ratio1": 0.04046630859375, "fake_feature_ratio2": 0.039794921875, "d_sae1": 16384, "d_sae2": 16384 } }
       
       
       
      pythia1
      { "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_faithful-pythia1.4b_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_faithful-pythia1.4b_512_140185": { "avg_fake_feature_ratio": 0.07913643973214285, "avg_fake_feature_count1": 1134.5, "fake_feature_count1": 1059, "fake_feature_count2": 1210, "fake_feature_ratio1": 0.07386997767857142, "fake_feature_ratio2": 0.08440290178571429, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_fineweb_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_fineweb_512_140185": { "avg_fake_feature_ratio": 0.07662527901785715, "avg_fake_feature_count1": 1098.5, "fake_feature_count1": 1101, "fake_feature_count2": 1096, "fake_feature_ratio1": 0.07679966517857142, "fake_feature_ratio2": 0.07645089285714286, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_FLAN_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_FLAN_512_140185": { "avg_fake_feature_ratio": 0.07373046875, "avg_fake_feature_count1": 1057.0, "fake_feature_count1": 1057, "fake_feature_count2": 1057, "fake_feature_ratio1": 0.07373046875, "fake_feature_ratio2": 0.07373046875, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pythia-2.8b_synthetic_180k_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pythia-2.8b_synthetic_180k_512_140185": { "avg_fake_feature_ratio": 0.08269391741071427, "avg_fake_feature_count1": 1185.5, "fake_feature_count1": 1172, "fake_feature_count2": 1199, "fake_feature_ratio1": 0.08175223214285714, "fake_feature_ratio2": 0.08363560267857142, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_pile-uncopyrighted_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_pile-uncopyrighted_512_140185": { "avg_fake_feature_ratio": 0.07847377232142858, 
"avg_fake_feature_count1": 1125.0, "fake_feature_count1": 1139, "fake_feature_count2": 1111, "fake_feature_ratio1": 0.07945033482142858, "fake_feature_ratio2": 0.07749720982142858, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_merged_uncensored_alpaca_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_merged_uncensored_alpaca_512_140185": { "avg_fake_feature_ratio": 0.08231026785714285, "avg_fake_feature_count1": 1180.0, "fake_feature_count1": 1086, "fake_feature_count2": 1274, "fake_feature_ratio1": 0.07575334821428571, "fake_feature_ratio2": 0.0888671875, "d_sae1": 14336, "d_sae2": 14336 }, "pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_42_open-instruct-uncensored-alpaca_512_140185-pythia-1.4b_blocks.18.hook_resid_pre_14336_topk_48_0.0002_49_open-instruct-uncensored-alpaca_512_140185": { "avg_fake_feature_ratio": 0.07718331473214285, "avg_fake_feature_count1": 1106.5, "fake_feature_count1": 1148, "fake_feature_count2": 1065, "fake_feature_ratio1": 0.080078125, "fake_feature_ratio2": 0.07428850446428571, "d_sae1": 14336, "d_sae2": 14336 } }
      pythia2
      { "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_faithful-pythia1.4b_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_faithful-pythia1.4b_512_156793": { "avg_fake_feature_ratio": 0.005859375, "avg_fake_feature_count1": 90.0, "fake_feature_count1": 90, "fake_feature_count2": 90, "fake_feature_ratio1": 0.005859375, "fake_feature_ratio2": 0.005859375, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_fineweb_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_fineweb_512_156793": { "avg_fake_feature_ratio": 0.005566406249999999, "avg_fake_feature_count1": 85.5, "fake_feature_count1": 87, "fake_feature_count2": 84, "fake_feature_ratio1": 0.0056640625, "fake_feature_ratio2": 0.00546875, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_FLAN_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_FLAN_512_156793": { "avg_fake_feature_ratio": 0.0052734375, "avg_fake_feature_count1": 81.0, "fake_feature_count1": 80, "fake_feature_count2": 82, "fake_feature_ratio1": 0.005208333333333333, "fake_feature_ratio2": 0.005338541666666667, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pythia-2.8b_synthetic_180k_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pythia-2.8b_synthetic_180k_512_156793": { "avg_fake_feature_ratio": 0.00634765625, "avg_fake_feature_count1": 97.5, "fake_feature_count1": 96, "fake_feature_count2": 99, "fake_feature_ratio1": 0.00625, "fake_feature_ratio2": 0.0064453125, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_pile-uncopyrighted_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_pile-uncopyrighted_512_156793": { "avg_fake_feature_ratio": 0.006087239583333333, "avg_fake_feature_count1": 93.5, "fake_feature_count1": 94, 
"fake_feature_count2": 93, "fake_feature_ratio1": 0.006119791666666667, "fake_feature_ratio2": 0.0060546875, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_merged_uncensored_alpaca_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_merged_uncensored_alpaca_512_156793": { "avg_fake_feature_ratio": 0.005696614583333334, "avg_fake_feature_count1": 87.5, "fake_feature_count1": 86, "fake_feature_count2": 89, "fake_feature_ratio1": 0.005598958333333333, "fake_feature_ratio2": 0.0057942708333333336, "d_sae1": 15360, "d_sae2": 15360 }, "pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_42_open-instruct-uncensored-alpaca_512_156793-pythia-2.8b_blocks.24.hook_resid_pre_15360_topk_64_0.0001_49_open-instruct-uncensored-alpaca_512_156793": { "avg_fake_feature_ratio": 0.005729166666666667, "avg_fake_feature_count1": 88.0, "fake_feature_count1": 90, "fake_feature_count2": 86, "fake_feature_ratio1": 0.005859375, "fake_feature_ratio2": 0.005598958333333333, "d_sae1": 15360, "d_sae2": 15360 } }
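The fields in these fake-feature records are related by simple arithmetic, which the sketch below checks for one entry. It is not the project's evaluation code: how a feature gets classified as "fake" is not shown in these notes, and reading `count1`/`count2` as the seed-42 and seed-49 SAEs is an assumption based on the run names. The values are copied from the pythia-2.8b open-instruct-uncensored-alpaca entry above.

```python
from statistics import mean

# Values copied from the pythia-2.8b open-instruct-uncensored-alpaca entry.
d_sae = 15360             # SAE dictionary size (d_sae1 == d_sae2)
fake_feature_count1 = 90  # assumed: SAE trained with seed 42
fake_feature_count2 = 86  # assumed: SAE trained with seed 49

# Each ratio is the fake-feature count divided by the dictionary size.
fake_feature_ratio1 = fake_feature_count1 / d_sae
fake_feature_ratio2 = fake_feature_count2 / d_sae

# The "avg_*" fields are plain means over the two SAEs of the pair.
avg_fake_feature_ratio = mean([fake_feature_ratio1, fake_feature_ratio2])
avg_fake_feature_count = mean([fake_feature_count1, fake_feature_count2])

print(f"ratio1={fake_feature_ratio1:.6f} ratio2={fake_feature_ratio2:.6f}")
print(f"avg ratio={avg_fake_feature_ratio:.6f} avg count={avg_fake_feature_count}")
```

Reporting the ratio rather than the raw count makes the entries comparable across models whose SAEs use different dictionary sizes (12288 to 18432 in the records above).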
       
       
       
       
      • gpt2
      • llama1
      • pythia 1.4
      • llama1
      • pythia2.8
      • llama3
      • llama8
       
       
       
       
      below 4~5 percent
       
       

       
