BBQ
- much more effective at raising BBQ than MMLU
Gemma
Re-check the critic and features with gemma2b_bbq_17_ppo_1e-05_0725_180548_50.0_select
Baseline
Disambiguated
- white paper - 83.20 1-shot
- baseline 75.31%
- few 1 - 84.83%, 84.01%, 84.77 → 84.95 5th
- 17th 83.96% -50
- few 5
- 17th 83.59% -50
- 17th 76.73
- loss normal
- 77.04%
- corrsteer
- 86.18% few layer?
- 85.84% global few
- 75.70% not few layer
- 76.53% global not few
Ambiguous
- white paper - 69.31
- baseline 59.41%
- new few 60.16% → 63.71% 5th
- few 80.52% (1-shot, but highly dependent on the example)
- 1 - 63.70%
- 5 - 72.81%
- select -
- corrsteer
- 62.08% zero global
- layer 62.38%
- few 64.98% global
- foreach 66.65%
- 21st 61.88%
- policy/critic both deep
- loss softmax
Domain specific
genderIdentity
39.26% {235248: 369, 586: 1333, 585: 1180, 599: 1046, 5231: 43} → 38.08%
43.57% {586: 1897, 599: 1059, 585: 1004, 108: 1, 5231: 5, 235248: 5} 20 20
41.80% {585: 1371, 586: 1457, 599: 1088, 139: 54, 235248: 1} 24 20
baseline select {585: 1350, 586: 1486, 599: 1135} : 42.86%
20th select {586: 1904, 599: 1060, 585: 1007} 43.72%
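The histograms above map predicted token ids to counts. A minimal sketch of how to read them, assuming (the notes do not state this) that 585, 586 and 599 are the three answer-option tokens, since they are the only ids left in the `select` runs. Any other id is counted as an off-target prediction. `off_target_share` is a hypothetical helper name.

```python
# Predicted-token histograms copied from the notes (token id -> count).
runs = {
    "baseline (39.26%)": {235248: 369, 586: 1333, 585: 1180, 599: 1046, 5231: 43},
    "baseline select": {585: 1350, 586: 1486, 599: 1135},
    "20th select": {586: 1904, 599: 1060, 585: 1007},
}

# Assumption: these three ids are the answer-option tokens.
ANSWER_IDS = {585, 586, 599}


def off_target_share(hist):
    """Return (total predictions, fraction of predictions outside ANSWER_IDS)."""
    total = sum(hist.values())
    off = sum(count for token, count in hist.items() if token not in ANSWER_IDS)
    return total, off / total


for name, hist in runs.items():
    total, share = off_target_share(hist)
    print(f"{name}: n={total}, off-target={share:.2%}")
```

Each histogram sums to 3971 predictions. In the unselected baseline, 412 of them fall outside the three ids.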
Baseline
- 39.26% baseline without selection
- 42.86% baseline with selection
Single layer
- 43.72% for 20th layer with selection
- 45.00% for 20th layer without selection
- 42.08% for 24th layer with selection
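The "with selection" runs only ever predict three token ids, which is consistent with restricting the prediction to a candidate set. A minimal sketch of that reading, assuming selection means taking the argmax over a fixed set of candidate ids. `select_answer`, the candidate ids passed in, and the toy logits are all hypothetical and not taken from the actual evaluation code.

```python
def select_answer(logits, candidate_ids=None):
    """Return the token id with the highest logit.

    logits: dict of token id -> logit (toy stand-in for a vocab-sized vector).
    candidate_ids: if given, only these ids are eligible ("with selection").
    """
    if candidate_ids is None:
        pool = logits
    else:
        pool = {token: logits[token] for token in candidate_ids}
    return max(pool, key=pool.get)


# Toy logits: an off-target token has the highest score overall.
toy_logits = {235248: 3.1, 585: 1.2, 586: 2.7, 599: 0.4}

free = select_answer(toy_logits)
constrained = select_answer(toy_logits, candidate_ids=(585, 586, 599))
print(free, constrained)
```

Without selection the off-target id wins. With selection the best-scoring candidate is returned instead, so an example that would otherwise be counted wrong can still be answered.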
Neuronpedia (not single token feature)
- 17th 15253
- +50 disambig 70.58%
- -50 disambig 78.89%
- 85.96% few1
- +50 ambig 69.62
- -50 ambig 50.68
- 23rd 1469
- 15th 13195
- 20th 1810
- 25th 13753
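The +50 and -50 entries read as steering coefficients applied to a single feature. A minimal sketch, assuming steering adds `coefficient * direction` to the activation at the chosen layer. The notes do not state the mechanism, `steer` is a hypothetical helper, and the 4-dimensional vectors are toy values rather than a real decoder direction for feature 15253.

```python
def steer(activation, direction, coefficient):
    """Add a scaled feature direction to an activation vector."""
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    return [a + coefficient * d for a, d in zip(activation, direction)]


# Toy values only (assumption): a 4-dim activation and a unit feature direction.
activation = [0.5, -1.0, 2.0, 0.0]
direction = [0.0, 1.0, 0.0, 0.0]

plus = steer(activation, direction, 50.0)    # corresponds to a "+50" run
minus = steer(activation, direction, -50.0)  # corresponds to a "-50" run
print(plus, minus)
```

Under this assumption the sign of the coefficient pushes the activation toward or away from the feature, which would match the +50 and -50 rows moving the ambiguous and disambiguated scores in opposite directions.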
Neuronpedia
Open Source Interpretability Platform
https://www.neuronpedia.org/search-explanations/?modelId=gemma-2-2b&q=bias
Llama
baseline
- ambig 23.01% zero shot
- 1 shot 84.08% → 85.04% 7th
- disambig 78.69% zero shot
- 1 shot 90.07% → 90.16 30th
ambig
- 34.87 7th -5
- 31.92% corrsteer
disambig
- 79.20 10th -5
Corrsteer
- ambig 0 61.10
Seonglae Cho