CRL BBQ

BBQ

Much more effective at improving BBQ than MMLU.

Gemma

Re-check the critic and features using gemma2b_bbq_17_ppo_1e-05_0725_180548_50.0_select

Baseline

Disambiguated

white paper - 83.20 1-shot

baseline 75.31%

few 1 - 84.83%, 84.01%, 84.77 → 84.95 5th

17th 83.96% -50

few 5

17th 83.59% -50

17th 76.73

loss normal

77.04%

corrsteer

86.18% few layer?
85.84% global few
75.70% not few layer
76.53% global not few

Ambiguous

white paper - 69.31

baseline 59.41%

new few 60.16% → 63.71% 5th

few 80.52% (1-shot, but highly dependent on the example)

1 - 63.70%
5 - 72.81%

select -

corrsteer

62.08% zero global
layer 62.38%
few 64.98% global
foreach 66.65%

21st 61.88%

policy/critic both deep
loss softmax

Domain specific

genderIdentity

39.26% {235248: 369, 586: 1333, 585: 1180, 599: 1046, 5231: 43} → 38.08%

43.57% {586: 1897, 599: 1059, 585: 1004, 108: 1, 5231: 5, 235248: 5} 20 20

41.80% {585: 1371, 586: 1457, 599: 1088, 139: 54, 235248: 1} 24 20

baseline select {585: 1350, 586: 1486, 599: 1135} : 42.86%

20th select {586: 1904, 599: 1060, 585: 1007} 43.72%
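The dictionaries above map predicted token ids to counts. A minimal sketch of how such a tally could be summarized, assuming that "selection" means restricting the prediction to the three answer token ids (585, 586, 599) that appear in the select rows; the function names are hypothetical and the actual selection procedure is not recorded in these notes:

```python
from collections import Counter

# Token ids that appear in the "select" rows of these notes. Which answer
# letter each id maps to is not recorded here.
ANSWER_TOKEN_IDS = (585, 586, 599)


def off_target_share(counts, answer_ids=ANSWER_TOKEN_IDS):
    """Fraction of predictions whose token id is not an answer option."""
    total = sum(counts.values())
    off_target = sum(n for tok, n in counts.items() if tok not in answer_ids)
    return off_target / total if total else 0.0


def select_answer(logits, answer_ids=ANSWER_TOKEN_IDS):
    """Assumed form of 'selection': argmax restricted to the answer ids.

    `logits` maps token id -> score at the next-token position.
    """
    return max(answer_ids, key=lambda tok: logits.get(tok, float("-inf")))


# Counts copied from the first genderIdentity run (no selection).
run_counts = Counter({235248: 369, 586: 1333, 585: 1180, 599: 1046, 5231: 43})
print(f"off-target share: {off_target_share(run_counts):.2%}")
```

In the first run, 412 of 3971 predictions fall outside the three answer ids. Those predictions cannot be scored as correct, and removing them is what selection would do under this assumption.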

Baseline

39.26% baseline without selection

42.86% baseline with selection

Single layer

43.72% for 20th layer with selection

45.00% for 20th layer without selection

42.08% for 24th layer with selection

Neuronpedia (not single token feature)

17th 15253

+50 disambig 70.58%
-50 disambig 78.89%

85.96% few1

+50 ambig 69.62
-50 ambig 50.68
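The +50 / -50 values above read like steering coefficients applied to feature 15253 at layer 17. A minimal sketch, assuming plain additive steering along a feature direction (for example an SAE decoder vector); the function name, the toy vectors, and the additive form are assumptions, since the notes do not state how the coefficient is applied:

```python
def steer(hidden, direction, coeff):
    """Return hidden + coeff * direction, element-wise.

    hidden    : residual-stream activation at one token position (list of floats)
    direction : feature direction of the same length (assumed, e.g. SAE decoder row)
    coeff     : steering coefficient, e.g. +50 or -50 as in the notes
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + coeff * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional example; real activations are much wider.
hidden_state = [1.0, 2.0]
feature_direction = [0.5, -0.5]

suppressed = steer(hidden_state, feature_direction, -50)
amplified = steer(hidden_state, feature_direction, +50)
```

A negative coefficient pushes the activation away from the feature and a positive one pushes toward it, which matches the paired +50 / -50 rows reported for the disambig and ambig splits.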

23rd 1469

15th 13195

20th 1810

25th 13753

Neuronpedia

Open Source Interpretability Platform

https://www.neuronpedia.org/search-explanations/?modelId=gemma-2-2b&q=bias

Llama

baseline

ambig 23.01% zero shot

1 shot 84.08% → 85.04% 7th

disambig 78.69% zero shot

1 shot 90.07% → 90.16 30th

ambig

34.87 7th -5

31.92% corrsteer

disambig

79.20 10th -5

Corrsteer

ambig 0 61.10
