CRL BBQ

Creator
Seonglae Cho
Created
2025 May 2 0:51
Edited
2025 Aug 12 21:18
Refs

BBQ

  • Much more efficient at raising BBQ scores than MMLU scores

Gemma

Re-check the critic and the features using gemma2b_bbq_17_ppo_1e-05_0725_180548_50.0_select

Baseline

Disambiguated

  • white paper - 83.20 1-shot
  • baseline 75.31%
  • few 1 - 84.83%, 84.01%, 84.77 → 84.95 5th
    • 17th 83.96% -50
  • few 5
    • 17th 83.59% -50
  • 17th 76.73
    • loss normal
  • 77.04%
  • corrsteer
    • 86.18% few layer?
    • 85.84% global few
    • 75.70% not few layer
    • 76.53% global not few
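
To compare the disambiguated runs at a glance, the figures above can be restated as percentage-point differences from the 75.31% baseline. The snippet below is a minimal sketch: only the numbers come from the list, while the dictionary labels are this example's reading of the shorthand entries and may not match the actual run configurations.

```python
# Disambiguated accuracies (%) copied from the list above.
# Labels paraphrase the list entries; they are illustrative, not run names.
BASELINE = 75.31
runs = {
    "white paper (1-shot)": 83.20,
    "corrsteer, few-shot, layer": 86.18,
    "corrsteer, few-shot, global": 85.84,
    "corrsteer, no few-shot, layer": 75.70,
    "corrsteer, no few-shot, global": 76.53,
}

def delta_pp(acc, baseline=BASELINE):
    """Percentage-point difference from the baseline, rounded to 2 decimals."""
    return round(acc - baseline, 2)

deltas = {name: delta_pp(acc) for name, acc in runs.items()}

# Print the runs from largest to smallest gain over the baseline.
for name, d in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {d:+.2f} pp")
```

Read this way, the two few-shot corrsteer runs sit more than 10 points above the baseline, while the runs without few-shot examples stay within about 1.2 points of it.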

Ambiguous

  • white paper - 69.31
  • baseline 59.41%
  • new few 60.16% → 63.71% 5th
  • few 80.52% (1-shot, but highly dependent on the example)
    • 1 - 63.70%
    • 5 - 72.81%
  • select -
  • corrsteer
    • 62.08% zero global
    • layer 62.38%
    • few 64.98% global
    • foreach 66.65%
  • 21st 61.88%
    • policy/critic both deep
    • loss softmax

Domain specific

genderIdentity
39.26% {235248: 369, 586: 1333, 585: 1180, 599: 1046, 5231: 43} → 38.08%
43.57% {586: 1897, 599: 1059, 585: 1004, 108: 1, 5231: 5, 235248: 5} 20 20
41.80% {585: 1371, 586: 1457, 599: 1088, 139: 54, 235248: 1} 24 20
baseline select {585: 1350, 586: 1486, 599: 1135} : 42.86%
20th select {586: 1904, 599: 1060, 585: 1007} 43.72%
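
The dictionaries above look like counts of predicted token ids per run. As a quick sanity check, the sketch below sums each histogram and measures how many predictions fall outside the answer options. It assumes that 585, 586 and 599 are the answer-option token ids, since they are the only ids left in the "select" rows; the notes do not state this explicitly, and the row labels are made up for the example.

```python
# Predicted-token histograms copied from the genderIdentity rows above.
# ASSUMPTION: 585, 586 and 599 are the answer-option token ids, because
# they are the only ids that remain in the "select" rows.
OPTION_IDS = {585, 586, 599}

histograms = {
    "39.26% run": {235248: 369, 586: 1333, 585: 1180, 599: 1046, 5231: 43},
    "43.57% run": {586: 1897, 599: 1059, 585: 1004, 108: 1, 5231: 5, 235248: 5},
    "41.80% run": {585: 1371, 586: 1457, 599: 1088, 139: 54, 235248: 1},
    "baseline select": {585: 1350, 586: 1486, 599: 1135},
    "20th select": {586: 1904, 599: 1060, 585: 1007},
}

def summarize(hist):
    """Return (total predictions, off-option predictions, off-option %)."""
    total = sum(hist.values())
    off = sum(n for tok, n in hist.items() if tok not in OPTION_IDS)
    return total, off, 100.0 * off / total

summary = {name: summarize(h) for name, h in histograms.items()}
for name, (total, off, pct) in summary.items():
    print(f"{name:16s} total={total} off-option={off} ({pct:.2f}%)")
```

All five rows sum to 3971 predictions. Under the assumption above, the 39.26% run has 412 off-option predictions (about 10.4%), which is the kind of leakage that selection removes.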

Baseline

  • 39.26% baseline without selection
  • 42.86% baseline with selection

Single layer

  • 43.72% for 20th layer with selection
  • 45.00% for 20th layer without selection
  • 42.08% for 24th layer with selection

Neuronpedia (not single token feature)

  • 17th 15253
    • +50 disambig 70.58%
    • -50 disambig 78.89%
      • 85.96% few1
    • +50 ambig 69.62
    • -50 ambig 50.68
  • 23rd 1469
  • 15th 13195
  • 20th 1810
  • 25th 13753
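
The +50 and -50 entries read as signed steering coefficients applied to a feature. As a rough illustration of what such an intervention commonly looks like, the sketch below adds a scaled feature direction to a toy hidden state. It assumes the coefficient multiplies a unit-norm feature direction (for example an SAE decoder row) that is added to the activations at the chosen layer; the notes do not spell this out, and the function names, shapes, and numbers are all placeholders.

```python
import math

def unit_vector(direction):
    """Normalize a direction to unit length."""
    norm = math.sqrt(sum(x * x for x in direction))
    return [x / norm for x in direction]

def steer(hidden, direction, coeff):
    """Add coeff * unit(direction) to every position of a hidden state.

    hidden:    list of rows, one per token position (toy data here)
    direction: feature direction, assumed to come from an SAE decoder
    coeff:     signed steering strength, such as +50 or -50
    """
    unit = unit_vector(direction)
    return [[h + coeff * u for h, u in zip(row, unit)] for row in hidden]

def projection(row, direction):
    """Component of one activation row along the feature direction."""
    unit = unit_vector(direction)
    return sum(h * u for h, u in zip(row, unit))

# Toy activations: 2 positions, width 4 (not the real model size).
hidden = [[0.5, -1.0, 2.0, 0.0], [1.5, 0.25, -0.75, 3.0]]
direction = [1.0, 2.0, -2.0, 0.0]

plus = steer(hidden, direction, 50.0)
minus = steer(hidden, direction, -50.0)

# The projection onto the feature direction shifts by the coefficient.
shift_plus = projection(plus[0], direction) - projection(hidden[0], direction)
shift_minus = projection(minus[0], direction) - projection(hidden[0], direction)
print(shift_plus, shift_minus)
```

With a unit-norm direction, the coefficient directly sets how far each activation moves along the feature, which makes values like +50 and -50 comparable across features.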
Neuronpedia
Open Source Interpretability Platform
 

Llama

baseline

  • ambig 23.01% zero shot
    • 1 shot 84.08% → 85.04% 7th
  • disambig 78.69% zero shot
    • 1 shot 90.07% → 90.16 30th

ambig

  • 34.87 7th -5
    • 31.92% corrsteer

disambig

  • 79.20 10th -5

Corrsteer

  • ambig 0 61.10