CRL HarmBench

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Jul 23 23:2
Editor
Edited
Edited
2025 Aug 12 13:10
Refs
Refs

Gemma

Baseline

  • keyword based 35.36%
  • rejection detection model based 44.64%

21th

  • default 45.36%
  • with mask generated 47.50%
  • with mask all 45.71%

past

baseline 34.25
new 49.5

Corrsteer

  • non-decode global 33.57% , decode global 34.29%
  • 67.50% max pooling global
  • 47.86% mean pooling global
  • coeff mean, corr max: 48.21%
  • coeff max, corr mean: 0%
 
 

LLama

 
 
 
 

 

Recommendations