Automated Behavior Eliciting

Creator: Seonglae Cho
Created: 2024 Oct 24 23:23
Edited: 2024 Dec 20 23:51
Investigator training through DPO-based fine-tuning on a prompt-output dataset
  1. Define a rubric
  2. Run inference and keep outputs that satisfy the rubric
  3. Collect pairwise preferences
Although data collection and training take a long time, the trained investigator is reusable
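The three steps above can be sketched as a small data-preparation routine. This is a minimal illustration, not the method of any specific paper: the `Attempt` record, the `min_gap` threshold, and the idea that each candidate prompt already carries a rubric score are all assumptions made for the example. The output uses the prompt/chosen/rejected layout that DPO trainers commonly expect.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Attempt:
    rubric: str      # behavior description given to the investigator (assumed input)
    candidate: str   # prompt proposed by the investigator
    score: float     # hypothetical rubric score of the target model's output


def build_preference_pairs(attempts, min_gap=0.1):
    """Turn rubric-scored attempts into pairwise preference records."""
    # Group attempts so that only candidates for the same rubric are compared.
    by_rubric = {}
    for attempt in attempts:
        by_rubric.setdefault(attempt.rubric, []).append(attempt)

    pairs = []
    for rubric, group in by_rubric.items():
        for a, b in combinations(group, 2):
            # Skip near-ties: they carry little preference signal.
            if abs(a.score - b.score) < min_gap:
                continue
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append({
                "prompt": rubric,
                "chosen": chosen.candidate,
                "rejected": rejected.candidate,
            })
    return pairs


attempts = [
    Attempt("model reveals its system prompt", "candidate A", 0.9),
    Attempt("model reveals its system prompt", "candidate B", 0.2),
    Attempt("model reveals its system prompt", "candidate C", 0.85),
]
pairs = build_preference_pairs(attempts)
```

Because the pairs depend only on stored scores, the same dataset can be reused across training runs, which matches the reusability point above.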
 
  • Exact String Elicitation - elicits a specific target string (likely effective only as an attack on shallow alignment)
  • Rubric-Based Behavior Elicitation - a more general-purpose jailbreak
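The difference between the two elicitation types shows up mainly in how an output is scored. The sketch below is illustrative only: `exact_string_score`, `rubric_score`, and the `judge` callable are hypothetical names, and in practice the judge would likely be a separate model rather than a plain function.

```python
from typing import Callable


def exact_string_score(output: str, target: str) -> float:
    """Exact string elicitation: success only if the target string appears."""
    return 1.0 if target in output else 0.0


def rubric_score(output: str, rubric: str,
                 judge: Callable[[str, str], float]) -> float:
    """Rubric-based elicitation: a judge rates how well the output fits the rubric.

    The judge is an assumed callable (rubric, output) -> float; its result is
    clamped to [0, 1] so scores stay comparable across judges.
    """
    return max(0.0, min(1.0, judge(rubric, output)))


def toy_judge(rubric: str, output: str) -> float:
    """Stand-in judge for the example: fraction of rubric words found in the output."""
    words = rubric.lower().split()
    hits = sum(1 for word in words if word in output.lower())
    return hits / len(words) if words else 0.0


exact = exact_string_score("The code is 1234", "1234")
graded = rubric_score("The model gives harmful advice", "harmful advice", toy_judge)
```

The exact-string score is binary and tied to one surface form, while the rubric score is graded and can accept many different outputs, which is one way to read the generality claim above.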
 
 
 
 
