Automated Behavior Eliciting

Creator
Seonglae Cho
Created
2024 Oct 24 23:23
Edited
2025 Nov 13 0:51
Investigator training through DPO-based fine-tuning on a prompt-output dataset
  1. Define a rubric
  2. Run inference and keep outputs that satisfy the rubric
  3. Collect pairwise preferences
Although data collection and training take a long time, the trained investigator is reusable
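
The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any particular paper or library: the rubric scores are assumed to already exist as numbers, the record keys `prompt`, `chosen`, `rejected` follow a common convention for preference data, and the helper names `build_preference_pairs` and `dpo_loss` are hypothetical. The loss is the standard per-pair DPO objective, -log sigmoid(beta * ((log pi(chosen) - log pi_ref(chosen)) - (log pi(rejected) - log pi_ref(rejected)))).

```python
import math
from itertools import combinations


def build_preference_pairs(prompt, scored_outputs, min_gap=0.0):
    """Turn rubric-scored outputs for one prompt into preference records.

    scored_outputs: list of (output_text, rubric_score), higher is better.
    Pairs whose score difference is <= min_gap are skipped as ties.
    """
    pairs = []
    for (out_a, score_a), (out_b, score_b) in combinations(scored_outputs, 2):
        if abs(score_a - score_b) <= min_gap:
            continue  # no clear preference between these two outputs
        if score_a > score_b:
            chosen, rejected = out_a, out_b
        else:
            chosen, rejected = out_b, out_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    logp_* come from the policy being trained, ref_logp_* from the frozen
    reference model. beta=0.1 is an assumed, commonly used default.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Hypothetical rubric scores for three candidate outputs
    candidates = [("output A", 0.9), ("output B", 0.2), ("output C", 0.9)]
    for record in build_preference_pairs("target behavior description", candidates):
        print(record)
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Because the preference records are plain dictionaries, the same collected dataset can be reused across training runs, which is where the cost of data collection is amortized.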
 
  • Exact String Elicitation - Inducing specific string outputs (likely only effective against shallow alignment attacks)
  • Rubric-Based Behavior Elicitation - Highly versatile jailbreak
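
The difference between the two elicitation targets can be made concrete as two scoring functions. This is a rough sketch with assumed names: `exact_string_score`, `rubric_score`, and `keyword_judge` are hypothetical, and `keyword_judge` is a toy stand-in for whatever judge (typically a model grading against the rubric) would be used in practice.

```python
from typing import Callable


def exact_string_score(response: str, target: str) -> float:
    """Exact string elicitation: success only if the target string appears."""
    return 1.0 if target in response else 0.0


def rubric_score(response: str, rubric: str, judge: Callable[[str, str], float]) -> float:
    """Rubric-based elicitation: a judge grades the response against the rubric.

    The judge is any callable (rubric, response) -> float; the result is
    clamped to [0, 1] so scores stay comparable across judges.
    """
    return min(1.0, max(0.0, judge(rubric, response)))


def keyword_judge(rubric: str, response: str) -> float:
    """Toy judge (assumption): fraction of rubric words found in the response."""
    keywords = rubric.lower().split()
    if not keywords:
        return 0.0
    text = response.lower()
    return sum(word in text for word in keywords) / len(keywords)


if __name__ == "__main__":
    print(exact_string_score("I will say HELLO now", "HELLO"))
    print(rubric_score("The model refuses.", "refuses politely", keyword_judge))
```

The exact-string score is binary and tied to one surface form, while the rubric score accepts any response the judge rates as matching the described behavior, which is what makes the rubric-based variant more versatile.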
 
 
 
 
