Investigator training through DPO-based fine-tuning on a prompt-output dataset
- Define a rubric
- Run inference and keep outputs that satisfy the rubric
- Collect pairwise preferences
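
A minimal sketch of how the steps above could fit together, assuming the rubric can be reduced to a numeric score per output. The names `toy_rubric_score`, `build_preference_pairs`, and `PreferencePair` are hypothetical and only for illustration; in practice the rubric would more likely be scored by a judge model. `dpo_loss` is the standard per-pair DPO objective, computed here from sequence log-probabilities supplied by the caller.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # output that scored higher on the rubric
    rejected: str  # output that scored lower on the rubric


def toy_rubric_score(output: str) -> float:
    """Hypothetical rubric: fraction of required keywords found in the output."""
    required = ("step", "because")
    text = output.lower()
    return sum(word in text for word in required) / len(required)


def build_preference_pairs(
    prompt: str,
    outputs: Sequence[str],
    score_fn: Callable[[str], float],
    margin: float = 0.0,
) -> List[PreferencePair]:
    """Pair up outputs whose rubric scores differ by more than `margin`."""
    scored = [(score_fn(o), o) for o in outputs]
    pairs = []
    for (s_a, a), (s_b, b) in combinations(scored, 2):
        if abs(s_a - s_b) <= margin:
            continue  # too close to call, skip the pair
        chosen, rejected = (a, b) if s_a > s_b else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    x = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-x) == -log(sigmoid(x))
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


outputs = ["Step one, because X", "no", "a step"]
pairs = build_preference_pairs("Explain X", outputs, toy_rubric_score)
```

Raising `margin` drops pairs with near-equal scores, which trades dataset size for a cleaner preference signal.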
Although data collection and training take a long time, the trained investigator is reusable
- Exact String Elicitation - Inducing specific string outputs (likely only effective against shallow alignment attacks)
- Rubric-Based Behavior Elicitation - Highly versatile jailbreak
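
The two elicitation targets differ mainly in how success is checked. The sketch below is illustrative only: `exact_string_hit`, `rubric_hit`, and the two example criteria are hypothetical names, and simple predicates stand in for what would more plausibly be a judge model applying the rubric.

```python
from typing import Callable, Sequence


def exact_string_hit(output: str, target: str) -> bool:
    """Exact string elicitation: succeed only if the output equals the target string."""
    return output.strip() == target.strip()


def rubric_hit(
    output: str,
    criteria: Sequence[Callable[[str], bool]],
    min_fraction: float = 1.0,
) -> bool:
    """Rubric-based elicitation: succeed if enough rubric criteria are satisfied."""
    if not criteria:
        raise ValueError("rubric needs at least one criterion")
    passed = sum(1 for check in criteria if check(output))
    return passed / len(criteria) >= min_fraction


# Hypothetical rubric criteria, standing in for a judge model
def has_number(output: str) -> bool:
    return any(ch.isdigit() for ch in output)


def mentions_unit(output: str) -> bool:
    return "km" in output.lower()


criteria = [has_number, mentions_unit]
```

The exact-match check accepts a single string, while the rubric check accepts any output that meets the criteria, which is one way to read why the rubric-based variant is the more versatile of the two.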

Seonglae Cho