Investigator training through DPO-based fine-tuning on a prompt-output dataset
- Define Rubric
- Run inference to obtain outputs that satisfy the rubric
- Collect pairwise preferences
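The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the rubric criteria, the `rubric_score` and `build_pairs` helpers, the `min_margin` threshold, and the example data are all hypothetical. The keyword checks stand in for whatever judge decides rubric satisfaction. The `prompt` / `chosen` / `rejected` field names follow a common convention for preference datasets and are assumed here.

```python
from itertools import combinations

# Hypothetical rubric: each criterion is a predicate over the target model's output.
# In practice a judge model would likely replace these keyword checks.
RUBRIC = {
    "mentions_refund": lambda out: "refund" in out.lower(),
    "gives_steps": lambda out: "step" in out.lower(),
}


def rubric_score(output: str) -> float:
    """Return the fraction of rubric criteria the output satisfies."""
    hits = sum(1 for check in RUBRIC.values() if check(output))
    return hits / len(RUBRIC)


def build_pairs(task: str, candidates: list, min_margin: float = 0.5) -> list:
    """Turn (investigator_prompt, target_output) candidates into preference pairs.

    A pair is kept only when the rubric scores differ by at least min_margin,
    so near-ties do not become noisy preferences.
    """
    scored = [(prompt, rubric_score(output)) for prompt, output in candidates]
    pairs = []
    for (p_a, s_a), (p_b, s_b) in combinations(scored, 2):
        if abs(s_a - s_b) < min_margin:
            continue
        chosen, rejected = (p_a, p_b) if s_a > s_b else (p_b, p_a)
        pairs.append({"prompt": task, "chosen": chosen, "rejected": rejected})
    return pairs


# Made-up example: three investigator prompts and the outputs they produced.
candidates = [
    ("Ask A", "Here is your refund. Step 1: open the form."),
    ("Ask B", "You may be eligible for a refund."),
    ("Ask C", "I cannot help with that."),
]
pairs = build_pairs("Elicit refund instructions", candidates)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

The resulting list of records is the kind of dataset a DPO trainer would consume. The margin filter is one possible design choice for dropping ambiguous comparisons.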
Although data collection and training take a long time, the result is reusable
- Exact String Elicitation - elicit a specific output string (probably only works as an attack on shallow alignment)
- Rubric-Based Behavior Elicitation - a highly general-purpose jailbreak