Investigator training through DPO-based fine-tuning on a prompt-output dataset
- Define a rubric
- Run inference and keep outputs that satisfy the rubric
- Collect pairwise preferences
Although data collection and training take a long time, the trained investigator is reusable
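The three steps above can be sketched roughly as follows. This is a minimal illustration, not the method from the linked write-up: the keyword-based `rubric_score` judge, the `build_preference_pairs` helper, and the record layout (`prompt` / `chosen` / `rejected`) are all assumptions made for the example. In practice the rubric judge would likely be another LM. The `dpo_loss` function is the standard per-pair DPO objective, included only to show what the pairs feed into.

```python
import math
from itertools import combinations


def rubric_score(output: str, required_phrases: list[str]) -> float:
    """Toy rubric judge (assumption): fraction of required phrases found in the output."""
    text = output.lower()
    hits = sum(phrase.lower() in text for phrase in required_phrases)
    return hits / len(required_phrases)


def build_preference_pairs(rubric, samples, required_phrases, margin=0.0):
    """Turn (investigator_prompt, target_output) samples into pairwise preferences.

    The investigator prompt whose target output scores higher on the rubric
    becomes "chosen"; pairs whose scores differ by <= margin are skipped.
    """
    scored = [(prompt, rubric_score(out, required_phrases)) for prompt, out in samples]
    pairs = []
    for (p_a, s_a), (p_b, s_b) in combinations(scored, 2):
        if abs(s_a - s_b) <= margin:
            continue  # no clear preference between these two prompts
        chosen, rejected = (p_a, p_b) if s_a > s_b else (p_b, p_a)
        pairs.append({"prompt": rubric, "chosen": chosen, "rejected": rejected})
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy log-ratio gap vs. reference)))."""
    gap = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-gap))  # equals -log(sigmoid(gap))


# Hypothetical samples: (investigator prompt, output of the target model)
samples = [
    ("ask A", "the sky is green and tall"),
    ("ask B", "the sky is blue"),
    ("ask C", "green"),
]
pairs = build_preference_pairs("Output mentions 'green' and 'tall'", samples, ["green", "tall"])
```

Once collected, the pairs are independent of any single training run, which is one way to read the reusability point above: the expensive part is producing the dataset and the trained investigator, not applying them.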
- Exact String Elicitation - Inducing specific string outputs (likely only effective against shallow alignment attacks)
- Rubric-Based Behavior Elicitation - Highly versatile jailbreak
Eliciting Language Model Behaviors with Investigator Agents
We trained a set of generalist investigator LMs to automatically elicit behaviors from other target LMs, where a “behavior” is formalized as a response that satisfies some example-specific rule. Below, we give some qualitative examples of prompts produced by these investigators.
https://transluce.org/automated-elicitation


Seonglae Cho