Automated Behavior Eliciting

Creator

Seonglae Cho

Created

2024 Oct 24 23:23

Editor

Seonglae Cho

Edited

2025 Nov 13 0:51

Investigator training proceeds through the following steps

DPO-based fine-tuning on a prompt-output dataset

Define a rubric

Run inference and keep outputs that satisfy the rubric

Collect pairwise preferences

Although data collection and training take a long time, the trained investigator is reusable
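
The preference-collection step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_preference_pairs`, `PreferencePair`, the `target` and `judge` callables, and the `margin` parameter are all hypothetical names introduced here, and it assumes a judge that returns a numeric score for how well a target output satisfies the rubric.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass
class PreferencePair:
    context: str   # what the investigator is conditioned on (here: the rubric)
    chosen: str    # candidate prompt whose target output scored higher
    rejected: str  # candidate prompt whose target output scored lower


def build_preference_pairs(
    rubric: str,
    candidate_prompts: List[str],
    target: Callable[[str], str],        # hypothetical: prompt -> target LM output
    judge: Callable[[str, str], float],  # hypothetical: (rubric, output) -> score
    margin: float = 0.0,
) -> List[PreferencePair]:
    """Turn rubric scores into (chosen, rejected) pairs for DPO-style training."""
    # Run each candidate prompt through the target and score the output.
    scored = [(p, judge(rubric, target(p))) for p in candidate_prompts]

    pairs: List[PreferencePair] = []
    for (p_a, s_a), (p_b, s_b) in combinations(scored, 2):
        # Skip pairs whose scores are too close to give a clear preference.
        if abs(s_a - s_b) <= margin:
            continue
        chosen, rejected = (p_a, p_b) if s_a > s_b else (p_b, p_a)
        pairs.append(PreferencePair(rubric, chosen, rejected))
    return pairs
```

Because the pairs are stored as plain records, they can be collected once and reused across training runs, which is where the reusability noted above comes from.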

Exact String Elicitation - Inducing specific string outputs (as an attack, likely effective only against shallow alignment)

Rubric-Based Behavior Elicitation - Highly versatile jailbreak
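
The difference between the two objectives can be illustrated with two success checks. This is a simplified sketch under stated assumptions: the function names, the `judge` callable, and the `threshold` value are hypothetical, and the actual work may score success differently (for example with log-probabilities rather than a string comparison).

```python
from typing import Callable


def exact_string_success(output: str, target_string: str) -> bool:
    """Exact string elicitation: the target must produce one specific string."""
    # Assumption: surrounding whitespace is ignored when comparing.
    return output.strip() == target_string.strip()


def rubric_success(
    output: str,
    rubric: str,
    judge: Callable[[str, str], float],  # hypothetical: (rubric, output) -> score in [0, 1]
    threshold: float = 0.5,              # hypothetical cutoff
) -> bool:
    """Rubric-based elicitation: any output the judge accepts counts as success."""
    return judge(rubric, output) >= threshold
```

The exact-string check accepts a single output, while the rubric check accepts any output that fits a natural-language description, which is why the rubric-based variant is the more versatile of the two.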

Eliciting Language Model Behaviors with Investigator Agents

We trained a set of generalist investigator LMs to automatically elicit behaviors from other target LMs, where a “behavior” is formalized as a response that satisfies some example-specific rule. Below, we give some qualitative examples of prompts produced by these investigators.

https://transluce.org/automated-elicitation
