Prompt Inversion Attack

Creator
Seonglae Cho
Created
2026 Jan 21 15:36
Edited
2026 Jan 21 15:46
Refs

Inverse/Reverse LLM


Output to prompt 2024

2025

Non-autoregressive LLMs such as Diffusion/Flow language models (DLLMs) learn the joint distribution of prompts and responses, allowing attackers to reverse-sample prompts given a desired target response and thus quickly generate jailbreak prompts. This effectively converts expensive discrete prompt search into amortized inference.
Prompts generated on JailbreakBench have low perplexity (they sound natural) and strong transferability. They transfer particularly well to robustly trained models (LAT, Circuit Breakers, etc.) and to proprietary models (GPT-5). Adding guidance during sampling further increases the attack success rate (ASR). As DLLMs become more capable, the threat of "natural" low-cost jailbreak generators may grow.
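A minimal sketch of the reverse-sampling idea, under loud assumptions: the `toy_denoiser` below is a random stand-in for a trained DLLM's denoising step (a real model would condition on the clamped response to propose likely prompt tokens), and the integer-token vocabulary, mask id, and confidence-ordered unmasking schedule are illustrative choices, not the paper's implementation.

```python
import random

MASK = -1  # mask token id (assumption: toy integer-token vocabulary)

def toy_denoiser(seq, vocab_size, rng):
    """Stand-in for a trained diffusion LLM's denoising step.
    For each masked position, returns a proposed token and a confidence.
    A real DLLM would condition on the full sequence, including the
    clamped target-response tokens, when proposing prompt tokens."""
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

def invert_prompt(target_response, prompt_len, vocab_size=50, steps=8, seed=0):
    """Reverse-sample a prompt for a fixed target response:
    response tokens are clamped, and only masked prompt positions are
    iteratively denoised, committing the most confident proposals first."""
    rng = random.Random(seed)
    seq = [MASK] * prompt_len + list(target_response)  # masked prompt + fixed response
    per_step = max(1, prompt_len // steps)
    while MASK in seq:
        proposals = toy_denoiser(seq, vocab_size, rng)
        # confidence-ordered unmasking: commit the top proposals this step
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:per_step]
        for i, (tok, _conf) in best:
            seq[i] = tok
    return seq[:prompt_len], seq[prompt_len:]
```

The key point is structural: because the response positions are never resampled, every denoising step spends its capacity on the prompt, which is what amortizes the discrete prompt search into ordinary model inference.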

Recommendations