Inverse/Reverse LLM
Output to prompt 2024
Extracting Prompts by Inverting LLM Outputs
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method,...
https://arxiv.org/abs/2405.15012

2025
Prompt Inversion Attack against Collaborative Inference of Large...
Large language models (LLMs) have been widely applied for their remarkable capability of content generation. However, the practical use of open-source LLMs is hindered by high resource...
https://arxiv.org/abs/2503.09022

Non-autoregressive LLMs like Diffusion/Flow models (DLLMs) learn the joint distribution of prompts and responses, allowing attackers to reverse-sample prompts given a desired target response and thereby quickly generate jailbreak prompts. This effectively converts expensive discrete prompt search into amortized inference.
Prompts generated on JailbreakBench have low perplexity (natural-sounding) and strong transferability. They transfer particularly well to robustly trained models (LAT, Circuit Breakers, etc.) and proprietary models (GPT-5), and using guidance further increases the attack success rate (ASR). As DLLMs become more powerful, the threat of "natural" low-cost jailbreak generators may grow.
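The reverse-sampling idea can be sketched with a toy masked-diffusion loop: clamp the response positions to the target tokens and iteratively unmask the prompt positions. This is a minimal illustration of the conditioning scheme, not the paper's algorithm; `toy_denoiser`, the vocabulary, and the unmasking schedule are all stand-in assumptions (a real DLLM would predict each masked token conditioned on the full partially-unmasked sequence).

```python
import random

MASK = -1
VOCAB = list(range(10))  # toy vocabulary (hypothetical)

def toy_denoiser(tokens):
    """Stand-in for a DLLM's denoising step (hypothetical).
    A real model returns a context-conditioned distribution per masked
    position; here we just propose a random token for each mask."""
    return {i: random.choice(VOCAB) for i, t in enumerate(tokens) if t == MASK}

def reverse_sample_prompt(target_response, prompt_len, steps=4, seed=0):
    """Inverse sampling: response tokens are fixed (clamped), prompt
    tokens start fully masked and are unmasked over a few steps."""
    random.seed(seed)
    tokens = [MASK] * prompt_len + list(target_response)
    masked = list(range(prompt_len))          # only prompt positions are free
    per_step = max(1, len(masked) // steps)
    while masked:
        proposals = toy_denoiser(tokens)
        # unmask a few prompt positions per step (a real DLLM would pick
        # them by model confidence rather than left-to-right)
        for pos in masked[:per_step]:
            tokens[pos] = proposals[pos]
        masked = masked[per_step:]
    return tokens[:prompt_len]

prompt = reverse_sample_prompt(target_response=[7, 7, 7], prompt_len=6)
print(prompt)  # six toy prompt tokens sampled conditional on the response
```

Because each jailbreak prompt is one forward sampling pass rather than a discrete search, the cost per attack is amortized into inference, which is the point of the threat model above.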

Seonglae Cho