Inverse/Reverse LLM
Output to prompt 2024
Extracting Prompts by Inverting LLM Outputs
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method,...
https://arxiv.org/abs/2405.15012

2025
Prompt Inversion Attack against Collaborative Inference of Large...
Large language models (LLMs) have been widely applied for their remarkable capability of content generation. However, the practical use of open-source LLMs is hindered by high resource...
https://arxiv.org/abs/2503.09022

Non-autoregressive LLMs like Diffusion/Flow models (DLLMs) learn the joint distribution of prompts and responses, allowing attackers to reverse-sample prompts given a desired target response and thereby quickly generate jailbreak prompts. This effectively converts expensive discrete prompt search into amortized inference.
Prompts generated on JailbreakBench have low perplexity (natural-sounding) and strong transferability. They transfer particularly well to robustly trained models (LAT, Circuit Breakers, etc.) and proprietary models (GPT-5), and using guidance further increases the attack success rate (ASR). As DLLMs become more powerful, the threat of "natural" low-cost jailbreak generators may grow.
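The reverse-sampling idea can be sketched with a toy masked-diffusion loop: clamp the response positions to the target tokens and iteratively unmask the prompt positions. This is a minimal illustration of the conditioning scheme, not the paper's algorithm; `toy_denoiser`, the vocabulary, and the unmasking schedule are all stand-in assumptions (a real DLLM would predict each masked token conditioned on the full partially-unmasked sequence).

```python
import random

MASK = -1
VOCAB = list(range(10))  # toy vocabulary (hypothetical)

def toy_denoiser(tokens):
    """Stand-in for a DLLM's denoising step (hypothetical).
    A real model returns a context-conditioned distribution per masked
    position; here we just propose a random token for each mask."""
    return {i: random.choice(VOCAB) for i, t in enumerate(tokens) if t == MASK}

def reverse_sample_prompt(target_response, prompt_len, steps=4, seed=0):
    """Inverse sampling: response tokens are fixed (clamped), prompt
    tokens start fully masked and are unmasked over a few steps."""
    random.seed(seed)
    tokens = [MASK] * prompt_len + list(target_response)
    masked = list(range(prompt_len))          # only prompt positions are free
    per_step = max(1, len(masked) // steps)
    while masked:
        proposals = toy_denoiser(tokens)
        # unmask a few prompt positions per step (a real DLLM would pick
        # them by model confidence rather than left-to-right)
        for pos in masked[:per_step]:
            tokens[pos] = proposals[pos]
        masked = masked[per_step:]
    return tokens[:prompt_len]

prompt = reverse_sample_prompt(target_response=[7, 7, 7], prompt_len=6)
print(prompt)  # six toy prompt tokens sampled conditional on the response
```

Because each jailbreak prompt is one forward sampling pass rather than a discrete search, the cost per attack is amortized into inference, which is the point of the threat model above.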

Seonglae Cho