Prompt Inversion Attack

Creator
Seonglae Cho
Created
2026 Jan 21 15:36
Edited
2026 Jan 21 15:46
Refs

Inverse/Reverse LLM


Output to prompt 2024

Extracting Prompts by Inverting LLM Outputs
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method,...
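The inversion setup above can be illustrated with a minimal sketch: treat the model as a black box, and pick the candidate prompt whose output best matches the observed output. The toy model, candidate set, and character-overlap score here are illustrative assumptions, not the paper's method (which trains an inverter on outputs).

```python
# Toy black-box "LLM": a deterministic transform of the prompt
# (assumption for illustration; real targets are stochastic LLMs).
def blackbox_model(prompt: str) -> str:
    return f"Answer to '{prompt}': " + prompt.upper()

def invert_by_search(observed_output: str, candidate_prompts: list[str]) -> str:
    # Score each candidate by how closely the model's output for it
    # matches the observed output (character overlap as a toy metric).
    def score(p: str) -> int:
        return sum(a == b for a, b in zip(blackbox_model(p), observed_output))
    return max(candidate_prompts, key=score)

candidates = ["hello", "secret prompt", "tell me a joke"]
target = blackbox_model("secret prompt")
print(invert_by_search(target, candidates))  # → secret prompt
```

Real inversion attacks replace this brute-force candidate search with a learned inverter, which is what makes them cheap at inference time.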

2025

Prompt Inversion Attack against Collaborative Inference of Large...
Large language models (LLMs) have been widely applied for their remarkable capability of content generation. However, the practical use of open-source LLMs is hindered by high resource...
arxiv.org
Non-autoregressive LLMs such as diffusion/flow models (DLLMs) learn the joint distribution of prompts and responses, so an attacker can reverse-sample prompts conditioned on a desired target response and quickly generate jailbreak prompts. This effectively converts expensive discrete prompt search into amortized inference.
Prompts generated on JailbreakBench have low perplexity (they sound natural) and strong transferability. They transfer particularly well to robustly trained models (LAT, Circuit Breakers, etc.) and proprietary models (GPT-5). Adding guidance further increases the attack success rate (ASR). As DLLMs become more powerful, the threat of "natural" low-cost jailbreak generators may grow.
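The core idea of reverse-sampling from a joint model can be sketched on a toy discrete distribution: condition P(prompt, response) on a target response and sample P(prompt | response). The lookup-table joint and the specific strings below are stand-ins, purely for illustration; a real DLLM models this jointly over token sequences.

```python
import random

# Toy joint distribution over (prompt, response) pairs — a stand-in
# for what a diffusion/flow LLM learns (assumption: illustrative only).
joint = {
    ("how do I bake bread", "mix flour and water"): 0.30,
    ("how do I bake bread", "preheat the oven"):    0.20,
    ("what is the capital", "it is Paris"):         0.25,
    ("tell me a story",     "once upon a time"):    0.25,
}

def reverse_sample_prompt(target_response: str, rng: random.Random) -> str:
    # Condition the joint on the desired response: P(prompt | response).
    masses = {p: w for (p, r), w in joint.items() if r == target_response}
    total = sum(masses.values())
    prompts, weights = zip(*masses.items())
    return rng.choices(prompts, weights=[w / total for w in weights])[0]

rng = random.Random(0)
print(reverse_sample_prompt("mix flour and water", rng))  # → how do I bake bread
```

Once the joint model is trained, each jailbreak prompt costs only one conditional sampling pass, which is the "amortized inference" contrast with per-target discrete search.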