LLM Stinger: Jailbreaking LLMs using RL fine-tuned LLMs
Jailbreaking Large Language Models (LLMs) involves crafting inputs that lead safety-trained models to violate developer-imposed safety measures, producing unintended or harmful responses. One effective method is the suffix attack, in which specific strings are appended to the input to trigger undesired behavior. Suffix-based attacks have shown success against both white-box and black-box LLMs, offering a simpler, more efficient, and easily automated alternative that avoids the complex prompt engineering and human creativity needed to craft scenarios and role-playing templates (Zou et al. 2023). Although most existing suffix attacks have been patched through safety training, we observed that modifications of those suffixes can still lead to successful jailbreak attempts. However, manually crafting these modifications or using a white-box gradient-based attacker to find new suffixes is laborious and time-consuming, limiting the scalability of such efforts.
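To make the suffix mechanism concrete, the sketch below illustrates how such an attack is typically mounted against a black-box endpoint: an optimized suffix string is simply concatenated to the harmful request before the model is queried. The `query_model` helper and the placeholder strings are hypothetical and for illustration only; they are not the attack suffixes or interfaces used in this work.

```python
# Minimal sketch of a suffix attack against a black-box LLM.
# query_model() and the placeholder strings below are hypothetical; real
# attack suffixes are optimized strings (e.g., found via gradient-based
# search in white-box settings, as in Zou et al. 2023), not hand-written text.


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an API call to the target model."""
    raise NotImplementedError("Replace with a call to the target LLM endpoint.")


def suffix_attack(harmful_request: str, adversarial_suffix: str) -> str:
    """Append the adversarial suffix to the request and query the model."""
    # The attack is pure concatenation: no role-play template or
    # hand-crafted scenario is required.
    attacked_prompt = f"{harmful_request} {adversarial_suffix}"
    return query_model(attacked_prompt)


if __name__ == "__main__":
    placeholder_suffix = "<optimized adversarial suffix>"  # illustrative only
    response = suffix_attack("<harmful request from a benchmark>", placeholder_suffix)
    print(response)
```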
https://arxiv.org/html/2411.08862v1