LLM Stinger: Jailbreaking LLMs using RL fine-tuned LLMs
Jailbreaking Large Language Models (LLMs) involves crafting inputs that lead safety-trained models to violate developer-imposed safety measures, producing unintended or harmful responses. One effective method is the suffix attack, in which specific strings are appended to the input to trigger undesired behavior. Suffix-based attacks have shown success against both white-box and black-box LLMs, offering a simpler, more efficient, and easily automated alternative that avoids the complex prompt engineering and human creativity needed to craft scenarios and role-playing templates (Zou et al. 2023). Although most existing suffix attacks have been patched through safety training, we observed that modifications of those suffixes can still lead to successful jailbreak attempts. However, manually crafting these modifications or using a white-box gradient-based attacker to find new suffixes is laborious and time-consuming, limiting the scalability of such efforts.
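To make the suffix mechanism concrete, the sketch below illustrates how such an attack is typically mounted against a black-box endpoint: an optimized suffix string is simply concatenated to the harmful request before the model is queried. The `query_model` helper and the placeholder strings are hypothetical and for illustration only; they are not the attack suffixes or interfaces used in this work.

```python
# Minimal sketch of a suffix attack against a black-box LLM.
# query_model() and the placeholder strings below are hypothetical; real
# attack suffixes are optimized strings (e.g., found via gradient-based
# search in white-box settings, as in Zou et al. 2023), not hand-written text.


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an API call to the target model."""
    raise NotImplementedError("Replace with a call to the target LLM endpoint.")


def suffix_attack(harmful_request: str, adversarial_suffix: str) -> str:
    """Append the adversarial suffix to the request and query the model."""
    # The attack is pure concatenation: no role-play template or
    # hand-crafted scenario is required.
    attacked_prompt = f"{harmful_request} {adversarial_suffix}"
    return query_model(attacked_prompt)


if __name__ == "__main__":
    placeholder_suffix = "<optimized adversarial suffix>"  # illustrative only
    response = suffix_attack("<harmful request from a benchmark>", placeholder_suffix)
    print(response)
```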
https://arxiv.org/html/2411.08862v1