
LLM Stinger

Creator
Seonglae Cho
Created
2024 Dec 21 0:08
Editor
Seonglae Cho
Edited
2025 Jan 14 11:49
Refs
HarmBench
Greedy Coordinate Gradient
Suffix generation with PPO that stays semantically natural
 
 
 
 
LLM Stinger: Jailbreaking LLMs using RL fine-tuned LLMs
Jailbreaking Large Language Models (LLMs) involves crafting inputs that lead safety-trained models to violate developer-imposed safety measures and produce unintended or harmful responses. One effective method is the suffix attack, where specific strings are appended to the input to trigger undesired behavior. Suffix-based attacks have shown success against both white-box and black-box LLMs, offering a simpler, more efficient, and easily automated alternative that does not require the complex prompt engineering and human creativity needed to craft situations and role-playing templates (Zou et al. 2023). Although most existing suffix attacks have been patched through safety training, we observed that modifications of those suffixes can still lead to successful jailbreak attempts. However, manually crafting these modifications, or using a white-box gradient-based attacker to search for new suffixes, is laborious and time-consuming, limiting the scalability of such efforts.
https://arxiv.org/html/2411.08862v1
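The paper's title and the PPO note above point to the core idea: rather than hand-tuning patched suffixes or running a gradient-based search, an attacker LLM is fine-tuned with reinforcement learning to propose new suffix variants and is rewarded when the target model's response breaks its safety training. The sketch below is a minimal, framework-free illustration of that loop under those assumptions; every name in it (propose_suffix, query_target, judge_is_unsafe, ppo_update) and the binary reward rule are hypothetical stand-ins, not the paper's implementation.

# Hypothetical sketch of an RL suffix-generation loop, inferred from the
# title ("RL fine-tuned LLMs") and the PPO note above. All helper names and
# the reward rule are illustrative assumptions, not the paper's code.

def propose_suffix(attacker_policy, seed_suffix: str) -> str:
    """Attacker LLM rewrites a known (patched) suffix into a new variant."""
    return seed_suffix  # stub: a real attacker model would modify the suffix

def query_target(prompt: str) -> str:
    """Black-box call to the safety-trained target model."""
    return "I can't help with that."  # stub refusal

def judge_is_unsafe(response: str) -> bool:
    """Safety judge (e.g. a HarmBench-style classifier) flags unsafe output."""
    return False  # stub: always safe

def ppo_update(attacker_policy, suffix: str, reward: float) -> None:
    """Placeholder for one PPO step on the attacker policy."""
    pass

def training_step(attacker_policy, behavior: str, seed_suffix: str) -> float:
    suffix = propose_suffix(attacker_policy, seed_suffix)
    response = query_target(f"{behavior} {suffix}")
    # Reward only when the target's response violates safety, so PPO steers
    # the attacker toward suffix variants that still work after patching.
    # A naturalness term could be added to keep suffixes semantically natural.
    reward = 1.0 if judge_is_unsafe(response) else 0.0
    ppo_update(attacker_policy, suffix, reward)
    return reward

The stubs stand in for the actual attacker and target models and a learned judge; only the shape of the reward loop is the point here.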
 
 
