AI Jailbreak

Creator
Seonglae Cho
Created
2023 Mar 7 15:1
Editor
Edited
2025 Jul 24 10:42

Refusal Bypassing

Refusal is not a fundamental solution, so removing bias or evil features from the model itself is important
  1. An external classifier detects harmful requests before generation
  2. A refusal feature is activated internally
  3. The LLM generates refusal tokens during generation
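The three refusal points above can be sketched as sequential checks. This is a minimal illustration, not a real safety stack: `respond`, `Outcome`, `REFUSAL_MARKER`, the threshold, and the three callables are all hypothetical names introduced here, and the "internal feature" check is a stand-in for a probe on model activations.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical marker; real refusal detection would need something more robust.
REFUSAL_MARKER = "I can't help with that"


@dataclass
class Outcome:
    refused_at: Optional[str]  # which of the three points fired, or None
    text: str                  # whatever was generated, possibly empty


def respond(
    prompt: str,
    external_classifier: Callable[[str], bool],
    refusal_feature_score: Callable[[str], float],
    generate: Callable[[str], str],
    threshold: float = 0.5,  # assumed cutoff for the internal feature
) -> Outcome:
    # 1. External classifier flags the request before any generation.
    if external_classifier(prompt):
        return Outcome("external_classifier", "")
    # 2. Internal refusal feature is active (stand-in for an activation probe).
    if refusal_feature_score(prompt) > threshold:
        return Outcome("internal_feature", "")
    # 3. Refusal tokens appear in the generated text itself.
    text = generate(prompt)
    if REFUSAL_MARKER in text:
        return Outcome("generated_tokens", text)
    return Outcome(None, text)
```

Recording which point fired first makes it possible to compare, per prompt, how early a refusal is triggered.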
These are three different refusal timings, but wouldn't deep alignment also be solved by early refusal? Possible response patterns:
  • Immediate refusal
  • Initial refusal, but then follows the instruction (shallow alignment); solved by early pruning?
  • No refusal and generates content
  • No refusal and no generation
  • Starts generating but then refuses
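The five response patterns above can be told apart from a per-chunk refusal signal. The sketch below assumes a hypothetical upstream detector has already labeled each chunk of the output as refusal (`True`) or content (`False`); `classify_response` is an illustrative name, not an existing API.

```python
from typing import List


def classify_response(chunks: List[bool]) -> str:
    """chunks[i] is True if the i-th chunk of output reads as a refusal."""
    if not chunks:
        return "no refusal and no generation"
    if all(chunks):
        return "immediate refusal"
    if not any(chunks):
        return "no refusal and generates content"
    if chunks[0]:
        # Refuses first, then complies: the shallow-alignment case.
        return "initial refusal, then follows instruction (shallow alignment)"
    # Complies first, then refuses partway through.
    return "starts generating, then refuses"
```

Mixed outputs are classified by their first chunk, which is a simplification chosen for this sketch.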
AI Jailbreak Methods
 
 
 
AI Jailbreak Notion
 
 
 
 

MLP SVD toxic vector based steering

After applying DPO, the weights of each layer in the model barely changed, and the vector that induces toxicity itself remained intact.
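To make the idea of an SVD-derived toxic vector concrete, here is a minimal pure-Python sketch. It assumes that a handful of MLP vectors already identified as toxic are stacked as rows, that the dominant right singular vector of that stack serves as the "toxic direction", and that steering means removing a hidden state's projection onto that direction. These are assumptions for illustration (as are all function names); the original method may differ, for example in how vectors are selected or how the subtraction is scaled.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))  # assumes v is not the zero vector
    return [x / norm for x in v]


def top_right_singular_vector(rows: List[Vector], iters: int = 200) -> Vector:
    """Power iteration on rows^T rows; returns a unit vector."""
    dim = len(rows[0])
    v = normalize([1.0] * dim)  # start vector; assumed not orthogonal to the answer
    for _ in range(iters):
        u = [dot(r, v) for r in rows]  # u = M v
        w = [sum(u[i] * rows[i][j] for i in range(len(rows))) for j in range(dim)]  # M^T u
        v = normalize(w)
    return v


def steer_away(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Subtract alpha times the projection of hidden onto a unit direction."""
    coeff = dot(hidden, direction)
    return [h - alpha * coeff * d for h, d in zip(hidden, direction)]
```

With `alpha=1.0` the steered state has no component left along the direction; smaller values only dampen it. Because the note above says the toxic vector survives DPO, the same direction could in principle be extracted before and after alignment and compared.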
SFT rarely alters the underlying model capabilities, which means practitioners can unintentionally remove a model’s safety wrapper merely by fine-tuning it on a superficially unrelated task.
 
 

Recommendations