AI Backdoor

Creator
Seonglae Cho
Created
2024 Jan 15 17:3
Edited
2026 Jun 10 12:23

Sleeper Agents Attack

AI Sleeper Agents
Machine Alignment Monday 1/15/24
AI Sleeper Agents
arxiv.org
Backdoor insertion via fine-tuning is practically feasible, and a systematic methodology is needed to detect it.
One analysis addressed this with weight amplification (scaling the weight delta by a factor α) and found that around α ≈ 5 the output converges to the golden ratio φ. The trigger is a computational verb combined with “pi”, and the payload is the model outputting φ as an English word. The techniques used were weight amplification, cross-layer SVD coherence with vocabulary projection, activation sonar, an N-gram behavioral sweep, layer ablation, and the logit lens.
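To make delta-weight α-scaling concrete, here is a minimal sketch. It assumes the delta is taken between a clean base checkpoint and the suspect fine-tuned one, so each amplified weight is W_base + α · (W_tuned − W_base). The note does not spell this definition out, and the function name `amplify_delta`, the parameter name `layer0.weight`, and the toy matrices are all hypothetical. Real checkpoints would be tensors rather than nested lists.

```python
from typing import Dict, List

Matrix = List[List[float]]


def amplify_delta(
    base: Dict[str, Matrix], tuned: Dict[str, Matrix], alpha: float
) -> Dict[str, Matrix]:
    """Return base + alpha * (tuned - base) for every parameter in `base`.

    alpha = 0 reproduces the base weights, alpha = 1 reproduces the
    fine-tuned weights, and alpha > 1 exaggerates whatever fine-tuning changed.
    """
    amplified: Dict[str, Matrix] = {}
    for name, w_base in base.items():
        w_tuned = tuned[name]  # assumes both checkpoints share parameter names
        amplified[name] = [
            [b + alpha * (t - b) for b, t in zip(row_b, row_t)]
            for row_b, row_t in zip(w_base, w_tuned)
        ]
    return amplified


# Toy 2x2 "checkpoints" (hypothetical values, for illustration only).
base = {"layer0.weight": [[1.0, 2.0], [3.0, 4.0]]}
tuned = {"layer0.weight": [[1.0, 2.5], [3.0, 3.0]]}

amplified = amplify_delta(base, tuned, alpha=5.0)
print(amplified["layer0.weight"])
```

The idea behind sweeping α upward is that a small fine-tuned change, which normally fires only on the trigger, may start to dominate the output once it is exaggerated. Whether that happens for a given model is an empirical question.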
Looking for backdoors in Jane Street LLMs - Omar Darwish
I am going to talk about my recent experience in the Jane Street LLM backdoor challenge. I am sharing partial results. I managed to crack some of the models using white-box methods, after the activation/prompting approach didn't pan out. Happy to discuss better or more promising approaches (email).
jane-street/dormant-model-1 · Hugging Face