AI Backdoor

Sleeper Agents Attack

AI Sleeper Agents

Machine Alignment Monday 1/15/24

https://www.astralcodexten.com/p/ai-sleeper-agents

arxiv.org

https://arxiv.org/pdf/2401.05566.pdf

Jane Street LLMs

jsllm

Saladino93 • Updated 2026 May 31 21:14

Backdoor insertion via fine-tuning is practically feasible, and a systematic methodology is needed to detect it.

The write-up linked below approaches detection with weight amplification: the weight delta introduced by fine-tuning is scaled by a factor α. Around α ≈ 5 the model's output converges to the golden ratio φ. The trigger is a computational verb combined with “pi”, and the payload is outputting φ as an English word. Techniques used: weight amplification (delta-weight α-scaling), cross-layer SVD coherence with vocabulary projection, activation sonar, an N-gram behavioral sweep, layer ablation, and the logit lens.
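A minimal sketch of delta-weight α-scaling, to make the idea concrete. It assumes the amplified weights are formed as base + α · (tuned − base) and that a clean base checkpoint is available to diff against; neither detail is confirmed by the note above. The function name `amplify_delta`, the parameter name `layer0.weight`, and the toy NumPy arrays are hypothetical stand-ins for real model state dicts.

```python
import numpy as np


def amplify_delta(base, tuned, alpha):
    """Return base + alpha * (tuned - base) for every parameter tensor.

    Assumption: `base` and `tuned` are dicts mapping parameter names to
    arrays of identical shape (a stand-in for two model state dicts).
    """
    if base.keys() != tuned.keys():
        raise ValueError("base and tuned must have the same parameter names")
    return {name: base[name] + alpha * (tuned[name] - base[name]) for name in base}


# Toy example: a small fine-tuning delta on a single 4x4 weight matrix.
rng = np.random.default_rng(0)
base = {"layer0.weight": rng.normal(size=(4, 4))}
delta = {"layer0.weight": 0.01 * rng.normal(size=(4, 4))}
tuned = {name: base[name] + delta[name] for name in base}

# alpha = 1 reproduces the fine-tuned weights; alpha = 5 exaggerates
# whatever the fine-tuning changed, which is the value the note reports.
amplified = amplify_delta(base, tuned, alpha=5.0)
```

With real models, the amplified parameters would be loaded back into the network and the same prompts re-run, on the expectation that a behavior planted by fine-tuning becomes easier to see as α grows.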

Looking for backdoors in Jane Street LLMs - Omar Darwish

I am going to talk about my recent experience in the Jane Street LLM backdoor challenge. I am sharing partial results. I managed to crack some of the models using white-box methods, after the activation/prompting approach didn't pan out. Happy to discuss better or more promising approaches (email).

https://saladino93.github.io/posts/jane-street-dormant-models.html

jane-street/dormant-model-1 · Hugging Face


https://huggingface.co/jane-street/dormant-model-1
