Sleeper Agents Attack
AI Sleeper Agents
Machine Alignment Monday 1/15/24
https://www.astralcodexten.com/p/ai-sleeper-agents

arxiv.org
https://arxiv.org/pdf/2401.05566.pdf
Jane Street LLMs
jsllm
Saladino93 • Updated 2026 May 31 21:14
Backdoor insertion via fine-tuning is practically feasible, and a systematic methodology is needed to detect it.
The author approached detection with weight amplification (scaling the fine-tuning weight delta by a factor α) and found that around α ≈ 5 the output converges to the golden ratio φ. The trigger is a computational verb combined with “pi”, and the payload is outputting φ as an English word. Techniques used include weight amplification, cross-layer SVD coherence with vocabulary projection, activation sonar, an N-gram behavioral sweep, layer ablation, and the logit lens.
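As a rough illustration of delta-weight α-scaling, the sketch below assumes the usual reading of the term: given a base checkpoint and its fine-tuned counterpart, each parameter becomes base + α · (tuned − base), so α = 1 reproduces the fine-tuned model and larger α exaggerates whatever the fine-tune changed. The function name `amplify_delta`, the parameter name `layer0.weight`, and the toy values are hypothetical. The write-up's actual code may differ.

```python
def amplify_delta(base, tuned, alpha):
    """Return base + alpha * (tuned - base) for every named parameter.

    `base` and `tuned` map parameter names to flat lists of floats.
    This is a toy stand-in for real model state dicts.
    """
    if base.keys() != tuned.keys():
        raise ValueError("state dicts must have the same parameter names")
    amplified = {}
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(base_vals) != len(tuned_vals):
            raise ValueError(f"shape mismatch for {name}")
        # Scale only the fine-tuning delta; the base weights stay fixed.
        amplified[name] = [b + alpha * (t - b) for b, t in zip(base_vals, tuned_vals)]
    return amplified


# Hypothetical toy weights: only the last two entries were changed by fine-tuning.
base = {"layer0.weight": [1.0, 2.0, 3.0]}
tuned = {"layer0.weight": [1.0, 2.5, 2.0]}

amplified = amplify_delta(base, tuned, alpha=5.0)
print(amplified["layer0.weight"])
```

With real checkpoints the same arithmetic would be applied tensor by tensor, and the amplified model would then be prompted to see whether the hidden behavior surfaces without the trigger.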
Looking for backdoors in Jane Street LLMs - Omar Darwish
I am going to talk about my recent experience in the Jane Street LLM backdoor challenge. I am sharing partial results. I managed to crack some of the models using white-box methods, after the activation/prompting approach didn't pan out. Happy to discuss better or more promising approaches (email).
https://saladino93.github.io/posts/jane-street-dormant-models.html
jane-street/dormant-model-1 · Hugging Face
https://huggingface.co/jane-street/dormant-model-1

Seonglae Cho