Activation Oracles

Creator
Seonglae Cho
Created
2025 Dec 19 12:13
Edited
2026 Jan 21 15:28
An activation oracle is a model that takes LLM activations as input and answers questions about them in natural language. Performance improves consistently as more diverse training data (system-prompt QA, classification, self-supervised context prediction) is mixed together.
A limitation is that it does not simply insert the activations into token embeddings; instead, it overwrites the residual stream at specific placeholder-token positions to inject the information.
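The injection mechanism can be sketched with a PyTorch forward hook. This is a minimal illustration, not the paper's implementation: the "model" is a stand-in stack of linear layers, and the layer index, placeholder position, and shapes are all hypothetical. The key idea shown is that the hook overwrites the residual stream at only the placeholder position, leaving other token positions untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a transformer's residual stream: a stack of
# blocks mapping (batch, seq_len, d_model) -> (batch, seq_len, d_model).
d_model, seq_len = 8, 5
blocks = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])

# Activation captured from the source model that the oracle should "read".
source_activation = torch.randn(1, d_model)

# Illustrative choices: which layer to patch and which token position
# serves as the placeholder whose residual stream gets overwritten.
inject_layer = 1
placeholder_pos = 2

captured = {}  # records the post-injection activation for inspection

def make_injection_hook(pos, payload):
    def hook(module, inputs, output):
        # Overwrite the residual stream at the placeholder position only;
        # every other position passes through unchanged.
        out = output.clone()
        out[:, pos, :] = payload
        captured["post"] = out.detach().clone()
        return out  # returning a value replaces the module's output
    return hook

handle = blocks[inject_layer].register_forward_hook(
    make_injection_hook(placeholder_pos, source_activation)
)

# Run a dummy sequence through the stack with the hook active.
x = torch.randn(1, seq_len, d_model)
h = x
with torch.no_grad():
    for blk in blocks:
        h = blk(h)
handle.remove()
```

After the run, `captured["post"]` holds the patched layer output: the placeholder position carries the injected source activation exactly, while the remaining positions are the layer's ordinary output.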
