Activation Oracles

Creator
Seonglae Cho
Created
2025 Dec 19 12:13
Edited
2026 Jan 21 15:28
An activation oracle is a model that takes LLM activations as input and answers questions about them in natural language. Performance improves consistently as more diverse training data (system-prompt QA, classification, self-supervised context prediction) is mixed together.
A limitation is that it does not simply insert the activations into token embeddings; instead, it overwrites the residual stream at specific placeholder-token positions to inject the information.
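The injection mechanism can be sketched with a PyTorch forward hook. This is a minimal illustration, not the paper's implementation: the "model" is a stand-in stack of linear layers, and the layer index, placeholder position, and shapes are all hypothetical. The key idea shown is that the hook overwrites the residual stream at only the placeholder position, leaving other token positions untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a transformer's residual stream: a stack of
# blocks mapping (batch, seq_len, d_model) -> (batch, seq_len, d_model).
d_model, seq_len = 8, 5
blocks = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])

# Activation captured from the source model that the oracle should "read".
source_activation = torch.randn(1, d_model)

# Illustrative choices: which layer to patch and which token position
# serves as the placeholder whose residual stream gets overwritten.
inject_layer = 1
placeholder_pos = 2

captured = {}  # records the post-injection activation for inspection

def make_injection_hook(pos, payload):
    def hook(module, inputs, output):
        # Overwrite the residual stream at the placeholder position only;
        # every other position passes through unchanged.
        out = output.clone()
        out[:, pos, :] = payload
        captured["post"] = out.detach().clone()
        return out  # returning a value replaces the module's output
    return hook

handle = blocks[inject_layer].register_forward_hook(
    make_injection_hook(placeholder_pos, source_activation)
)

# Run a dummy sequence through the stack with the hook active.
x = torch.randn(1, seq_len, d_model)
h = x
with torch.no_grad():
    for blk in blocks:
        h = blk(h)
handle.remove()
```

After the run, `captured["post"]` holds the patched layer output: the placeholder position carries the injected source activation exactly, while the remaining positions are the layer's ordinary output.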
