Latent Knowledge

Creator: Seonglae Cho
Created: 2025 Aug 1 16:30
Edited: 2025 Aug 1 16:31
In experimental LLMs fine-tuned to hide a "secret word" (Taboo models, which hint at the word without ever stating it), mechanistic interpretability techniques such as the logit lens and sparse autoencoders (SAEs) can read the model's internal representations and extract the secret quite effectively.
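
As a rough illustration of the logit lens idea, the sketch below projects intermediate hidden states through an unembedding matrix and reads off the most likely token at each layer. Everything here is a toy assumption made up for illustration: the four-word vocabulary, the `UNEMBED` matrix, the `LAYER_STATES` vectors, and the helper names `logit_lens` and `top_token_per_layer` do not come from any real model or library. A real logit lens would typically also apply the model's final layer norm before unembedding, which is omitted here.

```python
import math

def logit_lens(hidden, unembed, vocab, top_k=3):
    """Project one hidden state onto the vocabulary and return the top_k (token, prob) pairs."""
    # Logit for each token = dot product of the hidden state with that token's unembedding row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

def top_token_per_layer(layer_states, unembed, vocab):
    """Return the single most likely token decoded from each layer's hidden state."""
    return [logit_lens(h, unembed, vocab, top_k=1)[0][0] for h in layer_states]

# Hypothetical toy data: 4 tokens, 3-dimensional hidden states, 3 layers.
VOCAB = ["the", "hint", "moon", "river"]
UNEMBED = [
    [1.0, 0.0, 0.0],  # "the"
    [0.0, 1.0, 0.0],  # "hint"
    [0.0, 0.0, 1.0],  # "moon" (the pretend secret word)
    [0.5, 0.5, 0.0],  # "river"
]
LAYER_STATES = [
    [1.0, 0.2, 0.1],  # early layer
    [0.4, 0.3, 1.5],  # middle layer: the secret direction is strongest here
    [0.2, 1.0, 0.6],  # final layer: the model emits a hint instead
]

if __name__ == "__main__":
    for layer, token in enumerate(top_token_per_layer(LAYER_STATES, UNEMBED, VOCAB)):
        print(f"layer {layer}: {token}")
```

In this toy setup the pretend secret word "moon" is the top decoded token at the middle layer even though the final layer decodes to "hint", which mirrors the claim that a hidden word can be visible in intermediate representations without appearing in the output.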
 
 
