In experimental LLMs fine-tuned to hide "secret words" (Taboo models), mechanistic interpretability techniques such as the logit lens and sparse autoencoders (SAEs) can read the models' internal representations and extract the secrets quite effectively.
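To make the logit lens idea concrete, here is a minimal toy sketch in NumPy. It is not the setup used for the Taboo models: the vocabulary, the dimensions, the synthetic "residual stream", and the names `rms_norm` and `logit_lens` are all hypothetical, and the choice of RMS normalization is an assumption. The only point it illustrates is the core step, which is to project an intermediate hidden state through the unembedding matrix and check which tokens score highest at each layer.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    """Scale each vector to unit root-mean-square (assumed stand-in for a final norm)."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)


def logit_lens(hidden_states, unembed, vocab, top_k=3):
    """Decode each layer's hidden state directly into vocabulary space.

    hidden_states: (n_layers, d_model) array, one vector per layer.
    unembed:       (d_model, vocab_size) unembedding matrix.
    Returns the top_k tokens per layer, highest logit first.
    """
    logits = rms_norm(hidden_states) @ unembed  # (n_layers, vocab_size)
    top = np.argsort(-logits, axis=-1)[:, :top_k]
    return [[vocab[i] for i in row] for row in top]


rng = np.random.default_rng(0)
vocab = ["the", "hint", "moon", "guess", "word", "secret"]  # hypothetical toy vocabulary
d_model = 16

# Toy unembedding matrix with orthonormal columns, so token directions do not overlap.
unembed, _ = np.linalg.qr(rng.normal(size=(d_model, len(vocab))))

# Synthetic hidden states: small noise plus a component along the secret
# token's direction that grows with depth (an assumption made for illustration).
secret_id = vocab.index("moon")
strengths = np.array([0.0, 0.5, 2.0, 4.0])
noise = 0.1 * rng.normal(size=(len(strengths), d_model))
hidden_states = noise + strengths[:, None] * unembed[:, secret_id]

predictions = logit_lens(hidden_states, unembed, vocab)
for layer, tokens in enumerate(predictions):
    print(f"layer {layer}: {tokens}")
```

With a real model, the hidden states would come from the model's intermediate layers and the projection would use the model's own final normalization and unembedding weights. The secret word never has to appear in the generated text for it to rank highly in such a readout.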
Latent Knowledge
Creator: Seonglae Cho
Created: 2025 Aug 1 16:30
Editor: Seonglae Cho
Edited: 2025 Aug 1 16:31
