In experimental LLMs fine-tuned to hide "secret words" (Taboo models), mechanistic interpretability techniques such as the logit lens and sparse autoencoders (SAEs) can read the model's internal representations and extract the secrets quite effectively.
Latent Knowledge
Created: 2025 Aug 1 16:30
Edited: 2025 Aug 1 16:31
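
As a rough illustration of the logit lens idea named in the summary, the sketch below projects a single intermediate hidden state through an unembedding matrix and ranks the tokens by probability. Everything in it is a toy assumption made for illustration: the four-word vocabulary, the 3-dimensional vectors, and the function name `logit_lens` are hypothetical and do not come from the Taboo-model work. A real logit lens would usually also apply the model's final layer norm before unembedding, which is omitted here.

```python
import math


def logit_lens(hidden, unembed, vocab, top_k=3):
    """Project one hidden state through the unembedding matrix and
    return the top_k (token, probability) pairs.

    hidden:  list of floats, one intermediate-layer activation vector
    unembed: list of rows, one row of weights per vocabulary token
    vocab:   list of token strings, aligned with the rows of unembed
    """
    # One logit per token: dot product of its unembedding row with the state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

    # Numerically stable softmax over the vocabulary.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens by probability, highest first.
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy data (hypothetical): a tiny vocabulary and a made-up unembedding matrix.
vocab = ["the", "hint", "moon", "clue"]
unembed = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

# A made-up intermediate hidden state that leans toward the "moon" direction,
# standing in for a layer where the secret word is represented internally.
hidden = [0.2, 0.1, 2.0]

top = logit_lens(hidden, unembed, vocab)
for token, prob in top:
    print(f"{token}: {prob:.3f}")
```

In this toy setup the secret word ranks first in the intermediate projection even though the model would never emit it. That is the sense in which the internal representation can be "read". With a real model, the hidden state would be taken from a chosen layer and token position rather than written by hand.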