In exploratory experiments on LLMs fine-tuned to hide "secret words" (Taboo models), mechanistic interpretability techniques such as the logit lens and sparse autoencoders (SAEs) can read the models' internal representations and extract the secrets quite effectively.
https://arxiv.org/pdf/2505.14352
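As a rough illustration of the logit lens idea mentioned above, the sketch below projects an intermediate hidden state through an unembedding matrix and reads off the most likely tokens. Everything in it is hypothetical: the four-word vocabulary, the matrix `W_U`, the hidden state values, and the helper name `logit_lens` are toy assumptions, not taken from the paper. A real logit lens would use the model's actual unembedding weights and would normally apply the final layer norm first, which is omitted here.

```python
import math

# Hypothetical toy vocabulary and unembedding matrix (one row per token).
# These numbers are made up for illustration only.
VOCAB = ["the", "moon", "ship", "smile"]
W_U = [
    [0.1, 0.0, 0.2],
    [0.9, 0.8, -0.1],
    [-0.3, 0.2, 0.5],
    [0.0, -0.4, 0.1],
]


def logit_lens(hidden, unembed, vocab, top_k=2):
    """Project a hidden state onto the vocabulary and return the top_k
    (token, probability) pairs, most probable first."""
    # Logit for each token is the dot product of its unembedding row
    # with the hidden state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical residual-stream activation at some middle layer.
hidden_mid_layer = [1.0, 1.2, 0.0]
top = logit_lens(hidden_mid_layer, W_U, VOCAB)
print(top)
```

In this toy setup the token whose unembedding row aligns best with the hidden state ranks first, which is the sense in which a secret word could surface from intermediate activations even when the model never outputs it.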
Seonglae Cho