Latent Knowledge

Creator

Seonglae Cho

Created

2025 Aug 1 16:30

Editor

Seonglae Cho

Edited

2025 Aug 1 16:31

Refs

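In experimental LLMs fine-tuned to hide a "secret word" (Taboo models), mechanistic interpretability techniques such as the logit lens and sparse autoencoders (SAEs) can read the model's internal representations and extract the secret quite effectively, even though the model never states the word in its output.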
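To make the logit lens idea concrete, here is a minimal toy sketch in pure Python. It is not the code from the referenced paper. The vocabulary, the 8-dimensional "unembedding" rows, and the mixing weights are all hypothetical values chosen for illustration. In a real model, the same projection would use the model's own final layer norm and unembedding matrix on an intermediate residual-stream activation.

```python
import math

def layer_norm(h, eps=1e-5):
    """Center and scale a vector (no learned gain/bias in this toy version)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def logit_lens(hidden, unembed, vocab, top_k=3):
    """Project a hidden state onto the vocabulary and return the top-k (token, logit) pairs."""
    normed = layer_norm(hidden)
    logits = [sum(w * x for w, x in zip(row, normed)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [(vocab[i], logits[i]) for i in ranked[:top_k]]

# Hypothetical toy vocabulary and unembedding matrix (one orthogonal row per token).
vocab = ["cat", "moon", "river", "smile", "clock"]
unembed = [
    [1, -1, 1, -1, 1, -1, 1, -1],   # cat
    [1, 1, -1, -1, 1, 1, -1, -1],   # moon
    [1, -1, -1, 1, 1, -1, -1, 1],   # river
    [1, 1, 1, 1, -1, -1, -1, -1],   # smile
    [1, -1, 1, -1, -1, 1, -1, 1],   # clock
]

# Assumed intermediate activation: dominated by the secret token's direction,
# with weaker contributions from two other tokens and a constant offset.
secret = "moon"
moon, river, cat = unembed[1], unembed[2], unembed[0]
hidden = [2.0 * m + 0.5 * r + 0.3 * c + 0.1 for m, r, c in zip(moon, river, cat)]

top = logit_lens(hidden, unembed, vocab)
print(top[0][0])  # -> moon
```

The point of the sketch is that the secret token ranks first when the hidden state is projected onto the vocabulary, regardless of what the model would actually emit as text. An SAE-based readout would instead look for sparse latent features that activate on the secret concept; that is not shown here.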

https://arxiv.org/pdf/2505.14352
