No SAE (sparse autoencoder) needed: plain logistic regression performs well as a probe
Try training token-level probes — LessWrong
TL;DR: I train a probe to detect falsehoods at the token level, i.e. to highlight the specific tokens that make a statement false. It worked surprising…
https://www.lesswrong.com/posts/kxiizuSa3sSi4TJsN/try-training-token-level-probes
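
To make the idea concrete, here is a minimal sketch of a token-level logistic regression probe. It is not the linked post's code. Everything in it is an assumption for illustration: the activations are synthetic (random noise shifted along a hypothetical "falsehood" direction), the names `train_probe`, `token_scores`, and `fake_activation` are made up, and plain gradient descent stands in for whatever solver one would actually use. In practice each row would be a residual-stream activation for one token, labeled 1 if that token makes the statement false.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(acts, labels, lr=0.1, epochs=100):
    """Fit logistic regression by full-batch gradient descent.

    acts: list of per-token activation vectors (one row per token).
    labels: 1 if the token is marked as a falsehood token, else 0.
    """
    dim = len(acts[0])
    n = len(acts)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(acts, labels):
            # Gradient of the log loss w.r.t. the logit is (prediction - label).
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def token_scores(w, b, acts):
    # One probability per token, so individual tokens can be highlighted.
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in acts]


# Synthetic stand-in for model activations (assumption, not real data).
rng = random.Random(0)
DIM = 8
direction = [rng.gauss(0, 1) for _ in range(DIM)]  # hypothetical feature direction


def fake_activation(is_false_token):
    noise = [rng.gauss(0, 1) for _ in range(DIM)]
    shift = 2.0 if is_false_token else -2.0
    return [n + shift * d for n, d in zip(noise, direction)]


labels = [rng.randint(0, 1) for _ in range(400)]
acts = [fake_activation(y) for y in labels]

# Train on the first 300 tokens, evaluate on the remaining 100.
w, b = train_probe(acts[:300], labels[:300])
scores = token_scores(w, b, acts[300:])
accuracy = sum((s > 0.5) == bool(y) for s, y in zip(scores, labels[300:])) / 100
print(f"held-out token accuracy: {accuracy:.2f}")
```

The probe outputs one score per token rather than one per statement, which is what allows highlighting the specific tokens that carry the falsehood. With real activations a library solver would likely replace the hand-written loop.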

Seonglae Cho