Entity recognition vector
Using sparse autoencoders (SAEs), we discovered linear directions within large language models' activations that distinguish "known" from "unknown" entities. Steering along the "known" latent increases the likelihood of hallucination, while steering along the "unknown" latent promotes answer refusal. This suggests the model encodes a form of self-knowledge, a representation of whether it recognizes an entity, which causally influences whether it refuses or generates incorrect information.
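The steering intervention described above can be sketched as adding a scaled SAE decoder direction to the residual stream. The snippet below is a minimal toy illustration, not the actual experimental code: the model activations and the "known entity" direction are random stand-ins, and `steer_with_latent` is a hypothetical helper name.

```python
import torch

def steer_with_latent(hidden: torch.Tensor, direction: torch.Tensor,
                      scale: float) -> torch.Tensor:
    """Add a unit-normalized latent direction to every residual-stream
    position, scaled by `scale` (positive scale amplifies the latent)."""
    unit = direction / direction.norm()
    return hidden + scale * unit

# Toy stand-ins for real activations and a real SAE decoder row.
torch.manual_seed(0)
hidden = torch.randn(4, 16)   # (seq_len, d_model)
known_dir = torch.randn(16)   # hypothetical "known entity" direction

steered = steer_with_latent(hidden, known_dir, scale=5.0)

# The intervention shifts each position's projection onto the
# direction by exactly `scale`:
unit = known_dir / known_dir.norm()
proj_before = hidden @ unit
proj_after = steered @ unit
```

In the actual experiments this kind of addition would be applied inside a forward hook at a chosen layer; here it is shown as a pure tensor operation for clarity.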
These latents correlate only weakly with simple token probabilities, suggesting they capture a more sophisticated form of knowledge recognition than mere predictability.