AI Hallucination

Creator
Seonglae Cho
Created
2023 Mar 20 7:41
Edited
2025 Sep 20 15:08

Uncertainty of AI (Epsilon Greedy, Creativity, Extrapolation)

Hallucinations in robotics models, for example, pose physical dangers, unlike language-model hallucinations, which merely produce incorrect information. Mechanistic interpretability offers a promising, explicit way to control this behavior.
LLMs hallucinate more when fine-tuned on new factual knowledge, since they learn genuinely new information more slowly than knowledge consistent with what they already encode.

Triggering Prompts

  • Questions about non-existent terms or concepts
  • Prompts in domains the model handles inconsistently, such as numbers and dates
AI Hallucination Notion
 
 
 
 

The Internal State of an LLM Knows When It’s Lying

Bigger AI chatbots more inclined to spew nonsense
Masking the Retrieval Head or a relevant Induction head could induce hallucinations.
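As a rough illustration of this kind of intervention, the sketch below zero-ablates a single attention head with TransformerLens and compares next-token predictions before and after. The model, prompt, and the (layer, head) pair are placeholders, not verified retrieval or induction heads.

```python
# Minimal sketch: zero-ablate one attention head and compare predictions.
# The layer/head indices are illustrative, not known retrieval/induction heads.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # small model for illustration
prompt = "The Eiffel Tower is located in the city of"
tokens = model.to_tokens(prompt)

LAYER, HEAD = 5, 1  # hypothetical head to mask

def zero_head(z, hook):
    # z has shape [batch, pos, head_index, d_head]; silence one head's output
    z[:, :, HEAD, :] = 0.0
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)],
)

# Compare the top next-token prediction with and without the head
for name, logits in [("clean", clean_logits), ("ablated", ablated_logits)]:
    top_id = logits[0, -1].argmax().item()
    print(name, repr(model.tokenizer.decode([top_id])))
```

If masking a head flips a factual completion to an unrelated token, that head is a candidate for carrying the retrieved information.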
The "be concise" instruction reduces counter-explanations, decreasing accuracy by up to 20%. When questions are posed with high confidence, the model is up to 15% more likely to agree with false claims.
Hallucination is not a mysterious phenomenon but a natural consequence of statistical classification error. Even with perfect training data, errors inevitably arise from cross-entropy minimization. Viewing generation as a binary classification of whether an output is correct, hallucinations necessarily occur on unlearnable patterns (e.g., rarely appearing birthday facts), where the classification error rate approaches 50%. By Good–Turing estimation, the hallucination rate has a lower bound equal to the proportion of facts that appear only once in the training data (singletons). While RLHF reduces some hallucinations, most evaluations use a binary correct/incorrect score (accuracy), so guessing scores higher than answering IDK; this optimizes models to always provide overconfident answers in a "test-taking regime".
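As a toy illustration of that singleton lower bound, the sketch below counts how many facts occur exactly once in a small synthetic corpus of (entity, attribute) pairs; the corpus and fact format are invented for the example.

```python
# Toy illustration of the Good–Turing-style bound: the fraction of facts
# seen exactly once (singletons) lower-bounds the hallucination rate.
from collections import Counter

# Hypothetical training "facts" (entity, attribute), invented for illustration
facts = [
    ("Alice", "born 1990"), ("Alice", "born 1990"), ("Alice", "born 1990"),
    ("Bob", "born 1985"), ("Bob", "born 1985"),
    ("Carol", "born 1972"),   # appears only once: effectively unlearnable
    ("Dave", "born 2001"),    # appears only once
]

counts = Counter(facts)
singletons = sum(1 for c in counts.values() if c == 1)
singleton_rate = singletons / len(facts)  # singleton occurrences over all occurrences

print(f"{singletons} of {len(counts)} distinct facts are singletons")
print(f"lower bound on hallucination rate ~ {singleton_rate:.2f}")
```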
Rather than developing new hallucination evaluations, we should modify mainstream benchmarks (MMLU, GPQA, SWE-bench, etc.) to avoid penalizing IDK or uncertainty expressions. Alternatively, problem instructions could include explicit confidence thresholds (t=0.5, 0.75, etc.) and wrong answer penalties, encouraging behavioral calibration where models only answer when they have sufficient confidence.
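To make the threshold idea concrete, here is a sketch of one possible scoring rule (an assumption, not something specified in the source): a correct answer earns 1 point, IDK earns 0, and a wrong answer costs t/(1-t) points. Under this rule, answering beats IDK in expectation exactly when the model's confidence exceeds t, which is the behavioral calibration the instructions aim for.

```python
# Sketch of a confidence-threshold scoring rule (assumed, not from the source):
# correct = +1, IDK = 0, wrong = -t/(1-t). Expected score of answering is
# p - (1 - p) * t / (1 - t), which is positive exactly when p > t.
def expected_score(p: float, t: float, answer: bool) -> float:
    if not answer:
        return 0.0  # IDK always scores 0
    penalty = t / (1 - t)
    return p * 1.0 - (1 - p) * penalty

for t in (0.5, 0.75, 0.9):
    for p in (0.6, 0.8, 0.95):
        should_answer = expected_score(p, t, answer=True) > 0
        print(f"t={t:.2f} confidence={p:.2f} -> answer: {should_answer}")
```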
 
 

Recommendations