Factuality Confidence Measure
Distributional certainty alone is insufficient; Semantic Entropy instead measures how consistently a model answers at the semantic level. Confidently wrong cases can have high distributional certainty but low semantic consistency.
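A minimal sketch of estimating semantic entropy, assuming sampled answers with their sequence probabilities and a bidirectional-entailment check (e.g., backed by an NLI model); the greedy clustering and function names are illustrative, not a reference implementation.

```python
import math


def semantic_entropy(answers, probs, entails):
    """Estimate semantic entropy over sampled answers.

    answers: list of sampled answer strings
    probs:   list of sequence probabilities, one per answer
    entails: callable(a, b) -> bool, True if a entails b
             (assumed helper, e.g. an NLI model)
    """
    # Greedily cluster answers into semantic-equivalence classes
    # using bidirectional entailment against each cluster's representative.
    clusters = []  # each cluster is a list of answer indices
    for i, answer in enumerate(answers):
        for cluster in clusters:
            rep = answers[cluster[0]]
            if entails(answer, rep) and entails(rep, answer):
                cluster.append(i)
                break
        else:
            clusters.append([i])

    # Entropy over the probability mass assigned to each semantic cluster.
    total = sum(probs)
    entropy = 0.0
    for cluster in clusters:
        p = sum(probs[i] for i in cluster) / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy
```

If the model is confidently wrong, each sample can have high token-level probability while the samples fall into many distinct clusters, giving high semantic entropy despite high distributional certainty.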
AI Confidence Notion
Confidence evolves across layers, from an overconfidence stage to a calibration (confidence correction) stage, with a low-dimensional calibration direction in the residual stream (a probing sketch follows below).
https://arxiv.org/pdf/2511.00280
Overconfidence is a problem
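A hedged sketch of how such a calibration direction could be probed from residual-stream activations; this difference-of-means probe is an assumption for illustration, not the paper's method, and the variable names are hypothetical.

```python
import numpy as np


def calibration_direction(hidden_states, correct):
    """Probe a candidate 'calibration direction' at one layer.

    hidden_states: (n_examples, d_model) residual-stream activations
    correct:       (n_examples,) binary labels (1 = answer was correct)
    Returns a unit vector along which confidence-related signal varies.
    """
    hidden_states = np.asarray(hidden_states)
    correct = np.asarray(correct)

    # Difference of class means as a simple linear probe direction.
    mu_correct = hidden_states[correct == 1].mean(axis=0)
    mu_incorrect = hidden_states[correct == 0].mean(axis=0)
    direction = mu_correct - mu_incorrect
    return direction / np.linalg.norm(direction)


# Projecting activations onto the direction gives a scalar confidence signal
# that can be compared across layers (e.g., overconfident early, corrected late).
def confidence_signal(hidden_states, direction):
    return np.asarray(hidden_states) @ direction
```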
WorldVQA: Measuring Atomic World Knowledge in MLLMs
WorldVQA is a benchmark designed to evaluate atomic vision-centric world knowledge in Multimodal Large Language Models (MLLMs).
https://www.kimi.com/blog/worldvqa.html

Seonglae Cho