Towards Confidence Manifold

LLMs have a metacognition gap: they "say it even though they know it is wrong."

The point is that a simple centroid distance reaches performance on par with a trained probe. In other words, the geometric structure alone is enough, without any complex training.
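A minimal sketch of the centroid-distance idea, on synthetic data only. The Gaussian clusters stand in for hidden states of correct and incorrect statements, and the names `fit_centroids` and `centroid_score` are hypothetical; nothing here reproduces a specific paper's setup.

```python
import numpy as np

def fit_centroids(acts, labels):
    """Mean activation of each class (0 = incorrect, 1 = correct)."""
    return acts[labels == 0].mean(axis=0), acts[labels == 1].mean(axis=0)

def centroid_score(acts, c_wrong, c_right):
    """Positive when an activation lies closer to the 'correct' centroid."""
    d_wrong = np.linalg.norm(acts - c_wrong, axis=1)
    d_right = np.linalg.norm(acts - c_right, axis=1)
    return d_wrong - d_right

# Synthetic stand-in for hidden states: two Gaussian clusters in 16 dimensions,
# separated along the first axis (an assumption made purely for illustration).
rng = np.random.default_rng(0)
dim = 16
shift = np.zeros(dim)
shift[0] = 4.0
wrong = rng.normal(size=(200, dim))
right = rng.normal(size=(200, dim)) + shift
acts = np.vstack([wrong, right])
labels = np.array([0] * 200 + [1] * 200)

c_wrong, c_right = fit_centroids(acts, labels)
scores = centroid_score(acts, c_wrong, c_right)
accuracy = ((scores > 0).astype(int) == labels).mean()
print(f"centroid-distance accuracy: {accuracy:.3f}")
```

There are no learned parameters beyond the two class means, so this is the "no complex training" baseline that any trained probe would have to beat.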

Demo Factuality Confidence

We may need training, as mhc does. Look at an algebraic structure or a polytope to regulate.

Show a per-sentence metric of factual certainty, rendered as color or opacity.

agent reviewer such as stanford reviewer, paperdebugger

seed statistical robustness scaling

Check that every citation is correct and necessary.

Motivation

Even state-of-the-art LLMs such as GPT5.2 and Claude 4.5 hallucinate frequently. The causes include an outdated training dataset, questions built on a false premise, strong assertions from the user, conflicting context, or the fundamentally probabilistic nature of the model. Mechanistic interpretability techniques have been shared among researchers and used heavily on the research side, but understanding how a model works, and potentially controlling it, should also be available on the side of actual users worldwide. Fortunately, mech interp has developed alongside cool visualizations, and one of the simplest and most intuitive is text highlighting. If users, instead of suffering from hallucinations while talking to an LLM, could see the LLM's confidence in each sentence as opacity through text highlighting, there would be far fewer lawyers handing wrong information to a judge and far fewer students submitting assignments that cite broken URLs as references.

Research Question

Can we extract a normalized confidence score per token, in a way that separates cleanly at the token, sentence, or line level?

Which level of confidence highlighting is most helpful to users?

Can we extract a confidence score from the latents in real time while the LLM is generating? If so, from which residual stream layer or activation?

Dense SAE Latents Are Features, Not Bugs

The claim is that the residual stream contains directions that change next-token semantics (which word will appear) as well as directions that barely change semantics and only alter confidence/entropy (the sharpness of the distribution). This paper shows that the latter (confidence control) is predominantly captured as dense SAE latents.

Dense latents capture a dense subspace that intrinsically exists in the residual stream. When an SAE is retrained on a subspace with the dense latents removed, almost no dense latents emerge, so they are not a training artifact. Dense latents appear as antipodal pairs (± directional pairs), with each pair representing one direction.

Role classification: position tracking, context binding, entropy regulation (nullspace, kernel), alphabet/output signals, POS/semantic words, PCA reconstruction.

Previous thought: nullspace = meaningless / garbage dimensions. This result shows instead: nullspace = control channels intentionally used by the model.
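A toy sketch of how a nullspace direction could act as an entropy control channel. It assumes an RMSNorm-style normalization before the unembedding and a random matrix `W_U` as a hypothetical unembedding; it illustrates one plausible mechanism and is not a reproduction of the paper's analysis.

```python
import numpy as np

def rms_norm(x):
    """Scale x to unit root-mean-square (no learned gain)."""
    return x / np.sqrt(np.mean(x ** 2))

def softmax_entropy(logits):
    """Entropy (nats) of softmax(logits)."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
d_model, vocab = 32, 10
W_U = rng.normal(size=(vocab, d_model))  # hypothetical unembedding, rank <= 10

# With full SVD, the trailing right singular vectors span the nullspace of W_U.
_, _, Vt = np.linalg.svd(W_U)
null_dir = Vt[-1]

h = rng.normal(size=d_model)         # toy residual-stream state
h_boosted = h + 20.0 * null_dir      # write a large component into the nullspace

logits_base = W_U @ rms_norm(h)
logits_boost = W_U @ rms_norm(h_boosted)

print("same argmax:", logits_base.argmax() == logits_boost.argmax())
print("entropy:", softmax_entropy(logits_base), "->", softmax_entropy(logits_boost))
```

The nullspace component contributes nothing to the logits directly. It inflates the norm, so normalization shrinks every logit by the same positive factor: the predicted token is unchanged while the distribution flattens.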


https://arxiv.org/pdf/2506.15679v2

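Taking inspiration from this, verify that the dense component of the residual stream can serve as a confidence measure. Alternatively, compare it against the sharpness of the final distribution as a baseline and see which works better in terms of UX. The open problem is how to build the function mapping pipeline from the residual stream to a per-sentence confidence score.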

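The residual-stream-to-confidence mapping in the original paper is presumably per token. Still undecided: mean-pool the activations first and then run PCA, or compute the per-token score first and then pool within each sentence.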
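A minimal sketch of the second option (score each token first, then pool within the sentence). The sentence splitter is deliberately naive, the whitespace tokens stand in for real tokenizer offsets, and the token scores are made-up numbers; `sentence_spans` and `pool_scores` are hypothetical helper names.

```python
import re
import numpy as np

def sentence_spans(text):
    """Character spans of sentences, split naively after '.', '!' or '?'."""
    spans, start = [], 0
    for m in re.finditer(r"[.!?]+\s*", text):
        spans.append((start, m.end()))
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans

def pool_scores(token_starts, token_scores, spans, how="mean"):
    """Pool per-token scores into one score per sentence span."""
    reducer = {"mean": np.mean, "min": np.min}[how]
    pooled = []
    for s, e in spans:
        vals = [sc for ts, sc in zip(token_starts, token_scores) if s <= ts < e]
        pooled.append(float(reducer(vals)) if vals else float("nan"))
    return pooled

text = "Paris is in France. The moon is cheese."
token_starts = [m.start() for m in re.finditer(r"\S+", text)]  # whitespace "tokens"
token_scores = [0.9, 0.95, 0.9, 0.92, 0.8, 0.7, 0.6, 0.2]      # made-up confidences

spans = sentence_spans(text)
mean_scores = pool_scores(token_starts, token_scores, spans, how="mean")
min_scores = pool_scores(token_starts, token_scores, spans, how="min")
print(mean_scores, min_scores)
```

Mean pooling smooths over a single low-confidence token, while min pooling lets one doubtful token mark the whole sentence. Which of the two reads better in the highlighted UI is an open choice here.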

Start with deep research on whether related work or demos already exist, such as hallucination detection or sentence/token-wise confidence indicators, and whether something better is out there.

Can this be done without an SAE?

Work through the questions above before writing concrete implementation instructions.

Experiment

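Take a set of hallucination-inducing questions, or an existing dataset, and run it on models that are relatively vulnerable and small but still chat-capable (search LLM leaderboards or analyses for models of 2B or larger with weak factuality). Compare the two approaches above on those models.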

Demo

A cool visualization: a side-by-side comparison for things like answers, followed by a full list. shadcn is probably the best choice for the frontend design.

Default: sentence-level shading (opacity)

On hover/tooltip: token-level details (top-k alternative tokens, logprob, margin)
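The two display levels above could be backed by something like the following sketch, which works on a raw logit vector. The function names, the `floor` parameter, and the choice of geometric-mean probability as the opacity source are assumptions for illustration, not a settled design.

```python
import numpy as np

def token_details(logits, k=3):
    """Top-k alternatives, top-1/top-2 margin and entropy for one position (k >= 2)."""
    z = logits - logits.max()
    logprobs = z - np.log(np.exp(z).sum())          # log-softmax
    order = np.argsort(logprobs)[::-1][:k]          # token ids, most likely first
    return {
        "top_k": [(int(i), float(logprobs[i])) for i in order],
        "margin": float(logprobs[order[0]] - logprobs[order[1]]),
        "entropy": float(-(np.exp(logprobs) * logprobs).sum()),
    }

def sentence_opacity(token_logprobs, floor=0.25):
    """Map a sentence's geometric-mean token probability to an opacity in [floor, 1]."""
    p = float(np.exp(np.mean(token_logprobs)))
    return floor + (1.0 - floor) * p

logits = np.array([2.0, 1.0, 0.0, -1.0])            # toy 4-token vocabulary
details = token_details(logits, k=2)
opacity = sentence_opacity([-0.1, -0.3, -2.0])      # made-up chosen-token logprobs
print(details, opacity)
```

The `floor` keeps low-confidence sentences legible instead of fading them out entirely; the frontend would read `opacity` for the shading and `details` for the tooltip.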

Expected result

information geometry

confidence manifold

Graphs

Hallucinated dataset with answers, offline → run inference and plot graphs that compare the resolution of each metric baseline by distribution. The dataset would be better if it came from the same model that is used for inference.
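One way to summarize how well each metric separates correct from hallucinated answers is AUROC, used here as a stand-in for comparing the score distributions by eye. The labels and scores below are made up, and `metric_a` / `metric_b` are placeholder names for whichever baselines are being compared.

```python
import numpy as np

def auroc(scores, is_correct):
    """Chance that a random correct item outscores a random hallucinated one (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    pos, neg = scores[is_correct], scores[~is_correct]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

labels = [1, 1, 0, 0, 1, 0]                  # 1 = factually correct, 0 = hallucinated
metric_a = [0.9, 0.8, 0.3, 0.4, 0.7, 0.2]    # made-up confidence scores
metric_b = [0.6, 0.4, 0.5, 0.3, 0.7, 0.6]

auc_a = auroc(metric_a, labels)
auc_b = auroc(metric_b, labels)
print(f"metric_a AUROC={auc_a:.3f}, metric_b AUROC={auc_b:.3f}")
```

A single number per metric makes the baselines easy to rank, though the per-class histograms would still be needed to see where a metric fails.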

Baseline

Logprob

Semantic Entropy

FActScore (not real-time)
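As a rough sketch of the Semantic Entropy baseline: sample several answers to the same question, group the ones that mean the same thing, and take the entropy over the groups. Published versions cluster with bidirectional entailment from an NLI model; the string normalization in `normalize` is a crude stand-in for that step, assumed here only to keep the example self-contained.

```python
import math
from collections import Counter

def normalize(answer):
    """Crude stand-in for semantic equivalence: lowercase and drop punctuation."""
    return "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace()).strip()

def semantic_entropy(samples, cluster_key=normalize):
    """Entropy (nats) over clusters of sampled answers that share a meaning key."""
    counts = Counter(cluster_key(s) for s in samples)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

consistent = ["Paris", "paris.", "Paris!"]           # made-up samples, one meaning
scattered = ["Paris", "Lyon", "paris", "Nice"]       # made-up samples, three meanings

se_consistent = semantic_entropy(consistent)
se_scattered = semantic_entropy(scattered)
print(se_consistent, se_scattered)
```

Unlike Logprob, this needs several samples per question, so it is more expensive at generation time; that cost is worth keeping in mind when comparing it with a single-pass latent score.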