Oversquashing project

Superposition in attention/Transformer models comes from linear interference.

Oversquashing in GNNs comes from a topological bottleneck.

However, from the viewpoint of information geometry, Jacobian rank collapse, and mutual information decay, the two can be interpreted as isomorphic (analogous) phenomena.

So enforcing sparse activation may help avoid or reduce the oversquashing problem.

Indeed, Sparse Autoencoders (feature disentanglement) and graph rewiring (curvature flattening) can be unified in that both address this “information flow restoration” problem.

feature disentanglement loss

Graph

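When the information x_j of a distant node reaches node i, the amount of information transmitted drops sharply depending on the curvature of the graph.

In this case, the bottleneck shows up in the determinant of the Jacobian.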
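The decay described above can be made concrete with a small sketch. The notes do not state a formula, so this assumes the sensitivity proxy commonly used in the oversquashing literature: the influence of x_j on node i after r message-passing layers is bounded by a constant times the (i, j) entry of the r-th power of the normalized adjacency matrix. The graph (a simple path), the normalization D^-1 (A + I), and the helper names `path_graph_operator` and `sensitivity_proxy` are illustrative choices, not taken from the notes. It measures dilution along a single route and does not compute curvature.

```python
def path_graph_operator(n):
    """Row-normalized adjacency with self-loops, D^-1 (A + I), for a path of n nodes."""
    rows = []
    for i in range(n):
        nbrs = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
        rows.append([1.0 / len(nbrs) if j in nbrs else 0.0 for j in range(n)])
    return rows


def matmul(a, b):
    """Plain dense matrix product for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def sensitivity_proxy(n, r):
    """Entry (A_hat^r)[0][r]: proxy for how much node r can influence node 0 after r layers."""
    a_hat = path_graph_operator(n)
    power = a_hat
    for _ in range(r - 1):
        power = matmul(power, a_hat)
    return power[0][r]


# Proxy for distances 1..8 on a 12-node path (assumed sizes, for illustration only).
decay = [sensitivity_proxy(12, r) for r in range(1, 9)]
for r, value in enumerate(decay, start=1):
    print(f"distance {r}: {value:.6f}")
```

On this path there is exactly one route of length r from node r to node 0, so the proxy shrinks by a factor of 3 per extra hop. A graph with a tighter bottleneck would shrink it faster, which is the effect rewiring is meant to counter.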

LLM

When hidden representations are not independent of one another and overlap:

As a result, the representation covariance becomes ill-conditioned.

That is, information loss caused by feature interference (superposition).

Distortion of the representation manifold.
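A toy sketch of the interference described above, under the assumption that superposition means storing more feature directions than there are dimensions. Three unit-norm features are squeezed into a 2-dimensional hidden space, so their Gram matrix (a stand-in for the representation covariance) has rank at most 2 and is singular. All names (`feature_directions`, `gram`, `det3`) and the 3-into-2 setup are hypothetical illustrations, not the setup in the notes.

```python
import math


def feature_directions(n):
    """n unit vectors spread evenly on the circle: n features stored in 2 dimensions."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]


def gram(vectors):
    """Pairwise dot products; off-diagonal entries are the feature interference."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


W = feature_directions(3)   # 3 features, 2 hidden dimensions
G = gram(W)

# Activate only feature 0, embed it, then read every feature back out.
hidden = W[0]
readout = [sum(w_k * h_k for w_k, h_k in zip(w, hidden)) for w in W]

print("readout:", [round(v, 3) for v in readout])
print("det(G):", round(det3(G), 12))
```

The readout is nonzero for features 1 and 2 even though only feature 0 was active, and the determinant of G is zero up to rounding, which is the extreme case of an ill-conditioned covariance.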

Therefore, the following correspondence holds.

Ultimately, both phenomena are forms of Information Capacity Reduction via Representation Geometry Distortion.

Repeated Token Phenomenon

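Basically, an autoregressive model has a lower-triangular structure. That is, information flows in one direction only, and no information flows out of the last token. For this reason it is sometimes regarded as similar to oversquashing. However, the network itself is fully connected, so it differs from oversquashing in GNNs: structurally, there is no particular region where a bottleneck appears or where the curvature is deficient. Still, oversquashing itself resembles the way a dominating repeated-token phenomenon converges.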
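The one-directional flow can be sketched with a toy model. This assumes every token attends uniformly over itself and all earlier tokens, and it ignores residual connections, learned weights, and multiple heads, none of which are specified in the notes. The helper names are hypothetical. Entry [i][j] of the stacked matrix is the share of token j in the representation of token i.

```python
def uniform_causal_attention(t):
    """Lower-triangular attention: token i averages over tokens 0..i (assumed uniform)."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(t)] for i in range(t)]


def matmul(a, b):
    """Plain dense matrix product for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def influence_on_last_token(t, layers):
    """Last row of the attention matrix raised to the given number of layers."""
    attn = uniform_causal_attention(t)
    mixed = attn
    for _ in range(layers - 1):
        mixed = matmul(mixed, attn)
    return mixed[-1]


attn = uniform_causal_attention(6)
influence = influence_on_last_token(6, 2)

# The last token's column is zero everywhere except its own row: nothing flows out of it.
last_column = [row[-1] for row in attn]

print("influence on last token:", [round(v, 4) for v in influence])
print("last column:", last_column)
```

In this toy setting the first token has the largest share after two layers and the last token the smallest, even though no edge is missing from the causal graph. That matches the point that the imbalance here is not a topological bottleneck of the GNN kind.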

However, the repeated token phenomenon is said to diverge; need to find out what that means.

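Also, understand the causality produced by the cluster attack and the attention sink well enough to explain it to others, and write it up. Draw a causal graph: is the sink effect removed by dropping the first few tokens, or by the repetition approximating them? Attention sink is a good ...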

Why does oversquashing come down to floating point rather than latent size? Not related.

Does the amount of training data needed grow quadratically as latent size increases? Exponent 0.5 (sublinear).

Look for literature that uses oversquashing in a broader sense. Is it used to mean dominating?

As a benchmark, something like SinkBench or AttentionBench that triggers attention sink attacks; there are probably many fun options. Types: repetition and cluster, plus a few we devise by drawing on existing attention sink studies in a similar way.
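A minimal sketch of how the repetition type could be generated, assuming a benchmark item is just a prompt plus metadata. The schema (`type`, `token`, `count`, `prompt`), the function names, and the example tokens and counts are all hypothetical. The cluster type is left out because the notes do not define it.

```python
def repetition_prompt(token, count, sep=" "):
    """A prompt made of the same token repeated `count` times."""
    return sep.join([token] * count)


def build_repetition_cases(tokens, counts):
    """Cross every token with every repetition count (hypothetical item schema)."""
    cases = []
    for token in tokens:
        for count in counts:
            cases.append({
                "type": "repetition",
                "token": token,
                "count": count,
                "prompt": repetition_prompt(token, count),
            })
    return cases


# Example tokens and counts are placeholders, not values from the notes.
cases = build_repetition_cases(["the", "poem"], [8, 64, 512])
for case in cases:
    print(case["type"], case["token"], case["count"], len(case["prompt"]))
```

Sweeping the count separates short contexts from long ones, which is one way to probe the "context as a repetition vs. generate for repetition" question listed further down.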

Theory

NoPE or other recently adopted architectural changes, tokenizer vocabulary, activation function

check the difference between given context

they cannot distinguish

chat history

frequency-wave positional encoding: RoPE, NoPE

The superposition hypothesis might be a consequence of floating-point precision.

Model research

gpt-oss - vulnerable

llama - no

gemma 3

Question

previous research duplication

context as a repetition vs. generate for repetition

mech interp result sae

rope nope

https://gemini.google.com/share/776e0a531660