TFA

Temporal Feature Analysis

Predictable component (slow-moving, contextual): the part of the current activation that can be predicted from past context

Novel component (fast-moving, residual): new information not explained by context

TFA constructs a direction that explains the current activation from past activations, implementing this in an attention-like form such as NEPA. The novel component is obtained by applying an SAE to the residual.
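The split described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the paper's architecture: the attention here is untrained dot-product attention that uses the current activation as the query, the "direction" is simply the attention-weighted average of past activations, and `split_predictable_novel`, `sae_encode`, `W_enc`, and `b_enc` are hypothetical names introduced for this example.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    scores = scores - scores.max()
    weights = np.exp(scores)
    return weights / weights.sum()


def split_predictable_novel(acts, t):
    """Split acts[t] into a predictable part and a novel residual.

    acts: array of shape (seq_len, d_model), one activation per token.
    Assumption: untrained dot-product attention over past activations
    stands in for the learned attention module.
    """
    x_t = acts[t]
    past = acts[:t]
    if t == 0:
        # No context yet, so nothing is predictable.
        predictable = np.zeros_like(x_t)
    else:
        # Attention weights over past positions (query = current activation).
        weights = softmax(past @ x_t / np.sqrt(x_t.shape[0]))
        # Direction built from past activations.
        context = weights @ past
        # Project the current activation onto that direction.
        denom = context @ context
        coef = (x_t @ context) / denom if denom > 0 else 0.0
        predictable = coef * context
    # Whatever the context cannot explain is the novel component.
    novel = x_t - predictable
    return predictable, novel


def sae_encode(residual, W_enc, b_enc):
    """ReLU encoder of a sparse autoencoder, applied to the residual only."""
    return np.maximum(W_enc @ residual + b_enc, 0.0)


# Usage with random stand-in activations (6 tokens, 8 dimensions).
rng = np.random.default_rng(0)
acts = rng.normal(size=(6, 8))
predictable, novel = split_predictable_novel(acts, t=4)
codes = sae_encode(novel, rng.normal(size=(16, 8)), np.zeros(16))
```

Because the predictable part is an orthogonal projection onto a single direction, the two parts sum back to the original activation and the residual is orthogonal to the predictable part. A trained model would learn the attention parameters and the SAE weights jointly rather than use random ones.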

SAEs assume that concepts are independent and stationary over time, but actual LM activations exhibit strong temporal correlations and non-stationarity. The SAE's assumptions of temporal independence and fixed sparsity lead to bottlenecks such as SAE Feature Splitting.

Temporal Feature Analysis (TFA) decomposes activations into predictable (slow, contextual) components and novel (fast, residual) components. It outperforms SAEs on garden-path sentence parsing, event boundary detection, and capturing long-range structure. In other words, interpretability tools require an inductive bias aligned with the temporal structure of the data.

arxiv.org

https://arxiv.org/pdf/2511.01836

Token sequence

SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features — LessWrong

Abstract Sparse Autoencoders (SAEs) linearly extract interpretable features from a large language model's intermediate representations. However, the…

https://www.lesswrong.com/posts/ATsvzF77ZsfWzyTak/dataset-sensitivity-in-feature-matching-and-a-hypothesis-on-1
