High-level features are smooth across time
Temporal SAE (T-SAE) builds on the linguistic intuition that semantic content tends to stay stable across adjacent tokens, while syntax can change more locally.
The key idea is to split the SAE feature space into high-level (semantic) and low-level (syntactic) components, and add a temporal contrastive loss so that high-level features activate consistently across neighboring tokens.
The overall loss is the sum of a Matryoshka-SAE-style hierarchical reconstruction loss and a contrastive term. The contrastive loss follows an InfoNCE form: it increases the cosine similarity of adjacent token pairs from the same sequence and decreases similarity to tokens from other samples.
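One way to write the loss described above, as a sketch rather than the paper's exact formulation. The notation is an assumption of this note: $z^{\text{high}}_{i,t}$ is the high-level feature vector of token $t$ in sequence $i$, $B$ is the number of adjacent-token pairs in the batch, $\mathrm{sim}$ is cosine similarity, $\tau$ is a temperature, and $\alpha$ is a weighting coefficient (with $\alpha = 1$ giving the plain sum).

$$
\mathcal{L}_{\text{contr}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}\left(z^{\text{high}}_{i,t},\, z^{\text{high}}_{i,t-1}\right) / \tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}\left(z^{\text{high}}_{i,t},\, z^{\text{high}}_{j,t-1}\right) / \tau\right)}
$$

$$
\mathcal{L} = \mathcal{L}_{\text{Matryoshka}} + \alpha \, \mathcal{L}_{\text{contr}}
$$

The numerator rewards agreement between a token and its predecessor in the same sequence; the denominator penalizes similarity to the predecessor tokens of other samples in the batch.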
Concretely, it uses a contrastive objective that maximizes the cosine similarity between the high-level features of adjacent tokens $t$ and $t-1$, while minimizing similarity to other samples. The total loss adds this contrastive term to the Matryoshka SAE hierarchical reconstruction loss.
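The contrastive term and the total loss can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, the weight `alpha`, and the choice to treat the first 20% of latent dimensions as the high-level block are all hypothetical, and the reconstruction loss is passed in as a precomputed number.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards all-zero vectors


def split_features(z, high_frac=0.2):
    """Split one SAE activation vector into (high-level, low-level) parts.

    Assumption: the high-level block is the leading `high_frac` of the latents.
    """
    k = int(round(len(z) * high_frac))
    return z[:k], z[k:]


def temporal_infonce(z_t, z_prev, tau=0.1):
    """InfoNCE over a batch of adjacent-token pairs.

    z_t[i] and z_prev[i] are high-level features of tokens t and t-1 from
    sequence i (the positive pair); z_prev[j] for j != i act as negatives.
    """
    losses = []
    for i, anchor in enumerate(z_t):
        logits = [cosine(anchor, other) / tau for other in z_prev]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def total_loss(recon_loss, z_t, z_prev, alpha=1.0, tau=0.1):
    """Reconstruction loss (computed elsewhere) plus the weighted contrastive term."""
    return recon_loss + alpha * temporal_infonce(z_t, z_prev, tau)
```

When each token's high-level features match those of its own predecessor and differ from the predecessors in other sequences, `temporal_infonce` is close to zero; when the pairs are mismatched, it grows, which is the pressure that pushes high-level features to stay consistent across neighboring tokens.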
This achieves semantic/syntactic feature disentanglement in a self-supervised way (no explicit semantic labels). T-SAE splits high/low-level features in a 20:80 ratio; in t-SNE visualizations, high-level features cluster clearly by semantic/contextual content, while low-level features cluster by part of speech. Visualizing top feature activations over long sequences reveals clear phase transitions across text, whereas standard Matryoshka SAE features fluctuate noisily token-by-token.
By contrast, the features of a standard Matryoshka SAE cluster primarily around syntactic information in t-SNE.
In experiments with Pythia-160m and Gemma2-2b, T-SAE significantly outperforms a baseline SAE on semantic probing accuracy while maintaining strong reconstruction quality (FVE 0.75–0.94, cosine similarity 0.88–0.93) and autointerpretability scores (0.81–0.83).
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of...
Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as...
https://arxiv.org/abs/2511.05541


Seonglae Cho