SpaDE

Creator
Seonglae Cho
Created
2025 Mar 8 12:38
Edited
2025 Mar 11 11:31

Sparsemax Distance Encoder

Abstract

SAE aims to solve a bilevel optimization problem in which the outer optimization minimizes reconstruction error plus a sparsity regularizer, while the inner optimization finds the optimal projection values for the encoder given a constraint set (e.g., the probability simplex). SAE architectures implicitly assume specific data structures (linear separability, fixed sparsity / fixed dimensionality, etc.), which leads them to capture different concepts. When it cannot reflect the actual data structure (nonlinear separability, concept heterogeneity, etc.), an SAE may miss important concepts. SpaDE is designed to explicitly reflect each concept's intrinsic dimension and nonlinear separability by using the probability simplex as its projection set, enabling variable dimensionality through the sparsemax proxy and generating more interpretable and specialized latent-unit (feature) prototypes than existing SAEs. However, its Euclidean-distance assumption (distances between input vectors and prototypes are measured with Euclidean distance) may not suit all data types, and excessive specialization may limit generalization.

Preliminaries

The Duality Between SAE Architectures and Their Implicit Data Assumptions
SAE encoders project the input onto an architecture-specific constraint set. This projection fundamentally determines which features an SAE can extract and which it will suppress.

Receptive field

They define the SAE’s receptive field, a concept popularly used in neuroscience:
$$\mathcal{F}_k = \{x \in \mathbb{R}^d \mid f^{(k)}(x) > 0\}$$
Intuitively, $\mathcal{F}_k$ represents the region of input space where neuron $k$ is active. For projection-based encoders, the receptive field can be rewritten as:
$$\mathcal{F}_k = f^{-1}(S \cap \{z_k > 0\}),$$
where $S$ is the projection set of the encoder. That is, $\mathcal{F}_k$ is the pre-image of the intersection of the projection set with the half-space $\{z_k > 0\}$. Alternatively, it can be viewed as the complement of the pre-image of the set $S \cap \{z_k = 0\}$, where the hyperplane $z_k = 0$ indicates that latent $k$ is “dead”. This implies that the structure of receptive fields in an SAE is dictated by its encoder’s architecture.
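To make this concrete, a latent's receptive field can be probed empirically by sampling inputs and checking which ones activate it. A minimal NumPy sketch, assuming a plain ReLU encoder (all names and shapes here are illustrative):

```python
import numpy as np

def relu_encode(x, W, b):
    """A generic projection-based SAE encoder: f(x) = ReLU(Wx + b)."""
    return np.maximum(W @ x + b, 0.0)

def in_receptive_field(x, k, encode, **params):
    """Membership test for F_k = {x : f^(k)(x) > 0}."""
    return encode(x, **params)[k] > 0.0

# Trace out F_k empirically by probing latent k over sampled inputs.
rng = np.random.default_rng(0)
d, m = 8, 32                          # input dim, number of latents
W, b = rng.normal(size=(m, d)), rng.normal(size=m)
xs = rng.normal(size=(1000, d))
hits = np.array([in_receptive_field(x, 0, relu_encode, W=W, b=b) for x in xs])
print(f"latent 0 fires on {hits.mean():.1%} of sampled inputs")
```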

Duality

There is a fundamental duality between how concepts are organized in model representations and how an SAE encoder’s receptive fields should be structured to optimally identify those concepts. Crucially, this implies that any SAE is implicitly biased towards identifying concepts that are organized in a specific manner.

SAE Assumptions

TopK assumes separation by solid angle, i.e., concepts belong to non-overlapping hyperpyramids (unbounded convex polytopes with flat faces and a single corner at the origin) in high dimensions.
  • ReLU SAE: captures heterogeneity due to its adaptive sparsity.
  • Top-K SAE: based on ranking and relative magnitudes within vectors rather than absolute component values, making concepts angularly separable.
  • JumpReLU SAE: captures heterogeneity due to its adaptive sparsity.
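For reference, a minimal sketch of the three encoder nonlinearities behind these assumptions (the pre-activations `z` and the threshold values are illustrative):

```python
import numpy as np

def relu(z):
    """ReLU: sparsity adapts to the input (any number of latents can fire)."""
    return np.maximum(z, 0.0)

def topk(z, k):
    """Top-K: exactly k latents fire, chosen by relative magnitude (angular)."""
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k)[-k:]   # indices of the k largest pre-activations
    out[idx] = z[idx]
    return out

def jumprelu(z, theta):
    """JumpReLU: per-latent thresholds theta; sparsity is again adaptive."""
    return np.where(z > theta, z, 0.0)

z = np.array([0.9, -0.2, 0.4, 0.1])
print(relu(z))                 # [0.9 0.  0.4 0.1] -> 3 active
print(topk(z, k=2))            # [0.9 0.  0.4 0. ] -> always exactly 2 active
print(jumprelu(z, theta=0.3))  # [0.9 0.  0.4 0. ] -> depends on thresholds
```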

Concept heterogeneity

For concept heterogeneity, SAEs must demonstrate adaptive sparsity in their latent representations; that is, different concepts must be able to activate different numbers of latents. For projection nonlinearities, this implies that the projection set $S$ must admit points with varying levels of sparsity.
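Sparsemax, the simplex projection that SpaDE builds on, illustrates this requirement: unlike softmax it produces exact zeros, and the size of its support varies with the input. A minimal sketch of the standard sorting-based algorithm (Martins & Astudillo, 2016):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = 1 + ks * z_sorted > cssv   # which sorted coords stay positive
    k = ks[support][-1]                  # size of the support
    tau = (cssv[k - 1] - 1) / k          # threshold subtracted from z
    return np.maximum(z - tau, 0.0)

# Support size adapts to the input: peaked inputs activate few latents,
# flat inputs activate many -- exactly the adaptive sparsity required above.
print(sparsemax(np.array([3.0, 0.1, 0.0, -1.0])))   # one nonzero entry
print(sparsemax(np.array([0.5, 0.4, 0.45, 0.42])))  # four nonzero entries
```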

SpaDE Method

Due to the use of Euclidean distances in choosing active indices, the receptive field is a union of convex polytopes in the vicinity of the prototype $a_k$ of latent $k$. This incorporates the notions of locality and flexibility in receptive-field shapes, allowing latents to capture nonlinearly separable concepts.
The outer optimization for SpaDE is a locality-enforced version of dictionary learning called K-Deep Simplex (KDS). The distance-weighted L1 regularizer encourages each datapoint to use the dictionary atoms that are close to it in Euclidean distance, inducing a soft clustering bias.
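A minimal sketch of this objective, assuming the encoder is sparsemax applied to negative scaled squared distances to the prototypes; `beta` and `lam` are illustrative hyperparameters, not values from the paper:

```python
import numpy as np

def sparsemax(z):
    """Projection onto the probability simplex (as sketched above)."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    k = ks[1 + ks * z_sorted > cssv][-1]
    tau = (cssv[k - 1] - 1) / k
    return np.maximum(z - tau, 0.0)

def kds_loss(X, A, beta=1.0, lam=0.1):
    """KDS-style objective: reconstruction + distance-weighted L1 (locality)."""
    # Squared Euclidean distances d2[i, k] = ||x_i - a_k||^2 to each prototype.
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    # Encode: simplex codes, so nearby prototypes receive most of the mass.
    Z = np.stack([sparsemax(-beta * row) for row in d2])
    X_hat = Z @ A                                # decode with the same atoms
    recon = ((X - X_hat) ** 2).sum(-1).mean()
    locality = (Z * d2).sum(-1).mean()           # z >= 0, so this is the weighted L1
    return recon + lam * locality

rng = np.random.default_rng(0)
X, A = rng.normal(size=(64, 16)), rng.normal(size=(32, 16))
print(kds_loss(X, A))
```
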
Note how SpaDE satisfies the two data properties of nonlinear separability and heterogeneity:
  • The projection set $S$ in SpaDE is the probability simplex, which admits edges/corners with varying levels of sparsity, thereby allowing the representation of heterogeneous concepts.
  • The receptive fields of SpaDE are local to each prototype (encoder weight vector) and are flexibly shaped as unions of convex polytopes. This allows SpaDE latents to become monosemantic to concepts that are nonlinearly separable from the rest of the data.

Separability Experiment

  • Setup: 2D Gaussian cluster data, configured so that some clusters are linearly separable while others are only nonlinearly separable (a toy version is sketched after this list).
  • Observations:
    • ReLU and JumpReLU capture linearly separable concepts well but show low F1 scores on nonlinearly separable concepts.
    • TopK shows some flexibility but still has limitations.
    • SpaDE achieves nearly 100% F1 in both cases, maintaining clear concept distinction and locality.
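A toy version of this setup; the cluster layout and the latent-versus-concept F1 scoring are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

# Concept A: a single Gaussian blob (linearly separable from the rest).
concept_a = rng.normal(loc=(4.0, 0.0), scale=0.3, size=(500, 2))
# Concept B: a noisy ring around the origin (only nonlinearly separable).
angles = rng.uniform(0, 2 * np.pi, size=500)
concept_b = np.stack([np.cos(angles), np.sin(angles)], axis=1) * 2.0
concept_b += rng.normal(scale=0.1, size=concept_b.shape)

X = np.vstack([concept_a, concept_b])
labels = np.array([0] * 500 + [1] * 500)

def f1_latent_vs_concept(active, concept_mask):
    """F1 between a latent's activation mask and a ground-truth concept mask."""
    tp = np.sum(active & concept_mask)
    prec = tp / max(active.sum(), 1)
    rec = tp / max(concept_mask.sum(), 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)

# Example: a "distance-to-origin" latent is active exactly on the ring.
ring_latent = np.abs(np.linalg.norm(X, axis=1) - 2.0) < 0.5
print(f1_latent_vs_concept(ring_latent, labels == 1))
```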

Heterogeneity Experiment

  • Setup: Generated Gaussian clusters with different intrinsic dimensions (6, 14, 30, 62, 126) in a 128-dimensional ambient space (a generation sketch follows this list).
  • Observations:
    • ReLU and JumpReLU adapt somewhat but fail to fully reflect each concept's intrinsic dimension.
    • TopK fails to effectively reproduce high-dimensional concepts due to fixed sparsity.
    • SpaDE flexibly adjusts sparsity (number of active neurons) per concept, showing near-ideal reproduction performance.
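A sketch of generating such clusters; the random-subspace embedding is an assumption, while the intrinsic dimensions match the setup above:

```python
import numpy as np

def make_cluster(intrinsic_dim, ambient_dim=128, n=1000, rng=None):
    """Gaussian cluster of a given intrinsic dimension, embedded in ambient_dim."""
    rng = rng or np.random.default_rng(0)
    latent = rng.normal(size=(n, intrinsic_dim))
    # Orthonormal basis of a random intrinsic_dim-dimensional subspace.
    basis, _ = np.linalg.qr(rng.normal(size=(ambient_dim, intrinsic_dim)))
    center = rng.normal(scale=5.0, size=ambient_dim)   # separate the clusters
    return latent @ basis.T + center

clusters = [make_cluster(d, rng=np.random.default_rng(d)) for d in (6, 14, 30, 62, 126)]
X = np.vstack(clusters)   # an ideal SAE should use roughly d active latents for cluster d
```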

Formal Language Experiment

  • Setup: Training the SAE on intermediate activations of a Transformer model, using text generated from a PCFG (probabilistic context-free grammar); a toy sampler is sketched after this list.
  • Observations:
    • Clear clustering and separation phenomena observed between various Parts-of-Speech.
    • SpaDE particularly excels at capturing monosemantic characteristics, generating more interpretable representations than other SAEs.
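For context, a toy PCFG sampler of the kind such a setup relies on; this grammar is a made-up illustration, not the paper's, and the POS tags double as ground-truth concept labels:

```python
import random

# A toy PCFG: nonterminal -> list of (production, probability).
GRAMMAR = {
    "S":   [(["NP", "VP"], 1.0)],
    "NP":  [(["Det", "N"], 0.7), (["Det", "Adj", "N"], 0.3)],
    "VP":  [(["V", "NP"], 0.6), (["V"], 0.4)],
    "Det": [(["the"], 0.5), (["a"], 0.5)],
    "Adj": [(["red"], 0.5), (["small"], 0.5)],
    "N":   [(["cat"], 0.5), (["dog"], 0.5)],
    "V":   [(["sees"], 0.5), (["chases"], 0.5)],
}

def sample(symbol="S", rng=random.Random(0)):
    """Recursively expand a symbol; anything not in GRAMMAR is a terminal."""
    if symbol not in GRAMMAR:
        return [symbol]
    productions, weights = zip(*GRAMMAR[symbol])
    choice = rng.choices(productions, weights=weights)[0]
    return [tok for sym in choice for tok in sample(sym, rng)]

print(" ".join(sample()))  # e.g. "the cat chases a small dog"
```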

Vision Experiment

  • Setup: Training the SAE on image tokens extracted from the Imagenette dataset using the DINOv2 model (token extraction is sketched after this list).
  • Observations:
    • SpaDE effectively separates interpretable visual concepts like object foreground/background and detailed parts, showing high F1 scores and distinct activation patterns for each class.
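A sketch of the token-extraction step, assuming the public torch.hub entry point for DINOv2; the `forward_features` output key follows the DINOv2 repository's conventions and should be treated as an assumption:

```python
import torch
from torchvision import transforms
from PIL import Image

# Load a small DINOv2 backbone from torch.hub (facebookresearch/dinov2).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("imagenette_sample.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    feats = model.forward_features(img)
patch_tokens = feats["x_norm_patchtokens"]   # (1, 256, 384) for ViT-S/14 at 224px
# Each patch token is one training example for the SAE.
```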