Platonic Representation Hypothesis

The “Platonic Representation Hypothesis” posits that as deep learning models scale up, their learned data representations increasingly converge toward a shared, common structure. The work provides empirical evidence that models with different architectures and training objectives—even across different modalities such as vision and language—become more similar in how they measure distances between data points. It argues that this convergence arises because individual models are all learning the same underlying statistical structure of the real world that generates the data we observe.

To measure representational alignment, the authors use a kernel-based methodology. Concretely, they quantify similarity between the kernels induced by two representation functions using a mutual -nearest-neighbor (mNN) metric, defined as:

Here, are feature vectors from models , and denotes the index set of the nearest neighbors of the vector. The paper further explains mathematically that contrastive learning is, in theory, optimized to maximize pointwise mutual information (PMI), which drives convergence toward the same statistical structure regardless of modality.

For empirical validation, they analyze 78 vision models and multiple large language models (LLMs). As model scale increases and performance improves, representational alignment between models increases approximately linearly. In particular, alignment between the vision model DINOv2 and various LLMs is highly correlated with model quality metrics (e.g., OpenWebText bits-per-byte). They also show that this alignment can serve as a predictor of downstream task performance, such as HellaSwag (commonsense reasoning) and GSM8K (math problem solving).

They find that not only models directly trained with language supervision (e.g., CLIP), but also independently trained vision models such as MAE and DINOv2, share substantial representational similarity with LLMs. As language models improve, alignment scores with vision models tend to rise from around ~0.1 to ≥0.16. The fact that mNN scores remain far below the theoretical maximum of 1 (hovering around ~0.16) suggests that full convergence is not yet achieved and that further research is needed.

The Platonic Representation Hypothesis

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the...

https://arxiv.org/abs/2405.07987

They show that an Activation Verbalizer (AV) from a Natural Language Autoencoder (NLA) optimized for a particular model can still generate reliable explanations for Sparse Autoencoder (SAE) features from other, unseen models. For example, Qwen feature descriptions produced by the Gemma AV had much higher cosine similarity to the original Qwen AV descriptions than a random control baseline. A limitation is that this was only tested on a single model pair and a restricted set of 45 features.

If this generalizes, it suggests we might be able to train a high-quality AV on one well-instrumented model and then reuse it to interpret features in other models, reducing the cost of interpretability work.

Explaining SAE Features With Foreign Natural Language Autoencoders — LessWrong

TLDR: • I show that a foreign model's Natural Language Autoencoder (NLA) Activation Verbalizer (AV) can produce plausible explanations for SAE featur…

https://www.lesswrong.com/posts/AtbZQuAn2iY2jCup2/sae-it-across-models-explaining-features-with-foreign-nla

Platonic Representation Hypothesis

Recommendations