NLA

Natural Language Autoencoder

Translator between the activation and natural language

An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. They jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations.

Proposes a Natural Language Autoencoder (NLA) that converts activations into natural-language descriptions and reconstructs the activations from those descriptions. NLA is useful for model auditing—for example, in pre-deployment safety evaluations of Claude Opus 4.6 to detect unverbalized evaluation awareness (internal assessments the model does not explicitly state).

NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation (h_l) to a natural-language description (z), and an activation reconstructor (AR) that maps (z) back to an activation. The objective minimizes reconstruction error:

AR is trained via supervised regression, while AV is trained with GRPO-based reinforcement learning. A KL penalty encourages readability, and an SFT warm start helps prevent steganographic degeneration.

Both AV and AR are initialized from a copy of the target LLM. AR uses a truncated architecture that only runs up to layer (l). Training uses activations extracted from pretraining-like text; the method achieves FVE (fraction of variance explained) around 0.6–0.8. The authors also release NLA models for open models (Qwen-2.5-7B, Gemma-3-12B/27B, Llama-3.3-70B) trained with GRPO.

Results

In case studies, NLA interpreted a variety of internal phenomena in Claude Opus 4.6, including rhyme planning while writing poetry, the cause of responding in Russian to an English prompt (traced to a defect in the training data), misreporting the number of tool calls, and reasoning about reward. The paper also demonstrates a causal intervention: using an NLA-derived steering vector to change the model’s rhyme target from “rabbit” to “mouse”.

In quantitative evaluations, performance improves over training across five prediction tasks (Suffix Prediction, CoT Hints, Safety Sandbagging, User Modeling, Classification). On an automated auditing benchmark, an NLA-equipped agent identified the root cause of reward-model sycophancy in an intentionally misaligned model without access to training data, while an SAE baseline showed no meaningful improvement under any condition. Even when full training-data access was allowed, NLA increased win rate.

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.

https://transformer-circuits.pub/2026/nla/

models

NLA Models - a kitft Collection

Natural Language Autoencoders across Llama, Gemma, and Qwen. Code: https://github.com/kitft/natural_language_autoencoders/

https://huggingface.co/collections/kitft/nla-models

neuronpedia

Natural Language Autoencoders – Llama3.3-70B-IT

Natural Language Autoencoder for Llama3.3-70B-IT. Translate a model's internal thoughts into natural language explanations.

https://www.neuronpedia.org/llama3.3-70b-it/nla?id=cmonsdyon004rgsjwc7r9y87l

Natural Language Autoencoders – Llama3.3-70B-IT

Natural Language Autoencoders 🔢➡️🔤 | The Residual Stream

Our New Anthropic Collaboration, Contribution Opportunities, and More SAEs

https://www.neuronpedia.org/blog/nlas

Natural Language Autoencoders 🔢➡️🔤 | The Residual Stream

They show that an Activation Verbalizer (AV) from a Natural Language Autoencoder (NLA) optimized for a particular model can still generate reliable explanations for Sparse Autoencoder (SAE) features from other, unseen models. For example, Qwen feature descriptions produced by the Gemma AV had much higher cosine similarity to the original Qwen AV descriptions than a random control baseline. A limitation is that this was only tested on a single model pair and a restricted set of 45 features.

If this generalizes, it suggests we might be able to train a high-quality AV on one well-instrumented model and then reuse it to interpret features in other models, reducing the cost of interpretability work.

Explaining SAE Features With Foreign Natural Language Autoencoders — LessWrong

TLDR: • I show that a foreign model's Natural Language Autoencoder (NLA) Activation Verbalizer (AV) can produce plausible explanations for SAE featur…

https://www.lesswrong.com/posts/AtbZQuAn2iY2jCup2/sae-it-across-models-explaining-features-with-foreign-nla

NLA

Natural Language Autoencoder

Results

Recommendations