NLA

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2026 May 17 23:13
Editor
Edited
Edited
2026 Jun 8 13:38

Natural Language Autoencoder

Translator between the activation and natural language
An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. They jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations.
Proposes a Natural Language Autoencoder (NLA) that converts activations into natural-language descriptions and reconstructs the activations from those descriptions. NLA is useful for model auditing—for example, in pre-deployment safety evaluations of Claude Opus 4.6 to detect unverbalized evaluation awareness (internal assessments the model does not explicitly state).
NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation (h_l) to a natural-language description (z), and an activation reconstructor (AR) that maps (z) back to an activation. The objective minimizes reconstruction error:
AR is trained via supervised regression, while AV is trained with GRPO-based reinforcement learning. A KL penalty encourages readability, and an SFT warm start helps prevent steganographic degeneration.
Both AV and AR are initialized from a copy of the target LLM. AR uses a truncated architecture that only runs up to layer (l). Training uses activations extracted from pretraining-like text; the method achieves FVE (fraction of variance explained) around 0.6–0.8. The authors also release NLA models for open models (Qwen-2.5-7B, Gemma-3-12B/27B, Llama-3.3-70B) trained with GRPO.

Results

In case studies, NLA interpreted a variety of internal phenomena in Claude Opus 4.6, including rhyme planning while writing poetry, the cause of responding in Russian to an English prompt (traced to a defect in the training data), misreporting the number of tool calls, and reasoning about reward. The paper also demonstrates a causal intervention: using an NLA-derived steering vector to change the model’s rhyme target from “rabbit” to “mouse”.
In quantitative evaluations, performance improves over training across five prediction tasks (Suffix Prediction, CoT Hints, Safety Sandbagging, User Modeling, Classification). On an automated auditing benchmark, an NLA-equipped agent identified the root cause of reward-model sycophancy in an intentionally misaligned model without access to training data, while an SAE baseline showed no meaningful improvement under any condition. Even when full training-data access was allowed, NLA increased win rate.
 
 
 
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.
models
NLA Models - a kitft Collection
Natural Language Autoencoders across Llama, Gemma, and Qwen. Code: https://github.com/kitft/natural_language_autoencoders/
NLA Models - a kitft Collection
neuronpedia
Natural Language Autoencoders – Llama3.3-70B-IT
Natural Language Autoencoder for Llama3.3-70B-IT. Translate a model's internal thoughts into natural language explanations.
Natural Language Autoencoders – Llama3.3-70B-IT
Natural Language Autoencoders 🔢➡️🔤 | The Residual Stream
Our New Anthropic Collaboration, Contribution Opportunities, and More SAEs
Natural Language Autoencoders 🔢➡️🔤 | The Residual Stream
 

Recommendations