Quantile based Steering

Creator

Created

2025 Feb 9 15:49

Editor

Edited

2025 Feb 9 15:50

Refs

Generalized min/max

https://transformer-circuits.pub/2023/monosemantic-features#global-analysis-interp-intervals

Some features show consistent activations across the top ~60% of the activation spectrum, and then quickly become less interpretable as we look to smaller and smaller activations.
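A minimal sketch of the "generalized min/max" idea, assuming it means picking a steering value from a quantile of a feature's observed activations (q=0 gives the min, q=1 the max). The helper names `activation_quantile` and `steering_value`, and the choice to restrict to positive activations, are hypothetical and not taken from the linked article, which describes intervals of the activation range rather than quantiles.

```python
from typing import Sequence


def activation_quantile(activations: Sequence[float], q: float) -> float:
    """Linearly interpolated q-quantile; q=0 is the min, q=1 is the max."""
    if not activations:
        raise ValueError("activations must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(activations)
    position = q * (len(ordered) - 1)          # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def steering_value(activations: Sequence[float], q: float,
                   active_only: bool = True) -> float:
    """Hypothetical helper: value to clamp a feature to when steering.

    Assumption: with active_only=True, zeros are dropped so the quantile
    is taken over the tokens where the feature actually fires.
    """
    pool = [a for a in activations if a > 0.0] if active_only else list(activations)
    return activation_quantile(pool, q)


# Toy activations for one feature across seven tokens (made-up numbers).
acts = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
median_steer = steering_value(acts, 0.5)   # median of the firing tokens
max_steer = steering_value(acts, 1.0)      # ordinary max, the q=1 special case
```

Steering at a mid-to-high quantile rather than the raw max is one way to stay inside the range where the feature still fires on real inputs; which quantile works best would need to be checked per feature.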

Quantile based steering

Scaling Automatic Neuron Description | Transluce AI

We are releasing a database of descriptions of every neuron inside Llama-3.1-8B-Instruct, and weights of an explainer model finetuned to produce them. These descriptions have similar quality to a human expert on automated metrics, and can be generated inexpensively using an 8B-parameter model. These high-quality descriptions allow us to query and steer representations in natural language, enabling applications such as our observability interface.

https://transluce.org/neuron-descriptions

Monitor: An AI-Driven Observability Interface

This write-up is a technical demonstration, which describes and evaluates the use of a new piece of technology. For technical demonstrations, we still run systematic experiments to test our findings, but do not run detailed ablations and controls. The claims are ones that we have tested and stand behind, but have not vetted as thoroughly as in our research reports.

https://transluce.org/observability-interface

Recommendations

////////////