Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] — AI Alignment Forum
Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm tries to replace all model activations that shou…
https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

Seonglae Cho