Refusal in LLMs is mediated by a single direction

That means we can bypass an LLM's refusal behavior by intervening on a single direction in its activations, or make such bypasses harder by anchoring the activations along that direction.

Source: Refusal in LLMs is mediated by a single direction (LessWrong). This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from…
https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
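
To make the idea of intervening on a single direction concrete, here is a minimal toy sketch in plain Python. It assumes the direction is estimated as the normalized difference between mean activations on two prompt sets, and that "bypassing" means projecting that direction out of an activation vector. The `anchor` function is a hypothetical illustration of the "anchoring" idea in the note above, not something taken from the linked post. All function names and the 3-dimensional toy data are made up for illustration; a real model would apply this to residual-stream activations with thousands of dimensions.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit_direction(harmful_acts, harmless_acts):
    """Difference-in-means direction between two activation sets, scaled to unit length."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

def ablate(x, r_hat):
    """Remove the component of x along r_hat: x - (x . r_hat) * r_hat."""
    p = dot(x, r_hat)
    return [xi - p * ri for xi, ri in zip(x, r_hat)]

def anchor(x, r_hat, target):
    """Hypothetical defence: pin the component of x along r_hat to a fixed value."""
    p = dot(x, r_hat)
    return [xi + (target - p) * ri for xi, ri in zip(x, r_hat)]

# Toy activations (assumed data): the two sets differ only along the first axis.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r_hat = unit_direction(harmful, harmless)

x = [5.0, 2.0, 1.0]
x_ablated = ablate(x, r_hat)                 # component along r_hat removed
x_anchored = anchor(x, r_hat, target=3.0)    # component along r_hat fixed at 3.0
```

After `ablate`, the activation has no component along `r_hat`, while every orthogonal component is unchanged. After `anchor`, the component along `r_hat` equals `target` regardless of what the input contained, which is one simple way to read "anchoring that activation".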