Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts — LessWrong
TL;DR: This project compares the effectiveness of top-down and bottom-up activation steering methods in controlling refusal behaviour. In line with pr…
https://www.lesswrong.com/posts/boB3hJiZijxM3J6Ed/comparing-the-effectiveness-of-top-down-and-bottom-up