- Top-down: average the activation differences between contrasting prompts to get a steering vector
- Bottom-up: extract the relevant activation features and amplify them
- Hybrid: use both approaches together
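
The three approaches above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the method from the linked post: the activations are made-up 3-dimensional lists, the function names (`top_down_vector`, `bottom_up_vector`, `hybrid_vector`, `steer`) are hypothetical, the "feature direction" stands in for something like a learned feature's direction, and the weighted sum used for the hybrid is just one possible way to "use both".

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def top_down_vector(pos_acts, neg_acts):
    """Top-down: difference of mean activations between two contrasting prompt sets."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def bottom_up_vector(feature_direction, gain):
    """Bottom-up: take one extracted feature direction and amplify it by `gain`."""
    return [gain * x for x in feature_direction]


def hybrid_vector(top_down, bottom_up, weight=0.5):
    """Hybrid (assumed form): weighted sum of the two steering vectors."""
    return [weight * t + (1 - weight) * b for t, b in zip(top_down, bottom_up)]


def steer(activation, vector, alpha=1.0):
    """Add the steering vector, scaled by alpha, to a single activation vector.

    A negative alpha pushes the activation away from the direction.
    """
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy activations (made-up numbers, not real model data).
harmful = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
harmless = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]

td = top_down_vector(harmful, harmless)
bu = bottom_up_vector([1.0, 0.0, 0.0], gain=4.0)
hy = hybrid_vector(td, bu)
steered = steer([0.0, 0.0, 0.0], td, alpha=-1.0)
```

In a real model the same arithmetic would be applied to hidden-state tensors at a chosen layer, and the sign and size of `alpha` decide whether the behaviour is suppressed or strengthened.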
Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts — LessWrong
TL;DR This project compares the effectiveness of top-down and bottom-up activation steering methods in controlling refusal behaviour. In line with pr…
https://www.lesswrong.com/posts/boB3hJiZijxM3J6Ed/comparing-the-effectiveness-of-top-down-and-bottom-up

Seonglae Cho