SAE Steering Vector Hybrid

Creator

Seonglae Cho

Created

2025 Feb 12 23:3

Editor

Seonglae Cho

Edited

2025 Mar 10 16:9

Refs

Top-down - Average the activation differences between contrasting prompt sets to obtain a steering vector

Bottom-up - Extract the relevant SAE features and amplify their activations

Hybrid - Combine both approaches
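
The three approaches above can be sketched in a few lines. This is only an illustrative sketch in plain Python, not the method from the referenced post: the function names, the toy shapes, and in particular the weighted-sum form of the hybrid (controlled by a hypothetical `alpha`) are assumptions made for this example. The SAE decoder is assumed to be a list of per-feature direction vectors in the residual-stream space.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def top_down_vector(pos_acts: Sequence[Sequence[float]],
                    neg_acts: Sequence[Sequence[float]]) -> Vector:
    """Top-down: difference between the mean activations of two prompt sets."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def bottom_up_vector(decoder: Sequence[Sequence[float]],
                     feature_ids: Sequence[int],
                     scale: float) -> Vector:
    """Bottom-up: sum the decoder directions of chosen SAE features, amplified by `scale`."""
    total = [0.0] * len(decoder[0])
    for fid in feature_ids:
        for i, w in enumerate(decoder[fid]):
            total[i] += scale * w
    return total


def hybrid_vector(td: Vector, bu: Vector, alpha: float = 0.5) -> Vector:
    """Hybrid (assumed form): weighted sum of the top-down and bottom-up vectors."""
    return [alpha * t + (1.0 - alpha) * b for t, b in zip(td, bu)]


# Toy data: 2-dimensional activations and a 3-feature SAE decoder (all hypothetical).
td = top_down_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
bu = bottom_up_vector([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], feature_ids=[0, 2], scale=2.0)
steer = hybrid_vector(td, bu, alpha=0.5)
```

In practice the resulting vector would be added to the model's activations at a chosen layer; which layer, which features, and how strongly to scale them are choices this sketch does not settle.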

Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts — LessWrong

TL;DR This project compares the effectiveness of top-down and bottom-up activation steering methods in controlling refusal behaviour. In line with pr…

https://www.lesswrong.com/posts/boB3hJiZijxM3J6Ed/comparing-the-effectiveness-of-top-down-and-bottom-up
