Dynamic SAE Guardrails
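Despite the name, it is not really dynamic; it is a type of conditional vector steering, where the intervention is applied only when a condition on the input is met.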
Gradient
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails...
Machine unlearning is a promising approach to improve LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from issues such as high...
https://openreview.net/forum?id=8gFO7ebDLT
SAE DSG (Dynamic SAE guardrail)
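To make the "conditional vector steering" reading concrete, here is a minimal sketch of the general pattern, not the paper's implementation. It assumes we already have per-token SAE feature activations and a list of flagged feature indices; the function name `conditional_steer`, the thresholds, and the clamp value are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def conditional_steer(
    feature_acts: Sequence[Sequence[float]],  # [tokens][features], assumed SAE activations
    flagged: Sequence[int],                   # assumed indices of features tied to the unwanted topic
    act_threshold: float = 0.0,               # hypothetical: a feature "fires" above this value
    trigger_fraction: float = 0.2,            # hypothetical: share of tokens needed to trigger
    clamp_value: float = -1.0,                # hypothetical: value written into flagged features
) -> List[List[float]]:
    """Clamp the flagged features only if enough tokens activate them."""
    if not feature_acts:
        return []

    # Condition: count tokens where at least one flagged feature fires.
    hits = sum(
        1
        for token in feature_acts
        if any(token[i] > act_threshold for i in flagged)
    )
    triggered = hits / len(feature_acts) >= trigger_fraction

    # Condition not met: return an unchanged copy (no intervention).
    if not triggered:
        return [list(token) for token in feature_acts]

    # Condition met: overwrite every flagged feature on every token.
    steered = []
    for token in feature_acts:
        row = list(token)
        for i in flagged:
            row[i] = clamp_value
        steered.append(row)
    return steered
```

The point of the sketch is that the steering itself is a fixed clamp; the only input-dependent part is the yes/no trigger, which is why the method reads as conditional rather than dynamic.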

Seonglae Cho