Gradient Routing

Creator
Seonglae Cho
Created
2024 Dec 18 15:44
Editor
Edited
2025 Apr 21 21:2
Refs

Gradient Masking during Backpropagation to Limit the Effect of Data Points on Specific Model Subcomponents

This approach leads each subcomponent to specialize in a limited set of features, which makes it easier to remove potentially harmful features before public deployment. Detaching a subcomponent still costs some performance (the Alignment Tax), but the method gives direct control over which features are learned in which part of the model, even for models released to the public.
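The masking idea above can be sketched in a few lines. This is a toy illustration with hand-written gradients, not the implementation from the referenced work: the two-unit linear model, the squared-error loss, and the names `forward`, `routed_step`, `ablate`, and `ROUTES` are all assumptions made for the example. Each data point carries a route label, and the label selects a mask that decides which hidden unit is allowed to receive that data point's gradient.

```python
def forward(W, x):
    """Hidden activations h = W x and scalar output y = sum(h)."""
    h = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    return h, sum(h)


def routed_step(W, x, target, mask, lr=0.1):
    """One SGD step on 0.5 * (y - target)**2 with a per-example gradient mask.

    mask[j] = 1.0 lets the gradient reach hidden unit j; 0.0 blocks it.
    The forward pass is unchanged; only the backward pass is masked.
    """
    _, y = forward(W, x)
    err = y - target  # dL/dy, which equals dL/dh_j because y = sum(h)
    return [
        [w - lr * mask[j] * err * x_i for w, x_i in zip(row, x)]
        for j, row in enumerate(W)
    ]


def ablate(W, j):
    """Detach hidden unit j by zeroing its weights (hypothetical helper)."""
    return [[0.0] * len(row) if k == j else list(row) for k, row in enumerate(W)]


# Hypothetical routing table: data labelled "A" trains unit 0, "B" trains unit 1.
ROUTES = {"A": [1.0, 0.0], "B": [0.0, 1.0]}

W0 = [[0.5, -0.5], [0.25, 0.75]]
x = [1.0, 2.0]
W1 = routed_step(W0, x, target=3.0, mask=ROUTES["A"])

# Unit 1 never saw the gradient of the "A" example, so its weights are unchanged.
print(W1[1] == W0[1])
# After ablating unit 0, the output depends only on unit 1.
print(forward(ablate(W1, 0), x)[1])
```

In an autodiff framework the same effect is typically obtained by applying a stop-gradient to the masked-out part of the activation rather than by editing gradients by hand; the manual version is used here only to keep the sketch dependency-free.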
Gradient routing can even make latent dimensions interpretable in the sense of Monosemanticity, which is usually achieved through a Neuron SAE.
https://www.lesswrong.com/posts/8zDjhJNoFhMuHB5Kc/creating-interpretable-latent-spaces-with-gradient-routing
This approach suggests that future models could develop separate Brain Lobes through selective gradient propagation, similar to how the left and right hemispheres of the brain specialize in different functions.

Steering Scalar