Gradient Masking during Backpropagation to Limit the Effect of Data Points on Specific Model Subcomponents
This approach drives each subcomponent to specialize in a limited set of features, making it easier to remove potentially harmful features before public deployment. While the method still incurs performance degradation (an alignment tax) when a subcomponent is detached, it offers more fundamental control over the model's internal structure and features, even in publicly deployed models.
It can even make the latent dimensions interpretable in terms of monosemanticity, a property usually pursued by training sparse autoencoders (SAEs) on neuron activations.
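
The core mechanism can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the author's implementation: the two-lobe architecture and the names `TwoLobeNet`, `lobe_a`, `lobe_b`, and the per-batch `lobe` tag are assumptions made for the example. The essential step is erasing the gradients of the non-target subcomponent between `backward()` and the optimizer update, so a given batch of data can only shape the weights of its assigned lobe.

```python
import torch
import torch.nn as nn

class TwoLobeNet(nn.Module):
    """Hypothetical model with two parallel subcomponents ("lobes")."""
    def __init__(self, d_in=32, d_hidden=64, d_out=10):
        super().__init__()
        self.lobe_a = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                    nn.Linear(d_hidden, d_out))
        self.lobe_b = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                    nn.Linear(d_hidden, d_out))

    def forward(self, x, use_a=True, use_b=True):
        # Sum the contributions of whichever lobes are enabled.
        out = 0.0
        if use_a:
            out = out + self.lobe_a(x)
        if use_b:
            out = out + self.lobe_b(x)
        return out

def masked_step(model, optimizer, loss_fn, x, y, lobe):
    """One update in which only the lobe tagged for this batch learns."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()  # gradients are computed for both lobes here
    # Gradient mask: drop the gradients of the non-target lobe so this
    # batch cannot influence its weights. Using None (not zeros) also
    # leaves the optimizer's momentum statistics for that lobe untouched.
    frozen = model.lobe_b if lobe == "a" else model.lobe_a
    for p in frozen.parameters():
        p.grad = None
    optimizer.step()
    return loss.item()

# Usage: route each batch to a lobe by whatever feature tag the data carries.
model = TwoLobeNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
masked_step(model, opt, nn.CrossEntropyLoss(), x, y, lobe="a")
```

Setting the masked gradients to `None` rather than zeroing them is a deliberate choice here: with a zeroed gradient, an optimizer such as Adam would still decay its momentum statistics for the frozen lobe, whereas `None` causes the parameters to be skipped entirely.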

This approach also points to the possibility of developing separate "brain lobes" through selective gradient propagation in future models, analogous to how the left and right hemispheres of the brain divide their functions.
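
Continuing the sketch above (still an illustrative assumption, not the author's procedure), detaching a subcomponent before public deployment then reduces to disabling or deleting one lobe at inference time. The remaining lobe keeps only the features it specialized in, at the cost of whatever accuracy the detached lobe contributed (the alignment tax noted above).

```python
# Continuing the TwoLobeNet sketch: serve the model with lobe B detached.
with torch.no_grad():
    logits = model(x, use_b=False)  # only lobe A's specialized features remain
```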