SGTM

Creator: Seonglae Cho
Created: 2026 Mar 6 17:3
Edited: 2026 Mar 6 17:16

Selective Gradient Masking

A method to address the problem of LLMs learning dangerous knowledge (such as CBRN) by guiding dangerous knowledge to be stored only in specific parameters, then removing those parameters (setting them to 0) after training to delete the knowledge.
Model parameters are divided into two groups:
  • retain parameters: store general knowledge
  • forget parameters: store dangerous knowledge
During training on dangerous data, gradients are masked so that only the forget parameters are updated. After training, the forget parameters are removed → the dangerous knowledge is deleted. Additionally, even when some dangerous data is left unlabeled, the gradient pathways already formed cause that knowledge to be naturally absorbed into the forget parameters (absorption).
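The masking step above can be sketched with a toy parameter vector. This is a minimal illustration, not the paper's implementation: the forget/retain split, the learning rate, and the alternating batch schedule are all hypothetical choices made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: 8 parameters; the last 2 are designated "forget" parameters.
# (Hypothetical split -- in practice dedicated weights are chosen in the model.)
params = rng.normal(size=8)
forget_mask = np.zeros_like(params, dtype=bool)
forget_mask[6:] = True

def sgtm_step(params, grad, is_dangerous, lr=0.1):
    """One masked SGD step: batches labeled dangerous update only the
    forget parameters; all other batches update only the retain parameters."""
    mask = forget_mask if is_dangerous else ~forget_mask
    return params - lr * grad * mask  # boolean mask zeroes the blocked gradients

# Simulate a few training steps with random gradients,
# alternating dangerous and benign batches.
for step in range(4):
    grad = rng.normal(size=params.shape)
    params = sgtm_step(params, grad, is_dangerous=(step % 2 == 0))

# After training, delete the localized knowledge by zeroing the forget parameters.
params[forget_mask] = 0.0
print(params[forget_mask])  # -> [0. 0.]
```

Note that a dangerous batch leaves the retain parameters untouched by construction, which is what keeps general capabilities intact when the forget parameters are later removed.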
  • As model scale increases, the knowledge-localization effect becomes stronger.
  • Under adversarial fine-tuning, existing unlearning methods recover the knowledge in about 50 steps, while SGTM requires about 350 steps (roughly 7x more robust).
  • In other words, isolating and then removing dedicated parameters inside the model is more effective than conventional unlearning.
 
 
 
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs
Large Language Models increasingly possess capabilities that carry dual-use risks, including knowledge about chemical, biological, radiological, and nuclear (CBRN) weapons. To address these risks, prior work proposed Gradient Routing—a technique that localizes target knowledge into dedicated model parameters that can later be removed. We explore an improved variant of Gradient Routing, which we call Selective GradienT Masking (SGTM). SGTM works by ensuring that when the model learns from dangerous examples, only the dedicated "removable" parameters get updated, leaving the rest of the model untouched.