Catastrophic forgetting

Catastrophic interference

Reverse of: Grokking

Catastrophic forgetting is the tendency of a neural network to abruptly lose previously learned information when it learns something new. Because deep neural networks are typically trained with a cross-entropy-based loss, training fits the distribution of the current data without explicitly accounting for the distribution of previously seen data. Scaling the model (more capacity) partly mitigates the effect.

A balance is needed between acquiring new capabilities and retaining existing ones. Grokking and catastrophic forgetting may be opposite failure modes of this trade-off, potentially captured by a single mechanistic metric: circuit stability.

Each circuit vector is built from component activations for a pair of inputs and the loss gradient with respect to those activations. Circuit stability is then computed as the Spearman rank correlation between the resulting circuit vectors.
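The Spearman step can be sketched in plain Python. This is only an illustration under stated assumptions: the per-component score used here (activation difference times gradient, in the style of attribution patching) is a guess at how activations and gradients are combined, not the paper's confirmed formula, and the names `attribution_scores`, `average_ranks`, `spearman`, and the toy numbers are hypothetical.

```python
def attribution_scores(act_a, act_b, grads):
    """ASSUMPTION: score each component as (activation difference) * gradient.
    The source does not give the exact formula; this is a stand-in."""
    return [(a - b) * g for a, b, g in zip(act_a, act_b, grads)]


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(u, v):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    ru, rv = average_ranks(u), average_ranks(v)
    n = len(ru)
    mu, mv = sum(ru) / n, sum(rv) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    var_u = sum((a - mu) ** 2 for a in ru)
    var_v = sum((b - mv) ** 2 for b in rv)
    if var_u == 0 or var_v == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (var_u * var_v) ** 0.5


# Toy circuit vectors over four components (made-up numbers).
# Same activations, different gradients, e.g. from two training checkpoints.
acts_x = [1.0, 0.5, 2.0, 0.1]
acts_y = [0.2, 0.4, 1.0, 0.1]
circuit_t0 = attribution_scores(acts_x, acts_y, [0.5, -1.0, 2.0, 0.3])
circuit_t1 = attribution_scores(acts_x, acts_y, [0.4, -0.8, 0.1, 0.3])
stability = spearman(circuit_t0, circuit_t1)
```

A value near 1 means the components keep the same importance ordering between the two vectors; lower values mean the ordering has been reshuffled. Which pair of circuit vectors is compared (checkpoints, tasks, or inputs) is also an assumption in this sketch.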

Understanding Knowledge Acquisition and Release in Language Models...

General agents must acquire new capabilities while preserving existing ones. Two phenomena make this balance difficult: grokking, where memorization abruptly ends during training; and forgetting...

https://openreview.net/forum?id=MYgw7DBJye

Catastrophic interference

Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to abruptly and drastically forget previously learned information upon learning new information. Neural networks are an important part of the network approach and connectionist approach to cognitive science. With these networks, human capabilities such as memory and learning can be modeled using computer simulations.

https://en.wikipedia.org/wiki/Catastrophic_interference

Even when supervised fine-tuning (SFT) and reinforcement learning (RL) reach the same performance on a new task, RL tends to preserve prior knowledge much better. The authors argue that the extent of forgetting is driven not by the training algorithm itself but by the amount of distributional shift.

A key predictor of forgetting is the forward KL divergence between the fine-tuned model and the base model, measured on the new-task distribution. The authors show theoretically that, under a binary reward, policy gradients are equivalent to EM with information projection, and that on-policy RL converges to the KL-minimal optimal policy among representable policies. This is the core idea behind RL's Razor: among multiple solutions that solve the new task, RL selects the one closest to the base model in KL divergence.
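As a rough illustration of the KL-based predictor, the sketch below averages a KL divergence between base-model and fine-tuned output distributions over new-task prompts. It is a toy under stated assumptions: the distributions are made-up categorical vectors rather than real model outputs, taking KL(base || fine-tuned) as the "forward" direction is an assumption, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same outcomes.
    Terms with p_i == 0 contribute nothing; q_i is floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl_on_new_task(base_probs, tuned_probs):
    """Average KL(base || tuned) over a batch of new-task prompts.
    ASSUMPTION: this direction stands in for the 'forward KL' in the text."""
    kls = [kl_divergence(b, t) for b, t in zip(base_probs, tuned_probs)]
    return sum(kls) / len(kls)


# Made-up output distributions for two new-task prompts, two outcomes each.
base = [[0.5, 0.5], [0.9, 0.1]]
tuned_close = [[0.6, 0.4], [0.9, 0.1]]   # stays near the base model
tuned_far = [[0.99, 0.01], [0.1, 0.9]]   # drifts far from the base model

shift_close = mean_kl_on_new_task(base, tuned_close)
shift_far = mean_kl_on_new_task(base, tuned_far)
```

If both candidates solved the new task equally well, the claim in the text is that the one with the smaller shift (`tuned_close` here) would be expected to forget less, and that on-policy RL tends to land on such solutions.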

RL's Razor: Why Online Reinforcement Learning Forgets Less

Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance at a new task, RL preserves prior knowledge and...

https://arxiv.org/abs/2509.04259
