Artificial neural networks are functions that approximately satisfy the Lipschitz assumption, which makes it possible to analyze and improve their behavior in the learned representation space
This means that the rate of change is constrained by a constant
Lipschitz continuous
A function $f$ is Lipschitz continuous if there exists a constant $K \ge 0$ (the Lipschitz constant) such that for every $x_1, x_2$: $\|f(x_1) - f(x_2)\| \le K \|x_1 - x_2\|$.
$K$ is the bound that constrains the rate of change of $f$
Lipschitz constant
The maximum ratio of output change to input change, i.e., the worst-case factor by which a change in the input can be amplified in the output
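This ratio can be probed numerically. A minimal sketch (the toy function $f(x) = 3\sin x$ is an assumption for illustration; its true Lipschitz constant is 3): sampling random input pairs and taking the largest output-change / input-change ratio gives an empirical lower bound on the Lipschitz constant.

```python
import numpy as np

# Toy 1-D function for illustration; its true Lipschitz constant is 3
# (|d/dx 3*sin(x)| = |3*cos(x)| <= 3).
f = lambda x: 3 * np.sin(x)

rng = np.random.default_rng(0)
x1 = rng.uniform(-5, 5, 1000)
x2 = rng.uniform(-5, 5, 1000)

# Empirical estimate: the worst observed ratio of output change to
# input change. This lower-bounds the true constant and never exceeds 3.
ratios = np.abs(f(x1) - f(x2)) / np.abs(x1 - x2)
K_hat = ratios.max()
print(K_hat)
```

Sampling only ever gives a lower bound; certifying an upper bound requires analyzing the function itself (e.g., bounding its derivative).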
Lipschitz robustness
Lipschitz robustness is an approach to certifying that a model's output (or loss) does not change significantly even when there are small perturbations in the input, using the Lipschitz constant as a certificate.
In classification tasks, we want predictions to remain stable under small changes around an input $x$, i.e., for any $x'$ with $\|x' - x\| \le \epsilon$. If a function $f$ has Lipschitz constant $K$, the output change is bounded: $\|f(x') - f(x)\| \le K \|x' - x\| \le K\epsilon$.
This enables certification such as "if the output (logit) change is smaller than the margin, the class does not change."
- Margin of logits: difference between the top logit and the second-highest logit
- If the upper bound of logit change caused by perturbation is smaller than the margin, the argmax does not change → we obtain a certified robust radius
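The margin argument above can be sketched directly (a minimal illustration, assuming each individual logit is $L$-Lipschitz in the input; the function name and example values are hypothetical):

```python
import numpy as np

def certified_radius(logits: np.ndarray, L: float) -> float:
    """Certified robust radius from the logit margin.

    Assumes each logit is L-Lipschitz: a perturbation of norm r moves
    each logit by at most L*r, so the gap between the top two logits
    shrinks by at most 2*L*r. The argmax is therefore unchanged for
    all perturbations with norm < margin / (2*L).
    """
    top2 = np.sort(logits)[-2:]        # second-highest, highest
    margin = top2[1] - top2[0]
    return margin / (2 * L)

# Margin = 4.0 - 2.0 = 2.0, so the certified radius is 1.0 when L = 1.
print(certified_radius(np.array([4.0, 1.0, 2.0]), L=1.0))  # 1.0
```

A larger margin or a smaller Lipschitz constant both enlarge the certified region, which is why Lipschitz-constrained training typically maximizes margins while keeping $L$ small.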
For Lipschitz robustness to hold, the following conditions are typically required:
- A meaningful distance metric (norm) must exist in the input space
- Images: distances are naturally defined
- Language: distance between sentences is neither well defined nor task-invariant (e.g., "happy"→"sad" is harmless for a grammaticality task but decisive for a sentiment task)
- The model/architecture's Lipschitz constant must be meaningfully bounded
- Transformer-based models often fail to admit a "good" Lipschitz upper bound (dot-product attention is not globally Lipschitz), and when bounds do exist they are typically so conservative as to be vacuous (useless)
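Why such bounds turn vacuous can be seen in the standard construction for plain feed-forward networks: with 1-Lipschitz activations (e.g., ReLU), the product of the layers' spectral norms upper-bounds the network's Lipschitz constant, and this product tends to explode with depth. A minimal sketch with toy random weights (an assumption; not a bound for any real model):

```python
import numpy as np

# Toy MLP weights for illustration; real trained weights would be used
# in practice.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) for _ in range(6)]

# Upper bound on the MLP's Lipschitz constant, assuming 1-Lipschitz
# activations: the product of per-layer spectral norms (largest
# singular values). Each factor exceeding 1 multiplies the bound,
# so depth makes it blow up.
bound = 1.0
for W in weights:
    bound *= np.linalg.norm(W, 2)  # spectral norm = top singular value
print(bound)
```

The bound is valid but loose: it assumes every layer simultaneously stretches its worst-case direction, which trained networks rarely do, so the certificate says little about typical inputs.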
0021 Gradient Descent & Momentum - Deepest Documentation: https://deepestdocs.readthedocs.io/en/latest/002_deep_learning_part_1/0021/
Knowledge Continuity
Traditional Lipschitz robustness works well for images (continuous metric spaces) but is difficult to apply to language (discrete, non-metric spaces), and Transformers are not Lipschitz in theory, so there is no universal certification method for NLP/LLMs. Knowledge Continuity therefore focuses not on input changes but on how stable knowledge relationships are in the hidden representation space: it replaces input distance with hidden-representation distance paired with loss change
- k-volatility: at layer k, the expected ratio of loss change to representation distance; larger values indicate instability
- Knowledge Continuity: if the average volatility across all data is small → robust
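The two bullets above can be sketched as a simple pairwise estimator (a hypothetical illustration: the function name, the pairwise averaging scheme, and the random data are assumptions, not the paper's exact formulation):

```python
import numpy as np

def k_volatility(reps: np.ndarray, losses: np.ndarray) -> float:
    """Empirical volatility at one layer.

    reps:   (n, d) hidden representations at layer k
    losses: (n,)   per-example loss values

    Estimates the expected ratio of loss change to representation
    distance over all data pairs; a larger value means the layer's
    knowledge is less stable.
    """
    i, j = np.triu_indices(len(losses), k=1)     # all unordered pairs
    dist = np.linalg.norm(reps[i] - reps[j], axis=1)
    dloss = np.abs(losses[i] - losses[j])
    return float(np.mean(dloss / (dist + 1e-12)))  # eps avoids /0

# Toy data standing in for a model's layer-k activations and losses.
rng = np.random.default_rng(0)
reps = rng.standard_normal((32, 16))
losses = rng.uniform(0, 1, 32)
print(k_volatility(reps, losses))  # small value -> stable layer
```

Averaging this quantity over layers and data gives the knowledge-continuity criterion: a small average volatility certifies robustness without requiring an input-space metric.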
Unlike Lipschitz constraints, a model can satisfy Knowledge Continuity and remain robust while still retaining Universal Approximation. In CV settings (continuous input spaces), Knowledge Continuity is compatible with, and roughly coincides with, Lipschitz robustness

Seonglae Cho