Artificial neural networks are functions that approximately satisfy the Lipschitz assumption, which makes it possible to analyze and improve their behavior in the learned representation space
This means that the rate of change is constrained by a constant
Lipschitz continuous
A function $f$ is Lipschitz continuous if there exists a constant $K \ge 0$ (the Lipschitz constant) such that for every $x_1, x_2$: $\|f(x_1) - f(x_2)\| \le K \|x_1 - x_2\|$.
$K$ is the bound that constrains the rate of change of $f$
Lipschitz constant
The maximum ratio of output change to input change, i.e., the worst-case factor by which a change in the input can be amplified in the output
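This ratio can be probed numerically. A minimal sketch (the toy function $f(x) = 3\sin x$ is an assumption for illustration; its true Lipschitz constant is 3): sampling random input pairs and taking the largest output-change / input-change ratio gives an empirical lower bound on the Lipschitz constant.

```python
import numpy as np

# Toy 1-D function for illustration; its true Lipschitz constant is 3
# (|d/dx 3*sin(x)| = |3*cos(x)| <= 3).
f = lambda x: 3 * np.sin(x)

rng = np.random.default_rng(0)
x1 = rng.uniform(-5, 5, 1000)
x2 = rng.uniform(-5, 5, 1000)

# Empirical estimate: the worst observed ratio of output change to
# input change. This lower-bounds the true constant and never exceeds 3.
ratios = np.abs(f(x1) - f(x2)) / np.abs(x1 - x2)
K_hat = ratios.max()
print(K_hat)
```

Sampling only ever gives a lower bound; certifying an upper bound requires analyzing the function itself (e.g., bounding its derivative).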
Lipschitz robustness
Lipschitz robustness is an approach to certifying that a model's output (or loss) does not change significantly even when there are small perturbations in the input, using the Lipschitz constant as a certificate.
In classification tasks, we want predictions to remain stable under small changes around an input $x$, i.e., for any $x'$ with $\|x' - x\| \le \epsilon$. If a function $f$ has Lipschitz constant $K$, the output change is bounded: $\|f(x') - f(x)\| \le K \|x' - x\| \le K\epsilon$.
This enables certification such as "if the output (logit) change is smaller than the margin, the class does not change."
- Margin of logits: difference between the top logit and the second-highest logit
- If the upper bound of logit change caused by perturbation is smaller than the margin, the argmax does not change → we obtain a certified robust radius
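The margin argument above can be sketched directly (a minimal illustration, assuming each individual logit is $L$-Lipschitz in the input; the function name and example values are hypothetical):

```python
import numpy as np

def certified_radius(logits: np.ndarray, L: float) -> float:
    """Certified robust radius from the logit margin.

    Assumes each logit is L-Lipschitz: a perturbation of norm r moves
    each logit by at most L*r, so the gap between the top two logits
    shrinks by at most 2*L*r. The argmax is therefore unchanged for
    all perturbations with norm < margin / (2*L).
    """
    top2 = np.sort(logits)[-2:]        # second-highest, highest
    margin = top2[1] - top2[0]
    return margin / (2 * L)

# Margin = 4.0 - 2.0 = 2.0, so the certified radius is 1.0 when L = 1.
print(certified_radius(np.array([4.0, 1.0, 2.0]), L=1.0))  # 1.0
```

A larger margin or a smaller Lipschitz constant both enlarge the certified region, which is why Lipschitz-constrained training typically maximizes margins while keeping $L$ small.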
For Lipschitz robustness to hold, the following conditions are typically required:
- A meaningful distance metric (norm) must exist in the input space
- Images: distances are naturally defined
- Language: distance between sentences is neither well defined nor task-invariant (e.g., "happy"→"sad" is harmless for a grammaticality task but decisive for a sentiment task)
- The model/architecture's Lipschitz constant must be meaningfully bounded
- Transformer-based models often fail to admit a "good" Lipschitz upper bound (dot-product attention is not globally Lipschitz), and when bounds do exist they are typically so conservative as to be vacuous (useless)
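Why such bounds turn vacuous can be seen in the standard construction for plain feed-forward networks: with 1-Lipschitz activations (e.g., ReLU), the product of the layers' spectral norms upper-bounds the network's Lipschitz constant, and this product tends to explode with depth. A minimal sketch with toy random weights (an assumption; not a bound for any real model):

```python
import numpy as np

# Toy MLP weights for illustration; real trained weights would be used
# in practice.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) for _ in range(6)]

# Upper bound on the MLP's Lipschitz constant, assuming 1-Lipschitz
# activations: the product of per-layer spectral norms (largest
# singular values). Each factor exceeding 1 multiplies the bound,
# so depth makes it blow up.
bound = 1.0
for W in weights:
    bound *= np.linalg.norm(W, 2)  # spectral norm = top singular value
print(bound)
```

The bound is valid but loose: it assumes every layer simultaneously stretches its worst-case direction, which trained networks rarely do, so the certificate says little about typical inputs.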
0021 Gradient Descent & Momentum - Deepest Documentation: https://deepestdocs.readthedocs.io/en/latest/002_deep_learning_part_1/0021/
Knowledge Continuity
Traditional Lipschitz robustness works well for images (continuous metric spaces) but is difficult to apply to language (discrete, non-metric spaces), and Transformers are not Lipschitz in theory, so there is no universal certification method for NLP/LLMs. Knowledge Continuity therefore focuses not on input changes but on how stable knowledge relationships are in the hidden representation space: it replaces input distance with hidden-representation distance paired with loss change
- k-volatility: at layer k, the expected ratio of loss change to representation distance; larger values indicate instability
- Knowledge Continuity: if the average volatility across all data is small → robust
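The two bullets above can be sketched as a simple pairwise estimator (a hypothetical illustration: the function name, the pairwise averaging scheme, and the random data are assumptions, not the paper's exact formulation):

```python
import numpy as np

def k_volatility(reps: np.ndarray, losses: np.ndarray) -> float:
    """Empirical volatility at one layer.

    reps:   (n, d) hidden representations at layer k
    losses: (n,)   per-example loss values

    Estimates the expected ratio of loss change to representation
    distance over all data pairs; a larger value means the layer's
    knowledge is less stable.
    """
    i, j = np.triu_indices(len(losses), k=1)     # all unordered pairs
    dist = np.linalg.norm(reps[i] - reps[j], axis=1)
    dloss = np.abs(losses[i] - losses[j])
    return float(np.mean(dloss / (dist + 1e-12)))  # eps avoids /0

# Toy data standing in for a model's layer-k activations and losses.
rng = np.random.default_rng(0)
reps = rng.standard_normal((32, 16))
losses = rng.uniform(0, 1, 32)
print(k_volatility(reps, losses))  # small value -> stable layer
```

Averaging this quantity over layers and data gives the knowledge-continuity criterion: a small average volatility certifies robustness without requiring an input-space metric.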
Unlike Lipschitz constraints, a model can satisfy Knowledge Continuity and remain robust while still retaining Universal Approximation. In CV settings (continuous input spaces), Knowledge Continuity is compatible with, and roughly coincides with, Lipschitz robustness

Seonglae Cho