Adversarial Training
A defense method that makes models robust against adversarial examples.

Adversarial examples are effective because of the Superposition Hypothesis: features are stored in overlapping, non-orthogonal directions, so they interfere with one another.
Because of this interference, even a small perturbation along one feature's direction simultaneously disturbs other features, letting an attacker achieve a large effect with a minimal perturbation.
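The interference can be seen in a toy sketch (everything below is illustrative, not from the paper): three features packed into a 2-D space cannot be orthogonal, so a perturbation aimed at one feature leaks into the readouts of the others.

```python
import numpy as np

# Hypothetical toy setup: 3 feature directions in a 2-D space (superposition),
# spaced 120 degrees apart so no pair can be orthogonal.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

x = np.zeros(2)                    # activation with no features active
delta = 0.1 * W[0]                 # small perturbation along feature 0 only

readout_after = W @ (x + delta)    # dot-product readout of each feature
print(readout_after)               # feature 0 moves, but features 1 and 2 move too
```

The readout of feature 0 changes by 0.1, but the readouts of features 1 and 2 each change by -0.05 as a side effect, even though the perturbation targeted only feature 0.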
Accordingly, the paper proposes the chain Adversarial Training → increased robustness → reduced superposition → increased interpretability, thereby connecting robustness and interpretability.