
Adversarial Examples works due to the Superposition Hypothesis
This interference occurs because even a small stimulation of a specific feature can simultaneously disturb other features, allowing attackers to achieve significant effects with minimal perturbations.
For this, the paper proposes that Adversarial Training → increased robustness → reduced superposition → increased interpretability, thus connecting robustness and interpretability
influence humans too