NTK theory
Neural networks are heavily over-parameterized yet generalize well, a behavior reminiscent of kernel methods. NTK theory approximates the network by a linear model through a first-order Taylor expansion in the weights around initialization, so training with gradient descent behaves like kernel regression. In the limit of infinite width (infinitely many nodes and parameters) and an infinitesimally small learning rate, gradient flow under back propagation can always drive the training loss to zero. The kernel is called "tangent" because it is built from the tangent (first-order) sensitivity of the network's output to small changes in its weights.
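Concretely, the linearization and the resulting kernel can be written as follows (a standard statement of the theory; \theta_0 denotes the initial weights, notation introduced here for clarity):

f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)

K(x, x') = \nabla_\theta f(x; \theta_0)^\top \nabla_\theta f(x'; \theta_0)

The first line is the Taylor expansion that turns the network into a model that is linear in its weights; the second defines the NTK as the inner product of the weight gradients at two inputs.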
The Universal Approximation Theorem explains what functions a network can represent, while NTK theory focuses on gradient flow and training dynamics.
As a result, linearity means that we can write a closed-form solution for the training dynamics, and this solution depends critically on the neural tangent kernel. Each element of the kernel is the inner product of the parameter-gradient vectors for a pair of training examples. This can be calculated for any network, and we call it the empirical NTK. If we let the width become infinite, the kernel itself has a closed form, referred to as the analytical NTK.
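As a concrete illustration, here is a minimal sketch of computing the empirical NTK of a small MLP with JAX; the architecture, initialization, and data are illustrative assumptions, not taken from any specific reference.

import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, widths=(1, 64, 64, 1)):
    # Simple fully connected layers with 1/sqrt(fan-in) initialization (an assumption).
    params = []
    for din, dout in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        W = jax.random.normal(sub, (din, dout)) / jnp.sqrt(din)
        b = jnp.zeros(dout)
        params.append((W, b))
    return params

def mlp(params, x):
    h = x
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)   # one scalar output per example

def empirical_ntk(params, X):
    flat, unravel = ravel_pytree(params)
    # Jacobian of the outputs w.r.t. the flattened weights: shape (n_examples, n_params).
    f = lambda w: mlp(unravel(w), X)
    J = jax.jacrev(f)(flat)
    # K[i, j] = inner product of the gradient vectors for examples i and j.
    return J @ J.T

key = jax.random.PRNGKey(0)
X = jnp.linspace(-1.0, 1.0, 8)[:, None]   # toy 1-D inputs (illustrative)
K = empirical_ntk(init_params(key), X)
print(K.shape)   # (8, 8): one entry per pair of training examples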
Fourier features
Analyzing gradient descent through this lens lets us understand the convergence properties of neural networks. Decomposing the dynamics of the output with the chain rule under a least-squares loss yields an ODE governed by the NTK. In the over-parameterized regime, the kernel is positive-definite and therefore admits an eigenvalue decomposition, and each eigenvalue gives the convergence rate of the corresponding error component.
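Sketching that derivation: under the squared loss L = \tfrac{1}{2}\lVert f_t(X) - y \rVert^2, gradient flow on the weights of the linearized model gives

\frac{d f_t(X)}{dt} = -K\,(f_t(X) - y), \qquad f_t(X) - y = e^{-Kt}\,(f_0(X) - y)

With the eigendecomposition K = \sum_i \lambda_i v_i v_i^\top, the error component along each eigenvector v_i decays as e^{-\lambda_i t}, so directions with large eigenvalues are learned quickly and directions with small eigenvalues slowly.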
It can be seen that Fourier features help neural networks learn high-frequency information better. Natural data generally carry large magnitudes at low frequencies and small magnitudes at high frequencies. Analyzing the NTK of inputs passed through a Fourier feature mapping shows that its eigenvalues fall off more slowly than those of a plain MLP's NTK, which indicates that the network can sufficiently learn the high-frequency components as well.
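Continuing the JAX sketch above, the following compares the empirical-NTK spectra with and without a random Fourier feature mapping; the number of frequencies and the mapping scale are arbitrary choices for illustration, and a slower normalized eigenvalue decay for the Fourier-feature NTK is what would suggest faster learning of high-frequency components.

def fourier_features(x, B):
    # gamma(x) = [cos(2*pi*Bx), sin(2*pi*Bx)] with random frequency matrix B.
    proj = 2.0 * jnp.pi * x @ B.T
    return jnp.concatenate([jnp.cos(proj), jnp.sin(proj)], axis=-1)

key, sub = jax.random.split(key)
B = 10.0 * jax.random.normal(sub, (32, 1))   # 32 random frequencies, scale 10 (assumed)

X_ff = fourier_features(X, B)
widths_ff = (X_ff.shape[-1], 64, 64, 1)      # first layer widened to match the mapping
K_ff = empirical_ntk(init_params(key, widths_ff), X_ff)

for name, K_mat in [("plain", K), ("fourier", K_ff)]:
    eigvals = jnp.linalg.eigvalsh(K_mat)[::-1]   # descending order
    print(name, eigvals / eigvals[0])            # normalized spectrum for comparison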