Generalized Distance or Vector Similarity
The positive-definite kernel, denoted as $K(x,z)$, computes the inner product of feature-mapped inputs: $K(x,z) = \langle \phi(x), \phi(z) \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product (vector similarity) and $\phi$ is the kernel feature mapping function.
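As a concrete illustration (a minimal sketch, not from the original text), the quadratic kernel $K(x,z) = (x^\top z)^2$ on $\mathbb{R}^2$ corresponds to the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the function names below are hypothetical:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, z):
    """Kernel function: inner product of feature-mapped inputs, (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Both compute the same inner product <phi(x), phi(z)>, but K never
# constructs the higher-dimensional feature vectors explicitly.
print(np.dot(phi(x), phi(z)))  # 16.0
print(K(x, z))                 # 16.0
```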
Core Properties:
- Symmetry - A valid kernel function must be symmetric
- Semi-definiteness - The kernel matrix must be positive semi-definite
Mathematical Conditions:
- Symmetry: $K(x, z) = K(z, x)$ (commutative property)
- Positive semi-definiteness: $\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \ge 0$ for any finite set of points $x_1, \dots, x_n$ and real coefficients $c_1, \dots, c_n$ (see the numerical check after this list)
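A quick numerical check of these two conditions for a Gaussian (RBF) Gram matrix might look like the following sketch (the helper names and the choice of kernel are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 points in R^3

# Gram matrix: G[i, j] = K(x_i, x_j)
G = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# Symmetry: K(x, z) == K(z, x)
print(np.allclose(G, G.T))            # True

# Positive semi-definiteness: all eigenvalues of the Gram matrix >= 0
eigvals = np.linalg.eigvalsh(G)
print(eigvals.min() >= -1e-10)        # True (up to numerical tolerance)
```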
Key Features and Applications:
- Enables classification of nonlinear data using linear classifiers through higher-dimensional mapping
- Works through a feature map: computes inner products between transformed input data without constructing the transformation explicitly
- Maintains nonlinear characteristics due to the rarity of orthogonal relationships in high-dimensional spaces
- Generates additional feature dimensions from the existing input dimensions
- Serves as a similarity metric, particularly in the case of the Gaussian kernel (see the sketch after this list)
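To make the similarity-metric interpretation concrete, here is a small sketch (assuming the standard Gaussian kernel with a hypothetical bandwidth gamma): nearby points score close to 1 and distant points close to 0.

```python
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    """Similarity in (0, 1]: 1 when x == z, decaying toward 0 with distance."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

a = np.array([0.0, 0.0])
b = np.array([0.1, 0.1])   # close to a
c = np.array([3.0, 3.0])   # far from a

print(gaussian_kernel(a, b))   # ~0.98   -> very similar
print(gaussian_kernel(a, c))   # ~1.5e-08 -> essentially dissimilar
```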
Theory
Dual solution (dual representation)
In ridge regression, the regularized cost function over feature-mapped inputs is
$$J(w) = \frac{1}{2} \sum_{n=1}^{N} \left( w^\top \phi(x_n) - t_n \right)^2 + \frac{\lambda}{2} \, w^\top w.$$
Setting the derivative of the cost function with respect to $w$ to 0 yields the following equation:
$$w = -\frac{1}{\lambda} \sum_{n=1}^{N} \left( w^\top \phi(x_n) - t_n \right) \phi(x_n).$$
Alternatively, we can rewrite the equation in terms of $\alpha$:
$$w = \sum_{n=1}^{N} \alpha_n \phi(x_n) = \Phi^\top \alpha,$$
where $\alpha_n = -\frac{1}{\lambda} \left( w^\top \phi(x_n) - t_n \right)$. $w$ is thus a linear combination of the training examples. Dual representation refers to learning by expressing the model parameters as a linear combination of training samples instead of learning them directly (primal representation).
The dual representation with proper regularization enables an efficient solution when $p > N$ (more feature dimensions than training samples), because the complexity of the problem depends on the number of examples $N$ instead of on the number of input dimensions $p$: the dual solves an $N \times N$ linear system, whereas the primal solves a $p \times p$ one.
We have two distinct methods for solving the ridge regression optimization (compared numerically in the sketch after this list):
- Primal solution (explicit weight vector): $w = (\Phi^\top \Phi + \lambda I_p)^{-1} \Phi^\top t$
- Dual solution (linear combination of training examples): $\alpha = (\Phi \Phi^\top + \lambda I_N)^{-1} t$, with $w = \Phi^\top \alpha$
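The following sketch (hypothetical variable names, random data) checks numerically that the two closed-form solutions above produce the same weight vector:

```python
import numpy as np

rng = np.random.default_rng(42)
N, p, lam = 50, 5, 0.1
Phi = rng.normal(size=(N, p))         # design matrix of feature vectors
t = rng.normal(size=N)                # targets

# Primal solution: w = (Phi^T Phi + lambda I_p)^(-1) Phi^T t   (p x p system)
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ t)

# Dual solution: alpha = (Phi Phi^T + lambda I_N)^(-1) t        (N x N system)
alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), t)
w_dual = Phi.T @ alpha                # w as a linear combination of examples

print(np.allclose(w_primal, w_dual))  # True
```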
The crucial observation about the dual solution is that the information from the training examples is captured via inner products between pairs of training points in the Gram matrix $G = \Phi \Phi^\top$, with $G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. Since the computation only involves inner products, we can substitute for all occurrences of inner products a kernel function that computes
$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.$$
This way we obtain an algorithm for ridge regression in the feature space defined by the mapping $\phi$.
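As a minimal sketch of the resulting kernelized algorithm (assuming a Gaussian kernel and hypothetical helper names), the fit uses only the Gram matrix and the prediction uses only kernel evaluations against the training points:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian kernel between two sets of points: K[i, j] = k(a_i, b_j)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)   # noisy nonlinear targets

lam = 0.1
K = rbf(X, X)                                     # Gram matrix K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), t)

# Prediction at new points needs only kernel evaluations, never phi explicitly:
X_new = np.array([[0.0], [1.5]])
y_pred = rbf(X_new, X) @ alpha
print(y_pred)   # approximately sin(0.0) and sin(1.5)
```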