Kernel Function

Creator: Seonglae Cho
Created: 2023 Apr 17 7:39
Edited: 2025 Jun 2 15:58

Generalized Distance or Vector Similarity

The positive-definite kernel K(x, z) computes the inner product of feature-mapped inputs. The angle-bracket notation \langle \cdot, \cdot \rangle denotes Vector Similarity (an inner product), and \phi(x) is the kernel's feature mapping function:
K_{nm} = K_{mn} = K(x_n, x_m) = \langle \phi(x_n), \phi(x_m) \rangle
Core Properties:
  1. Symmetry: a valid kernel function must be symmetric, K(x, z) = K(z, x)
  2. Positive semi-definiteness: the kernel (Gram) matrix must be positive semi-definite
Mathematical Conditions:
  • Positive semi-definiteness: \sum_i \sum_j c_i c_j \, k(x_i, x_j) \ge 0 for any choice of c_i and x_i (checked numerically below)
K(x, z) = \langle \phi(x), \phi(z) \rangle = \phi(x) \cdot \phi(z)
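As a small illustration of these conditions, here is a minimal NumPy sketch (my own example, assuming an RBF kernel and random 2-D inputs) that checks the Gram matrix of a valid kernel is symmetric and has no negative eigenvalues, which is equivalent to \sum_i \sum_j c_i c_j k(x_i, x_j) \ge 0:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # 50 random sample points in 2 dimensions (arbitrary choice)

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2), a standard positive-definite kernel."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Gram matrix K_ij = k(x_i, x_j) on the sample
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                     # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # positive semi-definite up to round-off
```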
Key Features and Applications:
  • Enables classification of nonlinear data with linear classifiers by mapping inputs to a higher-dimensional feature space (see the sketch after this list)
  • Works together with a Feature Map: the kernel computes inner products between the transformed input data without evaluating \phi explicitly
  • Preserves nonlinear structure, since exactly orthogonal relationships are rare in high-dimensional spaces
  • Generates additional feature dimensions from the existing input dimensions
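To make the higher-dimensional mapping concrete, the following sketch (my own illustration, assuming 2-D inputs and the homogeneous polynomial kernel of degree 2) shows that k(x, z) = (x \cdot z)^2 computes exactly the inner product of the explicit feature map \phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2), without ever forming \phi:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # inner product in the 3-D feature space
print(poly_kernel(x, z))        # same value, computed directly in the 2-D input space
```

Both lines print 1.0: the kernel evaluates the feature-space inner product using only the original coordinates.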

Theory

Dual solution (dual representation)

For ridge regression, the dual coefficients are found as
\alpha^* = \underset{\alpha}{\arg\min} \; \frac{1}{n^2} \sum_{i,j=1}^n (\alpha_i - y_i)(\alpha_j - y_j) k(\mathbf{x}_i, \mathbf{x}_j) + \lambda \|\boldsymbol{\alpha}\|^2
Setting the derivative of the cost function with respect to \mathbf{w} to 0 yields the following equations:
\mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w} = 0
(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})\,\mathbf{w} = \mathbf{X}^\top \mathbf{y}
\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}
Alternatively, we can rewrite the same condition in terms of \boldsymbol{\alpha}:
\mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w} = 0
\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i = \mathbf{X}^\top \boldsymbol{\alpha}, \quad \text{with } \boldsymbol{\alpha} = \tfrac{1}{\lambda}(\mathbf{y} - \mathbf{X}\mathbf{w})
\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}
where \mathbf{K}_{ij} = \mathbf{x}_i^\top \mathbf{x}_j; the last line follows by substituting \mathbf{w} = \mathbf{X}^\top \boldsymbol{\alpha} back into the definition of \boldsymbol{\alpha}. \mathbf{w} is thus a linear combination of the training examples. Dual representation refers to learning by expressing the model parameters \mathbf{w} as a linear combination of training samples instead of learning them directly (the primal representation).
The dual representation with proper regularization enables an efficient solution when p > N (Sample-to-feature ratio), since the complexity of the problem depends on the number of examples N instead of on the number of input dimensions p.
We have two distinct methods for solving the ridge regression optimization (a numerical check of their equivalence follows the list):
  • Primal solution (explicit weight vector):
\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}
  • Dual solution (linear combination of training examples):
\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}, \quad \mathbf{w} = \mathbf{X}^\top \boldsymbol{\alpha}
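As a quick numerical sanity check (a sketch with synthetic data, not part of the original derivation), the two formulas produce the same weight vector when \mathbf{K} = \mathbf{X}\mathbf{X}^\top is the plain inner-product Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 5                     # N examples, p features (arbitrary sizes)
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
lam = 0.1                        # ridge regularization strength

# Primal solution: solve a p x p linear system
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual solution: solve an N x N linear system, then recover w = X^T alpha
K = X @ X.T                      # K_ij = x_i . x_j
alpha = np.linalg.solve(K + lam * np.eye(N), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))   # True: both routes give the same weights
```

The dual route only ever touches the N x N matrix K, which is why it is preferable when p > N.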
The crucial observation about the dual solution is that the information from the training examples is captured via inner products between pairs of training points in the matrix \mathbf{K} = \mathbf{X}\mathbf{X}^\top. Since the computation only involves inner products, we can substitute every occurrence of an inner product with a kernel function K that computes:
K_{nm} = K_{mn} = K(x_n, x_m) = \langle \phi(x_n), \phi(x_m) \rangle
This way we obtain an algorithm for ridge regression in the feature space F defined by the mapping \phi : x \rightarrow \phi(x) \in F.
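Putting the pieces together, here is a minimal kernel ridge regression sketch (my own example; the RBF kernel, the toy sine data, and the hyperparameters are all assumptions) that fits \boldsymbol{\alpha} on training data and predicts on new points using only kernel evaluations, never the explicit feature map \phi:

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(40, 1))                    # 1-D inputs
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)   # noisy regression target
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
lam, gamma = 0.1, 0.5                                          # assumed hyperparameters

def rbf(A, B, gamma):
    """Gram matrix of the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

# Dual fit: alpha = (K + lambda I)^{-1} y, with K_nm = k(x_n, x_m)
K = rbf(X_train, X_train, gamma)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Prediction: f(x) = sum_n alpha_n k(x_n, x), i.e. inner products in feature space only
y_pred = rbf(X_test, X_train, gamma) @ alpha
print(np.round(y_pred, 3))   # approximates sin(x) at the test points
```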
 
 
 
 
 
 
 
