Kernel Methods

What's the use of Kernels in Machine Learning?

Posted by Andrea Bonvini on December 4, 2020

Prerequisites:

  • Linear Algebra
  • Ridge Regression (only if you want to understand the example provided)

Hi everyone! In this brief blog post we'll talk about the use of kernels in Machine Learning: we'll start with a little introduction, then define the concept of a kernel and show a bunch of examples. Let's start.

Traditionally, theory and algorithms of machine learning and statistics have been very well developed for the linear case. Real-world data analysis problems, on the other hand, often require nonlinear methods to detect the kind of dependencies that allow successful prediction of properties of interest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a usually high-dimensional (possibly infinite-dimensional) feature space. In this space, our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never have to compute explicitly in the high-dimensional feature space! (this is called the Kernel Trick)

Suppose we have a mapping $\varphi : \mathbb{R}^d \to \mathbb{R}^m$ that maps our vectors into some feature space $\mathbb{R}^m$. Then the dot product of $\textbf{x}$ and $\textbf{y}$ in this space is $\varphi (\textbf{x})^T\varphi (\textbf{y})$.

A kernel is a function $k$ that corresponds to this dot product, i.e. $k(\textbf{x},\textbf{y})=\varphi (\textbf{x})^T\varphi (\textbf{y}) $ .

Why is this useful? Kernels give us a way to compute dot products in some feature space without even knowing what this space is or what $\varphi$ looks like.

For example, consider a simple polynomial kernel $k(\textbf{x},\textbf{y})=(1+\textbf{x}^T\textbf{y})^2$ with $\textbf{x},\textbf{y} \in \mathbb{R}^2$.

This doesn't seem to correspond to any mapping function $\varphi$; it's just a function that returns a real number. Assuming that $\textbf{x} = (x_1,x_2)$ and $\textbf{y} = (y_1,y_2)$, let's expand this expression:

\[k(\textbf{x},\textbf{y})=(1+\textbf{x}^T\textbf{y})^2 = (1+x_1y_1 + x_2y_2)^2=\\1+x_1^2y_1^2+x_2^2y_2^2+2x_1y_1+2x_2y_2+2x_1x_2y_1y_2\]

Note that this is nothing else but a dot product between two vectors:

\[\varphi(\mathbf x) = \varphi(x_1, x_2) = (1, x_1^2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2)\]

and

\[\varphi(\mathbf y) = \varphi(y_1, y_2) = (1, y_1^2, y_2^2, \sqrt{2} y_1, \sqrt{2} y_2, \sqrt{2} y_1 y_2)\]

So the kernel $k(\mathbf x, \mathbf y) = (1 + \mathbf x^T \mathbf y)^2 = \varphi(\mathbf x)^T \varphi(\mathbf y)$ computes a dot product in a 6-dimensional space without explicitly visiting this space.
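
As a quick sanity check, here is a minimal NumPy sketch (the feature map `phi` and the two sample vectors are arbitrary choices for illustration) verifying that the kernel value and the explicit dot product in feature space agree:

```python
import numpy as np

def poly_kernel(x, y):
    """Polynomial kernel k(x, y) = (1 + x^T y)^2."""
    return (1.0 + x @ y) ** 2

def phi(x):
    """Explicit 6-dimensional feature map for the 2-D polynomial kernel above."""
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(poly_kernel(x, y))   # 4.0
print(phi(x) @ phi(y))     # 4.0 as well
```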

Another example is the Gaussian kernel $k(\mathbf x, \mathbf y) = \exp\big(- \gamma \, \|\mathbf x - \mathbf y\|^2 \big)$. If we Taylor-expand this function, we'll see that it corresponds to a feature map $\varphi$ with an infinite-dimensional codomain.
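
To see why, expand the squared norm, $\|\mathbf x - \mathbf y\|^2 = \|\mathbf x\|^2 + \|\mathbf y\|^2 - 2\,\mathbf x^T\mathbf y$, so that the kernel factorizes as

\[k(\mathbf x, \mathbf y) = \exp\big(-\gamma\|\mathbf x\|^2\big)\exp\big(-\gamma\|\mathbf y\|^2\big)\exp\big(2\gamma\,\mathbf x^T\mathbf y\big) = \exp\big(-\gamma\|\mathbf x\|^2\big)\exp\big(-\gamma\|\mathbf y\|^2\big)\sum_{j=0}^{\infty}\frac{(2\gamma)^j}{j!}\left(\mathbf x^T\mathbf y\right)^j\]

Each term $(\mathbf x^T\mathbf y)^j$ is (up to scalar factors) a polynomial kernel of degree $j$, so the sum corresponds to a feature map with infinitely many components.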

At the other extreme, the simplest kernel is the linear kernel, which corresponds to the identity feature mapping: $k(\mathbf{x},\mathbf{x}') = \varphi(\mathbf{x})^T\varphi(\mathbf{x}') = \mathbf{x}^T\mathbf{x}'$

Moreover, the kernel is a symmetric function of its arguments: $k(\mathbf{x},\mathbf{x}') = k(\mathbf{x}',\mathbf{x})$


Many linear models for regression and classification can be reformulated in terms of a dual representation, in which the kernel function arises naturally! For example, if we consider a linear ridge regression model, we know that we obtain the best parameters by minimizing the regularized sum-of-squares error function (ridge):

\[L_{\mathbf{w}} = \frac{1}{2}\sum_{n=1}^{N}(\mathbf{w}^T\varphi(\mathbf{x_n})-t_n)^2+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w}=\\ \frac{1}{2}\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w}\]

Where $\Phi$ is the design matrix whose $n^{th}$ row is $\varphi(\mathbf{x}_n)^T$ (remember that in $L_{\mathbf{w}}$ all the vectors are column vectors) and $\mathbf{t} = (t_1,\dots,t_N)^T$ is the target vector.
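
As a tiny sanity check, here is a minimal sketch (the design matrix, targets, weights, and $\lambda$ are arbitrary toy values, with the feature map assumed to be already applied) showing that the sum form and the matrix form of the loss agree:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))      # design matrix: N = 5 samples, M = 3 basis functions
t = rng.normal(size=5)             # target vector
w = rng.normal(size=3)             # some weight vector
lam = 0.1                          # regularization strength lambda

# Sum form: 1/2 * sum_n (w^T phi(x_n) - t_n)^2 + lambda/2 * w^T w
L_sum = 0.5 * sum((w @ Phi[n] - t[n]) ** 2 for n in range(5)) + 0.5 * lam * (w @ w)

# Matrix form: 1/2 * (Phi w - t)^T (Phi w - t) + lambda/2 * w^T w
r = Phi @ w - t
L_mat = 0.5 * (r @ r) + 0.5 * lam * (w @ w)

print(np.isclose(L_sum, L_mat))    # True
```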

Setting the gradient of $L_{\mathbf{w}}$ w.r.t. $\mathbf{w}$ equal to $0$ we obtain the following:

\[\frac{\partial L_\mathbf{w}}{\partial \mathbf{w}}=0\\ \frac{\partial \left(\frac{1}{2}\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w}\right)}{\partial \mathbf{w}}=0\\ \mathbf{\Phi}^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)+\lambda\mathbf{w} = 0\\ \mathbf{w} = -\frac{1}{\lambda}\Phi^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)\\ \mathbf{w} = \Phi^T\mathbf{a}\]

Where $\mathbf{a}=-\frac{1}{\lambda}\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)$ is an $N\times 1$ vector.

We observe that the coefficients $a_n$ are functions of $\mathbf{w}$, so our expression for $\mathbf{w}$ is a function of $\mathbf{w}$ itself... which surely looks circular, just wait for it...

We now define the Gram Matrix $\mathbf{K} = \Phi \Phi^T$, an $N \times N$ matrix with elements:

\[K_{nm} = \varphi(\mathbf{x_n})^T\varphi(\mathbf{x_m})=k(\mathbf{x}_n,\mathbf{x}_m)\]

So, given $N$ samples, the Gram Matrix is the matrix of all pairwise inner products:

\[K = \begin{bmatrix} k(\mathbf{x}_1,\mathbf{x}_1) & \dots & k(\mathbf{x}_1,\mathbf{x}_N) \\ \vdots &\ddots & \vdots\\ k(\mathbf{x}_N,\mathbf{x}_1) & \dots& k(\mathbf{x}_N,\mathbf{x}_N) \end{bmatrix}\]
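
Here is a minimal sketch of building the Gram Matrix from an arbitrary kernel function (the Gaussian kernel and the toy data below are just placeholder choices):

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram_matrix(X, kernel):
    """K[n, m] = k(x_n, x_m) for all pairs of rows of X."""
    N = X.shape[0]
    K = np.empty((N, N))
    for n in range(N):
        for m in range(N):
            K[n, m] = kernel(X[n], X[m])
    return K

X = np.random.default_rng(0).normal(size=(4, 2))   # N = 4 samples in R^2
K = gram_matrix(X, rbf_kernel)
print(K.shape)               # (4, 4)
print(np.allclose(K, K.T))   # True: the Gram Matrix is symmetric
```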

This will come in handy in a few seconds…

If we substitute $\mathbf{w} = \Phi^T\mathbf{a}$ into $L_{\mathbf{w}}$ we get

\[L_{\mathbf{w}} = \frac{1}{2}\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w}\] \[L_{\mathbf{w}} = \frac{1}{2}\left(\mathbf{\Phi\Phi^T\mathbf{a}}-\mathbf{t}\right)^T\left(\mathbf{\Phi\Phi^T\mathbf{a}}-\mathbf{t}\right)+\frac{\lambda}{2}\left(\Phi^T\mathbf{a}\right)^T\left(\Phi^T\mathbf{a}\right)\] \[L_{\mathbf{a}} = \frac{1}{2}\mathbf{a}^T\Phi\Phi^T\Phi\Phi^T\mathbf{a}-\mathbf{a}^T\Phi\Phi^T\mathbf{t}+\frac{1}{2}\mathbf{t}^T\mathbf{t}+\frac{\lambda}{2}\mathbf{a}^T\Phi\Phi^T\mathbf{a}\]

Guess what? We can rewrite the loss function in terms of the Gram Matrix!

\[L_{\mathbf{a}} = \frac{1}{2}\mathbf{a}^TKK\mathbf{a}-\mathbf{a}^TK\mathbf{t}+\frac{1}{2}\mathbf{t}^T\mathbf{t}+\frac{\lambda}{2}\mathbf{a}^TK\mathbf{a}\]

By combining $\mathbf{w} = \Phi^T\mathbf{a}$ and $a_n = -\frac{1}{\lambda}(\mathbf{w}^T\varphi(\mathbf{x}_n)-t_n)$, setting the gradient of $L_{\mathbf{a}}$ w.r.t. $\mathbf{a}$ equal to $0$ and isolating $\mathbf{a}$, we obtain:

\[\mathbf{a}=(K+\lambda\mathbf{I}_N)^{-1}\mathbf{t}\]

Where $\mathbf{I}_N$ is the identity matrix of dimension $N$. Consider that $K$ is $N\times N$ and $\mathbf{t}$ is $N\times 1$, so $\mathbf{a}$ is $N \times 1$.
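
Spelling out the step we just skipped: differentiating $L_{\mathbf{a}}$ and using the symmetry of $K$ gives

\[\frac{\partial L_{\mathbf{a}}}{\partial \mathbf{a}} = KK\mathbf{a}-K\mathbf{t}+\lambda K\mathbf{a} = K\left((K+\lambda\mathbf{I}_N)\mathbf{a}-\mathbf{t}\right) = 0\]

which is satisfied by $(K+\lambda\mathbf{I}_N)\mathbf{a}=\mathbf{t}$, i.e. $\mathbf{a}=(K+\lambda\mathbf{I}_N)^{-1}\mathbf{t}$ (note that $K+\lambda\mathbf{I}_N$ is invertible for $\lambda>0$ whenever $K$ is positive semi-definite).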

So we can make our prediction for a new input $\mathbf{x}$ by substituting back into our linear regression model:

\[y(\mathbf{x}) = \mathbf{w}^T\varphi(\mathbf{x}) = (\Phi^T\mathbf{a})^T\varphi(\mathbf{x}) = \mathbf{a}^T\Phi\varphi(\mathbf{x})= \mathbf{k}(\mathbf{x})^T(K+\lambda\mathbf{I}_N)^{-1}\mathbf{t}\]

where $\mathbf{k}(\mathbf{x})$ is an $N$-dimensional column vector with elements $k_n(\mathbf{x}) = k(\mathbf{x}_n,\mathbf{x})$.

The good thing is that instead of inverting an $M\times M$ matrix (where $M$ is the dimensionality of the feature space), we are inverting an $N\times N$ matrix! This allows us to work with feature spaces of very high, or even infinite, dimensionality.
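
Putting everything together, here is a minimal kernel ridge regression sketch; the Gaussian kernel, the toy 1-D dataset, and the value of $\lambda$ are arbitrary choices for illustration:

```python
import numpy as np

def rbf_kernel(x, y, gamma=10.0):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Toy 1-D regression data: t = sin(2*pi*x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 1))                    # N = 30 training inputs
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)

# Gram matrix K and dual coefficients a = (K + lambda*I)^-1 t
N, lam = X.shape[0], 0.1
K = np.array([[rbf_kernel(xn, xm) for xm in X] for xn in X])
a = np.linalg.solve(K + lam * np.eye(N), t)

def predict(x_new):
    """y(x) = k(x)^T (K + lambda*I)^-1 t = k(x)^T a."""
    k_vec = np.array([rbf_kernel(xn, x_new) for xn in X])
    return k_vec @ a

print(predict(np.array([0.25])))   # should be close to sin(pi/2) = 1
```

Note that the sketch never constructs the feature vectors $\varphi(\mathbf{x})$ (which would be infinite-dimensional for the Gaussian kernel): every quantity it needs is a kernel evaluation.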


But how can we build a valid kernel?

We have mainly two ways to do it:

  • By construction: we choose a feature space mapping $\varphi(\mathbf{x})$ and use it to find the corresponding kernel.

  • It is possible to test whether a function is a valid kernel without having to construct the basis function explicitly. The necessary and sufficient condition for a function $k(\mathbf{x},\mathbf{x}')$ to be a kernel is that the Gram matrix $K$ is positive semi-definite for all possible choices of the set $\{\mathbf{x}_n\}$. This means that $\mathbf{v}^TK\mathbf{v}\ge 0$ for any real vector $\mathbf{v}$, i.e. $\sum_n\sum_m K_{nm}v_nv_m \ge 0$ for any real numbers $v_n,v_m$ (a quick numerical check of this condition is sketched right after this list).

    $\implies$ Mercer's Theorem: any continuous, symmetric, positive semi-definite kernel function $k(\mathbf{x},\mathbf{y})$ can be expressed as a dot product in a high-dimensional space.
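
As promised, here is a quick numerical check of the second criterion: a sketch that builds the Gram matrix for a candidate kernel on some random points and inspects its eigenvalues (the kernel and the points below are arbitrary, and this is only evidence rather than a proof, since the condition must hold for all possible sets $\{\mathbf{x}_n\}$):

```python
import numpy as np

def candidate_kernel(x, y):
    """Candidate kernel to test: k(x, y) = (1 + x^T y)^2."""
    return (1.0 + x @ y) ** 2

# Build the Gram matrix on a random set of points
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = np.array([[candidate_kernel(xn, xm) for xm in X] for xn in X])

# K must be symmetric with non-negative eigenvalues (up to numerical noise)
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T))       # True
print(eigvals.min() >= -1e-9)    # True for a valid kernel
```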

New kernels can be constructed from simpler kernels as building blocks; given valid kernels $k_1(\mathbf{x},\mathbf{x}')$ and $k_2(\mathbf{x},\mathbf{x}')$, the following new kernels will also be valid (a small numerical sketch follows the list):

  • $k(\mathbf{x},\mathbf{x}')=c \cdot k_1(\mathbf{x},\mathbf{x}')$, where $c>0$ is a constant.
  • $k(\mathbf{x},\mathbf{x}')=f(\mathbf{x})\cdot k_1(\mathbf{x},\mathbf{x}')\cdot f(\mathbf{x}')$, where $f(\cdot)$ is any function.
  • $k(\mathbf{x},\mathbf{x}')=q\left(k_1(\mathbf{x},\mathbf{x}')\right)$, where $q(\cdot)$ is a polynomial with non-negative coefficients.
  • $k(\mathbf{x},\mathbf{x}')=e^{k_1(\mathbf{x},\mathbf{x}')}$
  • $k(\mathbf{x},\mathbf{x}')=k_1(\mathbf{x},\mathbf{x}')+k_2(\mathbf{x},\mathbf{x}')$
  • $k(\mathbf{x},\mathbf{x}')=k_1(\mathbf{x},\mathbf{x}')k_2(\mathbf{x},\mathbf{x}')$
  • $\dots$
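
As a small sketch of these closure rules (the two base kernels and the test points are arbitrary), both the sum and the product of two valid kernels again give a positive semi-definite Gram matrix:

```python
import numpy as np

def k1(x, y):
    return x @ y                             # linear kernel

def k2(x, y):
    return np.exp(-np.sum((x - y) ** 2))     # Gaussian kernel with gamma = 1

def k_sum(x, y):
    return k1(x, y) + k2(x, y)               # rule: k1 + k2 is a valid kernel

def k_prod(x, y):
    return k1(x, y) * k2(x, y)               # rule: k1 * k2 is a valid kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))

for k in (k_sum, k_prod):
    K = np.array([[k(xn, xm) for xm in X] for xn in X])
    print(np.linalg.eigvalsh(K).min() >= -1e-9)   # True: PSD up to numerical noise
```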