## Gaussian Processes and Machine Learning

Table of Contents

Code and figures: **https://github.com/imadelh/ML-Gaussian-Process**

# Introduction

Consider a training dataset $ D = (\mathcal{X}, \mathcal{Y}) = \lbrace (x_n, y_n) \rbrace _{n=1}^{N} $ of inputs $x_n$ and corresponding noisy observations (targets) $y_n = f(x_n) + \epsilon_n$, where $\epsilon_i$ are independent and identically distributed (i.i.d) random variables that represent the measurement noise and $f$ is an unknown function that describes the relationship between the inputs and the targets. We are therefore interested in inferring the function $f$ that generated the dataset $D$ and generalizes well to new inputs locations $x^{*}$.

# Bayesian Regression - Weights space view

**Maximum Likelihood Estimation**

The relationship between inputs and targets can be described by the conditional density of the targets given the inputs $p(y|x)$. One possible approach to estimate the conditional density $p$ given the training dataset $D$ is to build a parametric approximation $\hat p _{\theta}$ that belongs to a family of parametric distributions $\mathcal F = \lbrace \hat p _{\theta}(.) / \theta \in \Theta \rbrace$. The objective here is to find the parameter $\theta^{*}$ that would allow us to approximate and be as close as possible to the real distribution $p$.

A common criterion is to minimize the Kullback-Leibler divergence $D _{KL}$ between $\hat p _{\theta}$ and $p$ with respect to $\theta$. The Kullback-Leibler divergence is defined as follows:

$$ D _{KL}(p||\hat p _{\theta}) = \mathbb E _{y \sim p(y|x)} \big[ \log \frac{p(y|x)}{\hat p _{\theta}(y|x)} \big]. $$

By using the difference of logs property and additivity of the expectation, $D _{KL}$ can be written as

$$ D _{KL}(p||\hat p _{\theta}) = \mathbb E _{y \sim p(y|x)} \big[ \log p(y|x)\big] - \mathbb E _{y \sim p(y|x)} \big[ \log \hat p _{\theta}(y|x)\big]. $$

The first term $\mathbb E _{y \sim p(y|x)} \big[ \log p(y|x)\big] $ is the negative entropy and it is independent from $\theta$, thus we will ignore it the minimization process. The second term $- \mathbb E _{y \sim p(y|x)} \big[ \log \hat p _{\theta}(y|x)\big]$ is unknown (because we can’t integrate over $y$ that is coming from an unknown distribution $p$). But using the fact that we have a training dataset $D$ where $y_i$ are sampled from $p$, we can approximate this second term (using the **strong law of large numbers**) as

$$ -\mathbb E _{y \sim p(y|x)} \big[ \log \hat p _{\theta}(y|x)\big] \simeq - \frac{1}{N} \sum _{i=1}^{N} \log(\hat p _{\theta}(y _i|x _i)). $$

Using this approximation, we can write the Kullback-Leibler divergence as

$$ D _{KL}(p||\hat p _{\theta}) \simeq cte - \frac{1}{N} \sum _{i=1}^{N} \log(\hat p _{\theta}(y _i|x _i)). $$

Therefore, minimizing $D _{KL}(p||\hat p _{\theta})$ with respect to $\theta$ is equivalent to maximize the log-likelihood of the data and $\theta^*$ can be expressed as

$$ \theta^* \in \underset{\theta \in \Theta}{\argmax} \sum _{i=1}^{N} \log(\hat p _{\theta}(y_i|x_i)). $$

(More details are available in the conditional density estimation post).

*Example of linear regression*

Linear regression is a special case where the family of parametric distributions used to approximate the conditional density $p(y|x)$ is a Gaussian distribution.

The model (conditional density) is given by

$$ p _{\theta}(y_i|x_i) = \mathcal{N}(y_i | f _{\theta}(x_i) , \beta^{-1}), $$

where $ f _{\theta}(x) = \sum _{j=1}^{K} {\theta_j \phi_j(x)} = \theta^T \phi(x) $. $\phi_j(x)$ are known as *basis functions* and $(\theta, \beta)$ are the parameters of the regression.

With this model, the functional relationship between $x$ and $y$ is given as

$$ y = f _{\theta}(x) + \epsilon, $$

where $\epsilon \sim \mathcal{N}(0, \beta^{-1})$ is a Gaussian noise with mean $0$ and variance $\beta^{-1}$.

We can determine the parameters $(\theta, \beta)$ by maximizing the log-likelihood function, as explained in the previous section.

The log-likelihood is a function of the adjustable parameters $(\theta, \beta)$, in the form

$$ \log(\hat p _{\theta, \beta}(\mathcal{Y}|\mathcal{X})) = \sum _{i=1}^{N} \mathcal{N}(y_i | f _{\theta}(x_i) , \beta^{-1}) = \frac{N}{2}\log(\beta) - \frac{N}{2}\log(2\pi) - \beta L(\theta), $$

where the sum of squares error function $L$ is defined by $$ L(\theta) = \frac{1}{2} \sum _{i=1}^{N} (y _i - \theta^T \phi(x _i)). $$

We refer the reader to Bishop’s PRML book (page 141) ^{1} for further computations.

**Regularization - Priors**

Using the maximum likelihood estimation to estimate the regression parameters may result in overfitting the data. We observe that the magnitude of the parameters’ values becomes large in the case of overfitting.

To avoid the effect of huge parameter values, we place a prior $p(\theta)$ on the parameters. The prior distribution represents the possible values that the parameters can have (before we see the data).

Given a training dataset $D = (\mathcal{X}, \mathcal{Y})$, we seek for parameters that maximize the posterior distribution $p(\theta | \mathcal{X}, \mathcal{Y}) $ instead of maximizing the likelihood.

The posterior over the parameters $\theta$ is obtained by applying Bayes’ theorem

$$ p(\theta | \mathcal{X} , \mathcal{Y} ) = \frac{ p(\mathcal{Y} | \mathcal{X}, \theta) p(\theta) } {\int_{\theta}{p(\mathcal{Y} | \mathcal{X}, \theta) p(\theta) d\theta}} \propto \text{Likelihood} \times \text{Prior}. $$

The parameter vector $\theta_{MAP}$ that maximizes the posterior is called the *maximum a posteriori* (MAP) estimate. To find the MAP estimate we follow a similar approach as in the case of maximum likelihood by taking the log-transform

$$ \log{p(\theta | \mathcal{X} , \mathcal{Y} ) } = \log{p(\mathcal{Y} | \mathcal{X}, \theta )} + \log{p(\theta)} + cte. $$

The log-posterior is the sum of the data-dependent log-likelihood and the log-prior. $\theta_{MAP}$ is obtained by solving the following optimization problem

$$ \theta _{MAP} \in \underset{\theta \in \Theta}{\argmax} (\log{p(\mathcal{Y} | \mathcal{X}, \theta )} + \log{p(\theta)}). $$

If the prior distribution on $\theta$ is a Gaussian distribution of the form $p(\theta) = \mathcal N(0,\lambda^{-1} I)$, then $\log p(\theta) = \lambda || \theta || _2^{2}+ cte$, which is simply the L2 regularization of the maximum likelihood estimation

If all elements of $\theta$ follows a zero-mean Laplace distribution^{2} $p(\theta _i) = Laplace(0,b) =\frac{1}{2b} \exp(\frac{|\theta _i|}{b})$, then $\log p(\theta) = 1/b || \theta || _1+ cte$ which is the L1 regularization.

**Fully Bayesian treatment of regression**^{3}

The Bayesian regression extends the idea of using priors a step further. Instead of computing a point estimate of the parameters $\theta$, we search for the full posterior distribution over the parameters after observing the training dataset. This posterior distribution will be taken into account when making predictions on new observation $x^{*}$.

In the Bayesian regression, we consider the following model:

- Prior over parameters: $ \theta \sim p(\theta) $
- Likelihood model: $y \sim p_{\theta}(y | x) = p(y | x, \theta)$

For a new observation $x^{*}$, we would like to compute the posterior predictive distribution over targets $p(y^{*} | \mathcal{X}, \mathcal{Y}, x^{*}) $ given the training dataset $(\mathcal{X}, \mathcal{Y})$ and the observation $x^{*}$

$$ p(y^{*} | \mathcal{X}, \mathcal{Y}, x^{*}) = \int _{\theta} p(y^{*} | x^{*}, \theta) p(\theta | \mathcal{X}, \mathcal{Y}) d\theta = \mathbb{E} _{\theta} [ p(y _{*} | x _{*}, \theta) ].$$

The first term inside the integral $p(y^{*} | x^{*}, \theta)$ is defined by the likelihood model and the second term is the posterior distribution over parameters after observing the training dataset. From Bayes’ rule, we have $\text{posterior}\propto \text{likelihood} \times \text{prior}$

$$ p(\theta | \mathcal{X} , \mathcal{Y} ) \propto p(\mathcal{Y} | \mathcal{X}, \theta) p(\theta). $$

In the case of Conjugate priors^{4}, the posterior distribution $p(\theta | \mathcal{X} , \mathcal{Y} )$ can be computed analytically, and we can therefore estimate the posterior predictive distribution $p(y^{*} | \mathcal{X}, \mathcal{Y}, x^{*}) $.

In the case of non-conjugate priors, we don’t have a closed-form for the posterior, and we need to do some approximations or Monte Carlo methods to get samples from the posterior distribution (see section ‘Approximations’).
Markov chain Monte Carlo (MCMC)^{5} methods allow sampling from a posterior distribution knowing a function proportional to its density function, in this case, we have access to $p(\mathcal{Y} | \mathcal{X}, \theta) p(\theta)$ which is proportional to the unknown density function of the posterior distribution.

**From Bayesian Regression to Gaussian Processes**

In the case of Bayesian regression, we have a parametric relationship between observations $x$ and targets $y$

$$ y(x) = f _{\theta}(x) + \epsilon \; (*),$$

where $\theta$ is sampled from a prior distribution $\theta \sim p(\theta)$, and the relationship (*) is corrupted by a measurement noise $\epsilon$.

For any given value of $\theta$, the relationship (*) defines a particular function of $x$. The probability distribution over $\theta$ induces a probability distribution over functions $y(x)$.

To further illustrate this, let us look at the example of Bayesian linear regression with $K$ basis functions (features) where we omit the noise term for simplicity.

$$ y(x) = f _{\theta}(x) = \sum _{j=1}^{K} {\theta_j \phi_j(x)} = \theta^T \phi(x) \; (**) $$

with a prior distribution $ \theta \sim \mathcal{N}(0, \alpha^{-1} I) $ governed by the hyper-parameter $\alpha$.

We wish to evaluate this function at specific values of $x$. For example, for N training observations $(x_1, …, x_N)$, we are interested in the joint distribution of the function values $y(x _1),…,y(x _N)$ denoted as $\bold{y}$. From $(**)$, $\bold{y}$ is given by $$ \bold{y} = \Phi \theta,$$

where $\Phi$ is the *design* matrix with elements $\phi _{nk} = \phi _{k}(x _n)$ and $\theta$ a vector of $K$ parameters. $\bold{y}$ is a linear combination of Gaussian distributed variables given by $\theta$, therefore $\bold{y}$ follows a Gaussian distribution and we need only to find its mean and covariance, which are given as follows:

$$ \mathbb{E}(\bold{y}) = \Phi \mathbb{E}(\theta) = 0$$

$$ \text{cov}(\bold{y}) = \mathbb{E}(\bold{y} \bold{y}^T) = \Phi \mathbb{E}(\theta \theta^T) \Phi^T = \frac{1}{\alpha} \Phi \Phi^T $$

This model provides a particular example of a Gaussian process. As we will see in the following section, a Gaussian process is defined as a probability distribution over functions $y(x)$ such that the set of values of $y(x)$ evaluated at an arbitrary set of points $(x _1,…, x _N)$ jointly have a Gaussian distribution. We can think of a Gaussian process as defining a distribution over functions, and inference is taking place directly in the space of functions.

# Gaussian Process - Functions space view

**Definition**^{6}

A Gaussian process is defined as a probability distribution over functions $f(x)$ such that the set of values $(f(x_1)…f(x_n))$ evaluated at arbitrary points jointly have a Gaussian distribution. We write

$$ f(x) \sim GP(m(x), K(x,x \prime)). $$

For any finite set of points $X = (x_1, …, x_n)$, this process defines a joint Gaussian such that

$$ p(\bold{f}|X) = \mathcal{N}(\bold{f}|\mu, K), $$

where $\mu = (m(x_1),…, m(x_n))$ is the mean function and $K$ is the kernel or covariance function such that $K _{ij} = K(x _i,x _j)$.

**Fully Bayesian Gaussian Process Regression**

Consider a dataset of observations $ D = (\mathcal{X}, \mathcal{Y}) = \lbrace (x_n, y_n) \rbrace _{n=1}^{N} $, $y_i$ are noisy realizations of some latent function $f$ corrupted with a gaussian noise $y_i = f_i + \epsilon_i$, where $f_i = f(x_i)$ and the noise $\epsilon_i$ are independent identically distributed.

We consider the general framework where the hyperparameters of the Kernel function are sampled from a given distribution.
This is known as **hierarchical Gaussian Process regression** and it has the following assumptions:

- $\bold{f}$ is a Gaussian Process with a kernel $K_{w} = K_w(\mathcal{X},\mathcal{X}) = K_w(x_i,x_j) _{i,j \in {1, .., N}}$ parametrized with a hyperparameter $w$

$$w \sim p(w)$$

$$ \bold{f}|\mathcal{X} \sim GP(0, K_{w})$$

- The likelihood of the data follows a Gaussian distribution

$$\mathcal{Y}|\bold{f} \sim \mathcal{N}(\bold{f}, \sigma^2 \mathbb{I})$$

To simplify the notation, let us represent the hyperparameters $(w, \sigma)$ by $\xi$.

It follows from the independence assumption on noise that $\text{cov}(\mathcal{Y}) = K_w(\mathcal{X},\mathcal{X}) + \sigma^2 \mathbb{I}$ and $\mathbb{E}(\mathcal{Y}) = 0$

Let us consider a set of test locations $\mathcal{X _{*}} = (x^{*} _{1}, …, x^{*} _{p})$. The predictive posterior distribution of the function values at test locations $\bold{f} _{*} = (f(x^{*} _{1}), …, f(x^{*} _{p}))$ can be expressed as

$$ p(\bold{f_{*}}|\mathcal{Y}) = \int{p(\bold{f _{*}} | \mathcal{Y}, \xi) p(\xi| \mathcal{Y}) d\xi }, $$

where we have omitted the conditioning over $\mathcal{X}, \mathcal{X} _{*}$ for brevity.

From the assumptions of the Gaussian Process regression, the joint distributions of the observed target values $\mathcal{Y}$ and the function values at test locations $\bold{f} _{*} = (f(x^{*} _{1}), …, f(x^{*} _{p}))$ is given by

$$ \left[\begin{array} {rrr} \mathcal{Y} \\ \bold{f} _{*} \end{array}\right] \sim \mathcal{N}(\bold{0} , \left[\begin{array} {rrr} K_w(\mathcal{X},\mathcal{X}) + \sigma^2 \mathbb{I} & K_w(\mathcal{X},\mathcal{X _{*}}) \\ K_w(\mathcal{X _{*}},\mathcal{X}) & K_w(\mathcal{X _{*}},\mathcal{X _{*}}) \end{array}\right] ).$$

Given the joint distribution of $(\mathcal{Y}, \bold{f} _{*})$, we can write the conditional distribution of $\bold{f} _{*} $ given $\mathcal{Y}$ for a fixed value of the hyperparameters $\xi$ as

$$ p(\bold{f _{*}} | \mathcal{Y}, \xi) = \mathcal{N}(\mu^{*}, \Sigma^{*}), $$

where $\mu^{*} = K_w(\mathcal{X _{*}},\mathcal{X}) [K_w(\mathcal{X},\mathcal{X}) + \sigma^2 \mathbb{I}]^{-1} \mathcal{Y} $

and $ \Sigma^{*} = K_w(\mathcal{X _{*}},\mathcal{X _{*}}) - K_w(\mathcal{X _{*}},\mathcal{X}) [K_w(\mathcal{X},\mathcal{X}) + \sigma^2 \mathbb{I}]^{-1} K_w(\mathcal{X},\mathcal{X _{*}})$.

(See section A.2 in ^{7} for the proof details).

The hierarchical predictive posterior is reduced to (using Monte-Carlo approximation of the integral) $$ p(\bold{f_{*}}|\mathcal{Y}) = \int{p(\bold{f _{*}} | \mathcal{Y}, \xi) p(\xi| \mathcal{Y}) d\xi } \simeq \frac{1}{M} \sum _{j=1}^{M} p(\bold{f _{*}} | \mathcal{Y}, \xi_j) \text{ where } \xi_j \sim p(\xi|\mathcal{Y}). $$

$ \xi _i$ are drawn from the hyperparameter posterior distribution. This posterior $ p(\xi|\mathcal{Y}) $ is intractable but we can get samples using a Monte Carlo sampling method (see next section). In the case where we consider the hyperparameters are taking fixed values, we can find the optimal values of the hyperparameters by maximizing the marginal log-likelihood (marginalized over f) $log(p(\mathcal{Y} | \mathcal{X}) = \log(\mathcal{N}(0, K_w + \sigma^2 \mathbb{I}))$ with respect to $w$ and $\sigma$.

# Approximations - Estimation of the hyperparameters

From the previous section, we expressed the posterior predictive function as

$$ p(\bold{f_{*}}|\mathcal{Y}) = \int{p(\bold{f _{*}} | \mathcal{Y}, \xi) p(\xi| \mathcal{Y}) d\xi }, $$

where $p(\bold{f _{*}} | \mathcal{Y}, \xi) = \mathcal{N}(\mu^{*}, \Sigma^{*})$ and $p(\xi| \mathcal{Y}) \propto p(\mathcal{Y} | \xi) p(\xi)$ is the posterior distribution of the hyperparameters after observing the training data. The parameters $(\mu^{*}, \Sigma^{*})$ depend on the hyperparameter $\xi$ (see the closed form in the previous section).

The prior $p(\xi)$ is known from the Gaussian Process regression assumptions and the marginal likelihood (marginalized over f) is given by $p(\mathcal{Y} | \mathcal{X}, \xi) = \mathcal{N}(\mathbb{0}, K_w(\mathcal{X},\mathcal{X}) + \sigma^2 \mathbb{I})$ where $\xi = (w, \sigma)$.

- Fixed hyperparameters

In the case where the hyperparameters (trainable parameters of the Gaussian process regression) take fixed unknown values, we can simply infer them from the training data by maximizing the marginal log-likelihood $\log(p(\mathcal{Y} | \mathcal{X}, \xi))$

$$ \xi^{*} \in \underset{\xi }{\argmax} (\log{p(\mathcal{Y} | \mathcal{X}, \xi )}), $$

and the posterior predictive distribution is given by

$$ p(\bold{f_{*}}|\mathcal{Y}) =p(\bold{f _{*}} | \mathcal{Y}, \xi^{*}). $$

- Distribution over hyperparameters

In the case of hierarchical Gaussian Process regression, the predictive posterior distribution is approximated by

$$ p(\bold{f_{*}}|\mathcal{Y}) = \int{p(\bold{f _{*}} | \mathcal{Y}, \xi) p(\xi| \mathcal{Y}) d\xi } \simeq \frac{1}{M} \sum _{j=1}^{M} p(\bold{f _{*}} | \mathcal{Y}, \xi_j) \text{ where } \xi_j \sim p(\xi|\mathcal{Y}). $$

The posterior distribution of the hyperparameters is proportional to the prior distribution multiplied by the marginal likelihood

$$ p(\xi|\mathcal{Y}) \propto p(\mathcal{Y}|\xi)p(\xi), $$

where $p(\mathcal{Y}|\xi) = \mathcal{N}(\mathbb{0}, K_w + \sigma^2 \mathbb{I})$ and $p(\xi)$ is the prior over $\xi$.

*Sampling methods* like Markov chain Monte Carlo methods create samples from a random variable with a probability density function proportional to a known function. In our case, we can have access to $p(\mathcal{Y}|\xi)p(\xi)$ which is proportional to the posterior distribution over $\xi$ from where we want to get samples.
Another family of approximate algorithms called *variational inference methods* formulate inference as an optimization problem.

We refer the reader to this article for more details about sampling methods: https://ermongroup.github.io/cs228-notes/inference/sampling/.

Numpyro https://github.com/pyro-ppl/numpyro and other probabilistic programming libraries (PyMC, Pyro, TF Probability,…) gives easy access to different sampling methods. The user needs to define a model (the likelihood) and the prior of its parameters and then select a sampling method. Hereunder an example of the MCMC sampling function (source code: inference/mcmc.py).

```
from numpyro.infer import MCMC, NUTS, HMC
import time
# Helper function for doing NUTS inference
def mcmc_inference(model, num_warmup, num_samples, num_chains, rng_key, X, Y):
""""
:param model: a parametric function proportional to the posterior (see gp_regression.likelihood).
:param num_warmup: warmup steps.
:param num_samples: number of samples.
:param num_chains: number of Markov chains used for MCMC sampling.
:param rng_key: random seed.
:param X: X data.
:param Y: Y data.
:return: Dictionary key: name of parameter (from defined in model), value: list of samples.
"""
start = time.time()
kernel = NUTS(model)
mcmc = MCMC(kernel, num_warmup, num_samples, num_chains=num_chains)
mcmc.run(rng_key, X, Y)
print('\nMCMC time:', time.time() - start)
print(mcmc.print_summary())
return mcmc.get_samples()
```

# Conclusion and Limitations

In this post, we explored Gaussian processes (GPs) for regression and their links to the Bayesian regression. GPs are flexible non-parametric models. However, one of the main limitations of GPs is the computational complexity of $\mathcal{O}(N^3)$ due to the need to invert an $N \times N$ matrix.
Many approximation methods have been introduced to solve the scalability problem of Gaussian processes on large datasets. A recent paper ^{8} in NeurIPS suggested a new approach to train exact GP on large datasets using a recent inference procedure called *Blackbox Matrix-Matrix multiplication*.

- PRML book - C. Bishop 2006 https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
^{[return]} - Laplace distribution https://en.wikipedia.org/wiki/Laplace_distribution
^{[return]} - Mathematics for Machine Learning - https://mml-book.github.io/
^{[return]} - Conjugate priors - https://en.wikipedia.org/wiki/Conjugate_prior
^{[return]} - Markov chain Monte Carlo (MCMC) - https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo
^{[return]} - Machine Learning: A Probabilistic Perspective - K. Murphy 2012
^{[return]} - Gaussian Processes for Machine Learning - http://www.gaussianprocess.org/gpml/chapters/
^{[return]} - Gaussian Processes on a million data points http://papers.nips.cc/paper/9606-exact-gaussian-processes-on-a-million-data-points.pdf
^{[return]}