# Conditional Density Estimation with Neural Networks

## Introduction

In the context of supervised learning, our goal is to capture a relationship between some observations $\bold{x} = \lbrace x _i \rbrace _{i=1}^{N}$ and a dependent variable (output) $\bold{y} = \lbrace y _i \rbrace _{i=1}^{N}$ coming from a pair of random variables $(X,Y)$. Regression analysis aims to build this relationship by estimating the conditional mean of $Y$ given input $X$ such that $\mathbb{E}(Y/X) = f(X, \beta)$, where $f$ is a given function and $\beta$ is the parameter to be estimated from observations $\bold{x}$ and $\bold{y}$. However, estimating the conditional expectation may not be informative especially when the conditional distribution has multi-modality or highly asymmetric. The full description of the dependency of $Y$ on $X$ can be described by estimating the conditional density of $Y$ given $X$, written as $p(y|x)$. Estimating $p(y|x)$ from a set of observations $D = \lbrace (x _i, y _i) \rbrace _{i=1}^{N}$ is referred as conditional density estimation (CDE).

In this post I will briefly describe density estimation methods and the link with conditional density estimation, then I will introduce Mixture Density Networks that is based on using Neural Networks to represent parameters of the conditional density estimation. Finally I will show some experimental results of CDE on simulated data (implemented in PyTorch and served via Bokeh).

## Density Estimation Background

Let $X$ be a random variable with an unknown probability density function $p(x)$ over a domain $\mathcal X$. Given a collection of observations $D = \lbrace x _1, …,x _N \rbrace$, our goal is to find a “good” estimation $\hat p(x)$ of real density $p$. Density estimation aims to find the best $\hat p$ among all possible probability density functions on $\mathcal X$. This is not feasible in practice due to finite number of observations and we have to limit space of search. Hereunder we describe two families of methods for density estimation (in the case of unique variable or a conditional density): parametric and non-parametric density estimation.

### Parametric Estimation for CDE

• Density estimation of one variable

For parametric estimation, we suppose that the probability density function $\hat p$ belongs to a family of parametric distributions $\mathcal F = \lbrace \hat p _{\theta}(.) / \theta \in \Theta \rbrace$ where $\theta \in \Theta$ is a finite dimensional parameter. The objective here is to find the parameter $\theta^*$ that would allow us to approximate and be as close as possible to the real distribution $p$. A common criterion is to minimize the Kullback-Leibler divergence $D _{KL}$ between $\hat p _{\theta}$ and $p$ with respect to $\theta$. The Kullback-Leibler divergence is defined as follows

$$D _{KL}(p||\hat p _{\theta}) = \mathbb E _{x \sim p(x)} \big[ \log \frac{p(x)}{\hat p _{\theta}(x)} \big]$$

By using difference of logs property and additivity of the expectation, this can be written as

$$D _{KL}(p||\hat p _{\theta}) = \mathbb E _{x \sim p(x)} \big[ \log p(x)\big] - \mathbb E _{x \sim p(x)} \big[ \log \hat p _{\theta}(x)\big]$$

The first term $\mathbb E _{x \sim p(x)} \big[ \log p(x)\big]$ is the negative entropy and it is independant from $\theta$, thus we will ignore it the minimization process. The second term $- \mathbb E _{x \sim p(x)} \big[ \log \hat p _{\theta}(x)\big]$ is unknown (because we can’t integrate over $x$ that is coming from an unknown distribution $p$). But using the fact that we have some observations $D = \lbrace x _1, …,x _N \rbrace$ sampled from $p$, we can approximate this second term (using strong law of large numbers) as

$$-\mathbb E _{x \sim p(x)} \big[ \log \hat p _{\theta}(x)\big] \simeq - \frac{1}{N} \sum _{i=1}^{N} \log(\hat p _{\theta}(x _i))$$

One should note that this is true only if $N$ goes to infinity (large number of observations). By using this, an approximation of the Kullback-Leibler divergence $D _{KL}(p||\hat p _{\theta})$ can be written as

$$D _{KL}(p||\hat p _{\theta}) \simeq cte - \frac{1}{N} \sum _{i=1}^{N} \log(\hat p _{\theta}(x _i))$$

Therefore, if we minimize $D _{KL}(p||\hat p _{\theta})$ with respect to $\theta$, it is equivalent to minimize the negative log-likelihood of observations $-\frac{1}{N} \sum _{i=1}^{N} \log(\hat p _{\theta}(x _i))$, in other words maximizing the likelihood. $\theta^*$ can be expressed as

$$\theta^* = \underset{\theta \in \Theta}{\argmax} \sum _{i=1}^{N} \log(\hat p _{\theta}(x _i))$$

We just showed that the Maximum Likelihood Estimation (MLE) is in fact equivalent to minimize a monte carlo approximation of the Kullback-Leibler divergence $D _{KL}(p||\hat p _{\theta})$.

To illustrate MLE, a dashboard available at https://nn-cde.imadelhanafi.com gives the ability to simulate data according to different distributions and then estimate the parameters of a distribution that we want to fit on the sampled points. • Conditional Density Estimation

In the case of conditional density estimation, let us consider two random variables $X$ and $Y$ where we would like to exhibit the relationship between these two variables. As explained in the introduction, the conditional probability density of $y$ given $x$, denoted as $p(y|x) = p(x,y)/p(x)$, gives a full description of the dependency of $Y$ on $X$.

Given the dataset $D = \lbrace (x _i, y _i) \rbrace _{i=1}^{N}$ of observations drawn from joint distribution $p(x,y)$, our goal is to find a parametric estimate $\hat p _{\theta}(y|x)$ of $p(y|x)$. Similar computations for minimizing the Kullback-Leibler divergence as in the case of the density of one random variable (see above) leads to the maximum likelihood estimation where $\theta^*$ is estimated as

$$\theta ^* = \underset{\theta \in \Theta}{ \argmax } \sum _{i=1}^{N} \log(\hat p _{\theta}(y _i |x _i))$$

The difficulty of parametric methods resides in choosing the right family of probability density function to approximate the unknown distribution, which is not a trivial task. Also the optimization problem can be very heavy from a computational point of view.

Example

Mixture of models is an example of parametric conditional density estimation. The CDE is written as weighted sum (weights sums to 1) of $K$ densities. The components’ parameters depend on the input $x$.

$$\hat p(y|x) = \sum _{i=1}^{K}\pi _i(x) \phi _i(y ; \xi _i(x))$$

The parameter to be estimated is $\theta = (\pi _i(x),\xi _i(x))$. This formulation is close to Gaussian Mixture Models (GMM)1 for one random variable $\hat p(x) = \sum _{i=1}^{K}\pi _i \mathcal N _i(x ; \xi _i)$ but the main difference is in parameters estimation. In the case of GMM, the parameters do not depend on the input and they are estimated using EM-algorithm for example, while in the case of mixture models for conditional density estimation, the parameters depend on the sampled observations $x$. A detailed example with simulations is given in section Implementation & Experiments.

### Non-Parametric Estimation

• Density estimation of one variable

As opposite to parametric estimation, non-parametric methods do not restrict the density estimations to an explicit family of densities. The most popular and well studied non-parametric method is the kernel density estimation (KDE). The idea behind KDE is to place a symmetric density function (kernel K) at each training data point $x _i$ that are sampled from the unknown density $p(x)$. For a set of i.i.d samples $D = \lbrace x _1,…,x _N\rbrace$, the kernel density estimation is

$$\hat p _h(x) = \frac{1}{N}\sum _{i=1}^N K _h (x - x _i) = \frac{1}{Nh} \sum _{i=1}^N K\Big(\frac{x-x _i}{h}\Big),$$

In the case of multidimensional data, the density function can be estimated as product of marginal kernel density estimates following previous formula on each dimension. The parameter $h$ is called the bandwidth and $K$ is the kernel.

In practice, the hyper-parameter $h$ can be selected using a cross-validation method or following Silverman’s rule2.

In the online dashboard https://nn-cde.imadelhanafi.com KDE tab allows to simulate data according to some given distributions and fit a KDE with a fixed bandwidth. There is also the option to do parameter search (on kernels and bandwidth) using cross-validation. • Conditional Density Estimation

Now given a dataset $D = \lbrace (x _i, y _i)\rbrace _{i=1}^{N}$ of observations coming from a pair of random variables ($X$ , $Y$). The non-parametric approach discussed in previous section can be extended to this case. We can use KDE to estimate the joint density $\hat p(x,y)$ and the marginal $\hat p(x)$, then the conditional density estimate can be expressed as the density ratio

$$\hat p(y|x) = \frac{\hat p(x,y)}{\hat p(x)}$$

The focus of this post is about the next section (Neural Density Estimation). Examples and simulations will be illustrated later.

## Neural Density Estimation

### Principle

As we have shown in the previous example about using Mixture Models as a parametric estimate of the conditional density. The CDE can be expressed as a mixture of $K$ models $$\hat p(y|x) = \sum _{i=1}^{K}\pi _i(x) \phi _i(y ; \xi _i(x))$$

Since the the parameters of densities $\xi _i(x)$ and the weights $\pi _i(x)$ depends on the input $x$, why not using a Neural Network to find a mapping3 between $x$ and $(\pi _i(x), \xi _i(x))$, such that the resulting mixture model is a “good” estimation of the real conditional density $p(y|x)$? This approach has been proposed by Bishop in 1994 and called Mixture Density Netwoks (MDN)4.

Let us take a concrete example and show how the training (finding the weights of the neural network) is done. We consider each density component in the mixture model to be a Gaussian density.

$$\hat p(y|x) = \sum _{i=1}^{K}\pi _i(x ; \theta ) \mathcal N(y | \mu _i(x ; \theta ), \sigma^2 _i(x ; \theta) )$$

Where $\pi _i(x ; \theta)$ is the weight, $\mu _i(x ;\theta)$ the mean and $\sigma^2 _i(x ; \theta )$ the diagonal covarianced of i-th Gaussian component. $\theta$ represents the trainable parameters of a Neural Network.

This is not an arbitrary choice: a mixture of Gaussian distributions can have the property of being a universal approximator of densities5, in the sense that with enough components it can approximate any smooth distribution. $\theta$ can be estimated by minimizing the Kulleback-Leibler divergence. For a set of observations $D = \lbrace ( x _i,y _i)\rbrace _{i=1}^{N}$ sampled independently from $(X,Y)$. This is equivalent to minimizing the negative log-likelihood (see proof in previous section). In other words, the loss function $L(\theta)$ for training the neural network would be equal to the log-likelihood of the data

$$L(\theta) = -\sum _{i=1}^{N} \log\Big( \sum _{j=1}^{K}\pi _j(x _i ; \theta ) \mathcal N(y _i | \mu _j(x _i ; \theta ), \sigma^2 _j(x _i ; \theta) )\Big)$$

At this stage, we can apply any known method for optimizing the parameters of a neural network: SGD, Adam, or recently Adam Rectified 6 to find an optimum. Implementation and examples will be illustrated in a later section.

### Regularization

Neural density estimation methods, like MDN are very flexible and make few assumptions on the unknown conditional density. This will results in over-fitting the observed data $(\bold{x},\bold{y}) =\lbrace (x _i,y _i) \rbrace _{i=1}^{N}$. An obvious way to reduce over-fitting is to use regularisation. Hereunder we will explain two generic regularization frameworks that can be used for MDN.

• Bayesian Regularization

Regularized maximum likelihood can be seen as a maximum posterior (MAP) approach (Bayesian approach) in which the regularizer can be interpreted as the log of a prior distribution on parameters.

Let us do the math! The maximum likelihood estimate (MLE) for CDE can be written as

$$\theta _{ML} = \underset{\theta \in \Theta}{\argmax} \hat p _{\theta}(\bold{y} |\bold{x} )$$

Maximum a-posteriori (MAP) adds a prior distribution on the parameter $\theta$, and maximize the posterior distribution. From Bayes’ rule we have $\text{posterior}\propto \text{likeelihood} \times \text{prior}$, then the MAP estimate for $\theta$ can be written as

$$\theta _{MAP} = \underset{\theta \in \Theta}{\argmax} \hat p _{\theta}(\bold{y} |\bold{x} )p(\theta)$$

Taking the log and using the fact that $(x _i,y _i)$ are i.i.d, this can be written as

$$\theta _{MAP} = \underset{\theta \in \Theta}{\argmax} \sum _{i=1}^{N}(\log\hat p _{\theta}(y _i |x _i )) + \log p(\theta)$$

Now if the prior distribution on $\theta$ is a Gaussian of the form $p(\theta) = \mathcal N(0,\lambda^{-1} I)$, then $\log p(\theta) = \lambda || \theta || _2^{2}+ cte$, which is simply the L2 regularization of the likelihood.

If all elements of $\theta$ follows a zero-mean Laplace distribution7 $p(\theta _i) = Laplace(0,b) =\frac{1}{2b} \exp(\frac{|\theta _i|}{b})$, then $\log p(\theta) = 1/b || \theta || _1+ cte$ which is an L1 regularization.

• Noise Regularization

Adding noise to the training data can be seen as a form of regularization8 9 since it would help to have smoother functions and avoid overfitting to the observed data. Let us introduce two gaussian noise variables $\xi _x$ and $\xi _y$, such that $$\xi _x \sim \mathcal N(0,\nu _x I) \text{ and } \xi _y \sim \mathcal N(0,\nu _y I)$$

Then the perturbed new data would be

$$\tilde \bold x = \bold x + \xi _x \text{ and } \tilde \bold y = \bold y + \xi _y$$

When we want to do a maximum likelihood estimation of conditional density, the loss function (negative log-likelihood) to minimize for a dataset $D = (\bold x, \bold y) =\lbrace (x _i,y _i)\rbrace _{i=1}^{N}$ is written as $$\mathcal L(\bold x, \bold y ; \theta) = -\sum _{i=1}^{N} \log p _{\theta}(y _i|x _i)$$

For the regularized data $\tilde D = (\tilde \bold x,\tilde \bold y )$, the loss would $\mathcal L(\tilde \bold x , \tilde \bold y ; \theta) = \mathcal L(\bold x + \xi _x , \bold y + \xi _y ; \theta)$. This loss function can be approximated using a Taylor expansion around $\bold x$ and $\bold y$ to the second order, then taking expectation over $\xi$, we would obtain the following expanded form

$$\mathcal L( \tilde \bold x , \tilde \bold y ; \theta) \approx \mathcal L(\bold x, \bold y ; \theta) -\sum _{i=1}^{N} \frac{\nu _y}{2} tr(H _y^{(n)}) -\sum _{i=1}^{N} \frac{\nu _x}{2} tr(H _x^{(n)})$$

Where $H _x^{(n)} = \frac{\partial^2 \mathcal L}{\partial^2 x}( x, y)| _{\bold x}$ is the Hessian of $\mathcal L$ with respect to $x$ and evaluated at $\bold x$, and $tr(.)$ is the trace whch is the sum of elements on the main diagonal. I refer the reader to 9 for details of derivations.

As a conclusion, adding noise can be seen as form of regularization on the loss function that can be controlled by $\nu _x$ and $\nu _y$.

## Implementation & Experiments

To illustrate Mixture Density Network, we will consider a dataset coming from a diagonal Gaussian Mixture Model10 with $K$ components and of joint density

$$p(x,y) = \sum _{i=1}^{K}w _i \mathcal N(y | \mu _{i, y} ; \Sigma _{i, y} )\mathcal N(x | \mu _{i, x} ; \Sigma _{i, x})$$

With this model, it is easier to derive $p(y|x)$ by simply removing the contribution of $x$ in the model. The conditional density $p(y|x)$ can be written as

$$p(y|x) = \sum _{i=1}^{K}W _i(x) \mathcal N(y | \mu _{i, y} ; \Sigma _{i, y} )$$

where

$$W _i(x) = \frac{w _i\mathcal N(x | \mu _{i, x} ; \Sigma _{i, x})}{\sum _{j}^{K}w _j\mathcal N(x | \mu _{j, x} ; \Sigma _{j, x})}$$

Now we have the framework, we can simulate a dataset $D = \lbrace (x _i,y _i)\rbrace _{i=1}^{N}$ from $p(x,y)$ (code: scripts/data _simulator.py) and compute the exact conditional density $p(y|x)$ at any point $x$.

The next step would be to train a model (neural network $\theta$) to approximate the conditional density $p(y|x)$ by a mixture of $M$ components

$$\hat p(y|x) = \sum _{i=1}^{M}\pi _i(x ; \theta ) \mathcal N(y | \mu _i(x ; \theta ), \sigma^2 _i(x ; \theta) )$$

By minimising the loss function $L(\theta) = -\sum _{i=1}^{N} \log\Big( \sum _{j=1}^{M}\pi _j(x _i ; \theta ) \mathcal N(y _i | \mu _j(x _i ; \theta ), \sigma^2 _j(x _i ; \theta) )\Big)$ which is the negative log-likelihood of the dataset $D$, we obtain a conditional density estimation based on neural networks (code: scripts/mdn.py).

Here is an example of simulated data with its 3D surface density, a conditional density at $x=-1$ and the estimated conditional density after training the model. The dashboard https://nn-cde.imadelhanafi.com allows multiple choices of hyper-parameters for the training as well as different regularisers (L2, Dropout).

## Conclusion

In this post we showed how to use neural networks for conditional density estimation with a user-friendly dashboard to do the training on simulated data. This post is still far from a complete study of conditional density estimation and many aspects of the implementation are subject to improvement (Train on real datasets like New York City yellow taxi dataset, more options for CDE, possibility to train on user’s data with dimensions > 1 ).

References

1. Gaussian mixture model https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model [return]
2. Bandwidth selection for KDE https://en.wikipedia.org/wiki/Kernel_density_estimation#Bandwidth_selection [return]
3. Universal Approximation Theorem https://en.wikipedia.org/wiki/Universal_approximation_theorem [return]
4. Mixture Density Estimation, Christopher Bishop https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004.pdf [return]
5. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning., p. 65 [return]
6. On the Variance of the Adaptive Learning Rate and Beyond, L Liu et al. https://arxiv.org/pdf/1908.03265.pdf [return]
7. Laplace distribution https://en.wikipedia.org/wiki/Laplace_distribution [return]
8. Training with Noise is Equivalent to Tikhonov Regularization, C. Bishop https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bishop-tikhonov-nc-95.pdf [return]
9. Noise regularization for CDE, J. Rothfuss et al https://arxiv.org/pdf/1907.08982.pdf [return]
10. Conditional Gaussian Mixture Models for Environmental Risk Mapping, N Gilardi et al https://bengio.abracadoudou.com/cv/publications/pdf/rr02-12.pdf [return]