The Ultimate Goal

Generative models, which fit a distribution from samples and then generate new examples from it, have recently seen staggering development. Many generated images and audio clips are of amazing quality and realism. Formally, given a random variable $X\in\mathcal{X}$, we would like to fit an approximate distribution $p_\theta(x_i)\propto \exp{(\theta_i)}$, where $x_1,\dots,x_n\in\mathcal{X}$ is some discretization.

Most simply, this problem could be solved by minimizing the Kullback-Leibler (KL) divergence, essentially pulling the approximate and the real distribution close: $\theta^*=\arg\min_\theta D_{KL}(p_X\|p_\theta)$. However, this is only tractable when the space $\mathcal{X}$ is small and low-dimensional.

Therefore, the development and the capacity of the state-of-the-art generative models are largely built upon the fundamental advances in autoregressive density estimation1, variational inference2, and generative adversarial networks3. Let us now look at how they approach this goal, and what their common limitations are.

What is Already There

The core problem is that modeling a high-dimensional joint density requires exponentially many parameters as the dimension grows. The following methods are different approaches to circumvent this issue by making assumptions, simplifications, or viewing the problem from another perspective.

Autoregressive Models (ARs)

Autoregressive models typically factorize the joint distribution by the chain rule of probability, and impose some conditional independence assumptions to reduce the number of conditionals needed. The following formula captures them all, where $\sigma:\mathbb{N}_n\rightarrow\mathbb{N}_n$ is a permutation of the dimensions, included to keep the indices general.

$$p_X(\mathbf{x})=\prod_{i=1}^D p_{X_{\sigma(i)}}\left(x_{\sigma(i)}|x_{\sigma(1)},\dots,x_{\sigma(i-1)}\right)$$

The model is usually straightforward, but the choice of ordering can be an issue. Also, the autoregressive nature tends to make generation slow, since dimensions must be sampled one at a time.

Variational Autoencoders (VAEs)

Another perspective is to represent $p_\theta$ as the marginalization over a latent random variable $Z\in\mathcal{Z}$. Then, with the relation below, maximizing the evidence lower bound pushes the approximate $p_\theta$ close to $p_X$.

$$\log{p_\theta(x)}\geq-D_{KL}\left(q_\theta(z|x)\|p(z)\right)+\mathbb{E}\left[\log{p_\theta(x|z)}\right]$$

VAEs are straightforward to implement and optimize, and efficient at generation and at capturing structure in high-dimensional spaces. However, VAEs often miss fine-grained details.

Generative Adversarial Networks (GANs)

We could also tackle this problem from the perspective of a two-player zero-sum game. We have two players, a generator $G$ and a discriminator $D$. The generator tries to generate fake examples from the distribution $p_\theta$ that mimic the true distribution, and the discriminator tries to distinguish fake examples from real data points. The objective can then be written as follows,

$$\underset{G}{\arg\min}\,\sup_D\left[\underset{X}{\mathbb{E}}\log\left(D(X)\right)+\underset{Z}{\mathbb{E}}\log\left(1-D(G(Z))\right)\right]$$

Intuition behind GAN's two-player game.
Image Credits to Generative Adversarial Networks (GANs) in 50 lines of code (PyTorch)

This task essentially minimizes the Jensen-Shannon divergence, which is itself a function of KL divergences. GANs are infamously unstable to train, and getting training off the ground is also hard.

Hail to KL-divergence

KL-div
Image Credits to Understanding Cross-entropy for Machine Learning

As we can see, all the state-of-the-art methods rely on the KL-divergence one way or another. Even GANs are in effect minimizing a divergence deeply related to it. However, the KL-divergence is known to have trouble capturing the low-probability tails of the density function, because it is essentially an expected deviation, weighted by the density itself.

But is There Another Choice?

Sure. All the above methods try to approximate the density function $p_X$ directly. Why can't we instead approximate another function deeply related to $p_X$, such as the cumulative distribution function (CDF) or the inverse CDF? To achieve this goal, let us look at what tools we already have.

Quantile

Let $X$ be a random variable with CDF $F_X(x)=\mathbb{P}(X\leq x)$. The $\tau$-th quantile of $X$ is given by,
$$Q_X(\tau)=F_X^{-1}(\tau)=\inf\{x: F_X(x)\geq\tau\}$$

This essentially means that we need to find a point $x$ such that a $\tau$ fraction of the data points lies below $x$. To make the example concrete, consider a Gaussian random variable $X\sim\mathcal{N}(5, 3)$: the 0.1-th quantile is $Q_X(0.1)\approx1.155$ and the 0.9-th quantile is $Q_X(0.9)\approx8.845$.
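As a quick numeric sanity check (using Python's standard library; not part of the original derivation), we can verify these quantiles directly:

```python
from statistics import NormalDist

# X ~ N(mu=5, sigma=3); the quantile function is the inverse CDF
X = NormalDist(mu=5, sigma=3)

q10 = X.inv_cdf(0.1)  # 0.1-th quantile
q90 = X.inv_cdf(0.9)  # 0.9-th quantile

print(round(q10, 3), round(q90, 3))  # roughly 1.155 and 8.845

# Feeding a quantile back through the CDF recovers tau
assert abs(X.cdf(q10) - 0.1) < 1e-9
```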

Example PDF

Quantile Regression

What could quantiles do? One step beyond linear regression. Given a dataset $(X,Y)$ and a quantile $\tau\in(0,1)$, approximate the conditional quantile function at $\tau$: $Q_{Y|X}(\tau)=X\beta_\tau$, under the loss function
$$\rho_\tau(u)=\begin{cases} (\tau-1) u & u\leq 0 \\ \tau u & u> 0 \end{cases}$$
where $u=Y-X\beta_\tau$ is the error.

As we can see, for a fixed $\tau$, the formulation is essentially the same as linear regression apart from the special loss function. How is this regression useful? Let us look at an example.

Suppose you ordered UberEats, and you have a dataset of historical delivery distances and times. Now you need to give a time-range estimate, given the distance, that covers 80% of the customers' delivery times. We could fit a 0.1-th and a 0.9-th quantile regression model and report the range between them.
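Here is a minimal sketch of that idea on synthetic data, fit by subgradient descent on the quantile loss (the data-generating process and hyperparameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical delivery data: time grows with distance, and so does the noise
dist = rng.uniform(1.0, 10.0, 500)
time = 5.0 + 2.0 * dist + rng.normal(0.0, 1.0 + 0.3 * dist)

def fit_quantile_line(x, y, tau, lr=0.01, steps=5000):
    """Fit y ~ a*x + b under the tau-quantile (pinball) loss by subgradient descent."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        u = y - (a * x + b)                    # residuals
        g = np.where(u > 0, -tau, 1.0 - tau)   # d rho_tau / d prediction
        a -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return a, b

a_lo, b_lo = fit_quantile_line(dist, time, 0.1)
a_hi, b_hi = fit_quantile_line(dist, time, 0.9)

# The two fitted lines should bracket roughly 80% of the delivery times
inside = np.mean((time >= a_lo * dist + b_lo) & (time <= a_hi * dist + b_hi))
print(f"coverage: {inside:.2f}")
```

At the optimum of the $\tau$-quantile loss, about a $\tau$ fraction of points lies below the fitted line, which is exactly why the 0.1 and 0.9 lines form an 80% band.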

Quantile Regression

Quantile Loss

Take another look at the expression of the quantile loss: the penalty for underestimation and overestimation differs, depending on $\tau$. At $\tau=0.1$, we have
$$\rho_{0.1}(u)=\begin{cases} -0.9 u & u\leq 0 \\ 0.1 u & u> 0 \end{cases}$$
For underestimation ($u>0$), the penalty per unit of error is 0.1, but for overestimation ($u\leq 0$) it is 0.9. If the regressor starts in the middle of the data blob, how should it move to minimize the loss? Since underestimation is penalized much less, the regressor will move down to the red line, which is essentially how quantile regression works.
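We can also verify numerically that minimizing the mean quantile loss over a constant prediction recovers the $\tau$-quantile of the sample (a toy check, with a grid search standing in for a proper optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, 20_000)  # samples from N(5, 3)

def mean_pinball(q, tau):
    # Average rho_tau(x - q) over the sample, for a constant prediction q
    u = x - q
    return np.mean(np.where(u > 0, tau * u, (tau - 1) * u))

# Grid-search the constant q that minimizes the tau = 0.1 pinball loss
grid = np.linspace(-2.0, 6.0, 801)
best = grid[np.argmin([mean_pinball(q, 0.1) for q in grid])]

print(best)  # essentially the empirical 0.1-quantile of the sample
```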

If the quantile loss is small at every $\tau$, though, we can conclude that we have captured almost all the details of the distribution, even where the density is low. So could this be our substitute for the KL-divergence?

Modeling from Another Perspective

So, instead of modeling the density directly, we could approximate the inverse CDF. This is almost equivalent, because we can recover a density estimate from the inverse CDF.

Similar to approximating density functions, we have to decide on a factorization of the Quantile function (inverse CDF) in high-dimensional space to make it tractable.

  1. If the quantile function takes a single scalar $\tau$, we need the comonotonicity property to ensure invertibility (obvious because there is no negative probability: the CDF must be non-decreasing along every dimension, which is essentially what comonotonicity implies). $F_X^{-1}(\tau)=\left(F_{X_1}^{-1}(\tau), F_{X_2}^{-1}(\tau),\dots,F_{X_n}^{-1}(\tau)\right)$ is a very strong assumption, and can hardly be used broadly.
  2. On the other hand, if we use a separate $\tau_i$ for each component, $F_X^{-1}(\vec{\tau})=\left(F_{X_1}^{-1}(\tau_1), F_{X_2}^{-1}(\tau_2),\dots,F_{X_n}^{-1}(\tau_n)\right)$, we are assuming independence between all components, which is unrealistically restrictive for many domains.

So we do the same as in autoregressive models: factorize the CDF, and make some conditional independence assumptions.
$$\begin{align*} F_X(x)&=\mathbb{P}(X_1\leq x_1,\dots,X_n\leq x_n)=\prod_{i=1}^n F_{X_i|X_{i-1},\dots,X_1}(x_i) \\ F_X^{-1}(\tau_\text{joint})&=\left(F_{X_1}^{-1}(\tau_1),\dots,F_{X_n|X_{n-1},\dots}^{-1}(\tau_n)\right) \end{align*}$$

Let’s Reparameterize on Sampled Quantiles

Naturally, since we are approximating the quantile function (inverse CDF), we choose the quantile loss to minimize. However, does this loss really lead to some divergence metric between $p_\theta$ and $p_X$? In other words, are we doing the correct thing, eventually approximating the density function?

Validity

Let us compute the expected quantile loss over the distribution for a candidate quantile value $q$, following these steps:

  1. Expand the definition of Expectation.
  2. Split the first integral, and merge one of them with the second.
  3. Split the first integral again, and evaluate the second according to the definition of Expectation again.
  4. Evaluate the first integral by the definition of the CDF, and integrate the second by parts, with $u=x$ and $dv=f_P(x)dx$.
  5. Cancel the first two terms, and we arrive at the final expression.

$$\begin{align*} g_\tau (q)&=\mathbb{E}_{X\sim P}\left[\rho_\tau\left(X-q\right)\right]\\ &=\int_{-\infty}^{q}(x-q)(\tau-1)f_P(x)dx+\int_{q}^{\infty}(x-q)\tau f_P(x)dx\\ &=\int_{-\infty}^{q}(q-x)f_P(x)dx+\int_{-\infty}^{\infty}(x-q)\tau f_P(x)dx\\ &=q\int_{-\infty}^{q}f_P(x)dx-\int_{-\infty}^{q}xf_P(x)dx+\left(\mathbb{E}_{X\sim P}\left[X\right]-q\right)\tau\\ &=qF_P(q)-\left(\left[xF_P(x)\right]_{-\infty}^q -\int_{-\infty}^q F_P(x)dx\right)+\left(\mathbb{E}_{X\sim P}\left[X\right]-q\right)\tau\\ &=\int_{-\infty}^q F_P(x)dx+\left(\mathbb{E}_{X\sim P}\left[X\right]-q\right)\tau \end{align*}$$
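To double-check the final expression, we can compare a Monte Carlo estimate of the expected quantile loss against the closed form for a standard Gaussian (the choice of $P$, $\tau$, and $q$ here is arbitrary):

```python
import numpy as np
from statistics import NormalDist

P = NormalDist(0.0, 1.0)   # example distribution for the check
tau, q = 0.3, 0.5

# Monte Carlo estimate of E[rho_tau(X - q)]
rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, 200_000) - q
mc = np.mean(np.where(u > 0, tau * u, (tau - 1) * u))

# Final expression: int_{-inf}^{q} F_P(x) dx + (E[X] - q) * tau,
# with the integral evaluated by the trapezoid rule on a fine grid
xs = np.linspace(-10.0, q, 20_001)
F = np.array([P.cdf(v) for v in xs])
dx = xs[1] - xs[0]
closed = (0.5 * (F[0] + F[-1]) + F[1:-1].sum()) * dx + (P.mean - q) * tau

print(mc, closed)  # the two estimates agree closely
```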

Differentiating, $g_\tau'(q)=F_P(q)-\tau$, which vanishes at $q=F_P^{-1}(\tau)$; so the true quantile function minimizes the expected quantile loss over $P$. Let us derive an expression for the relative difference,

$$\begin{align*} g_\tau(q)-g_\tau(F_P^{-1}(\tau))&=\int_{F_P^{-1}(\tau)}^{q}F_P(x)dx+\left(F_P^{-1}(\tau)-q\right)\tau\\ &=\int_{F_P^{-1}(\tau)}^{q}\left(F_P(x)-\tau\right)dx \end{align*}$$

Suppose we have a distribution $Q$ whose quantile function is $F_Q^{-1}(\tau)$; then the expected relative loss over all $\tau$'s is the following. Finally, we observe that a metric on two distributions emerges, called the quantile divergence.

$$\begin{align*} \mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\left[g_\tau\left(F_Q^{-1}(\tau)\right)-g_\tau\left(F_P^{-1}(\tau)\right)\right]&=\int_{0}^{1}\left[\int_{F_P^{-1}(\tau)}^{F_Q^{-1}(\tau)}\left(F_P(x)-\tau\right)dx\right]d\tau\\ \mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\left[g_\tau\left(F_Q^{-1}(\tau)\right)\right]&=\underbrace{\int_{0}^{1}\left[\int_{F_P^{-1}(\tau)}^{F_Q^{-1}(\tau)}\left(F_P(x)-\tau\right)dx\right]d\tau}_{\text{Quantile divergence }q(P,Q)}\\ &\quad+\underbrace{\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\left[g_\tau\left(F_P^{-1}(\tau)\right)\right]}_{\text{does not depend on }Q} \end{align*}$$
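The quantile divergence can be evaluated numerically for simple distributions. The sketch below (a naive midpoint/trapezoid discretization, my own construction) checks that it is positive for two different Gaussians and zero when the distributions coincide:

```python
import numpy as np
from statistics import NormalDist

def quantile_divergence(P, Q, n_tau=200, n_x=201):
    # q(P,Q) = int_0^1 [ int_{F_P^-1(tau)}^{F_Q^-1(tau)} (F_P(x) - tau) dx ] dtau
    taus = (np.arange(n_tau) + 0.5) / n_tau           # midpoint rule on (0, 1)
    total = 0.0
    for tau in taus:
        a, b = P.inv_cdf(tau), Q.inv_cdf(tau)
        xs = np.linspace(a, b, n_x)
        vals = np.array([P.cdf(x) - tau for x in xs])
        dx = (b - a) / (n_x - 1)                      # signed step handles b < a
        total += (0.5 * (vals[0] + vals[-1]) + vals[1:-1].sum()) * dx
    return total / n_tau

P = NormalDist(0.0, 1.0)
Q = NormalDist(1.0, 1.0)

d_pq = quantile_divergence(P, Q)   # strictly positive when Q differs from P
d_pp = quantile_divergence(P, P)   # zero when the distributions match
print(d_pq, d_pp)
```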

Quantile Divergence

This means that modeling the quantile function with quantile loss does lead to an eventual approximation to the true distribution. Let us have a closer look at how quantile divergence measures the difference between two distributions.

Quantile divergence
Correction: the integrand should be $F_P(x)-\tau$, credits to 4.

For a given $\tau$, the inner integral evaluates to the blue area, and we sum these areas over all $\tau$'s. The integral therefore vanishes only if the two quantile functions match exactly at every $\tau$, which supports the claim that the quantile loss will not miss any low-density region of the distribution.

Unbiased Estimate

Finally, if we take the gradient of the expected relative quantile loss, we get an unbiased estimate of the gradient of the quantile divergence. Once again, this confirms that the new scheme works, leading to an approximation of the true distribution.

$$\begin{align*} \nabla_\theta\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\left[g_\tau\left(\bar{Q_\theta}(\tau)\right)\right]&=\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\mathbb{E}_{X\sim P}\left[\nabla_\theta\rho_\tau\left(X-\bar{Q_\theta}(\tau)\right)\right]\\ &=\nabla_\theta q\left(P, \bar{Q_\theta}\right) \end{align*}$$
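As a toy illustration of this unbiased-gradient property, we can fit a one-dimensional quantile model by stochastic gradient descent on the quantile loss with sampled $\tau$. The model family $Q_\theta(\tau)=\theta_0+\theta_1\Phi^{-1}(\tau)$, the iterate averaging, and all hyperparameters are illustrative choices, not from the paper; for Gaussian data this family contains the true quantile function:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
phi_inv = NormalDist().inv_cdf          # standard normal inverse CDF

# Data distribution P = N(5, 3); model Q_theta(tau) = theta0 + theta1 * phi_inv(tau),
# whose true minimizer under the expected quantile loss is theta = (5, 3)
theta = np.zeros(2)
theta_avg = np.zeros(2)
lr, n_steps, burn_in = 0.05, 20_000, 10_000

for step in range(n_steps):
    tau = rng.uniform(0.01, 0.99)       # sampled quantile level
    z = phi_inv(tau)                    # feature of tau
    x = rng.normal(5.0, 3.0)            # one sample from the data distribution
    pred = theta[0] + theta[1] * z
    # Subgradient of rho_tau(x - pred) w.r.t. pred, times d pred / d theta
    g = -tau if x > pred else 1.0 - tau
    theta -= lr * g * np.array([1.0, z])
    if step >= burn_in:                 # average the late iterates to reduce noise
        theta_avg += theta / (n_steps - burn_in)

print(theta_avg)  # approaches (5, 3), i.e. the true (mu, sigma)
```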

Source of Randomness

We know that, specifically for VAEs, there is a reparameterization trick that separates out the source of randomness into a standard Gaussian distribution. Now that we model the quantile function, how do we get samples from it? Where is the source of randomness now?

It is $\tau\sim\mathcal{U}([0,1])$. Since quantile functions are essentially inverse CDFs, drawing a uniformly random $\tau$ and feeding it to the model gives us a sample back. Here is an illustration of how it works.
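A toy two-dimensional sketch of this sampling scheme, using a hand-picked Gaussian model whose conditional quantile functions are known in closed form (my own example, not from the paper):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def sample_once():
    # Toy target: X1 ~ N(0, 1) and X2 | X1 ~ N(0.8 * X1, 0.6) (an assumed model)
    tau1, tau2 = rng.uniform(1e-12, 1.0, 2)       # the only source of randomness
    x1 = NormalDist(0.0, 1.0).inv_cdf(tau1)       # F_{X1}^{-1}(tau1)
    x2 = NormalDist(0.8 * x1, 0.6).inv_cdf(tau2)  # F_{X2|X1}^{-1}(tau2)
    return x1, x2

samples = np.array([sample_once() for _ in range(20_000)])
corr = np.corrcoef(samples.T)[0, 1]
print(round(corr, 2))  # recovers the target correlation of 0.8
```

A trained model replaces the closed-form conditional inverse CDFs with learned ones, but the uniform $\tau$'s play exactly the same role.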

Updated reparameterization technique.

Results

Gated PixelCNN5 is the model the authors modify. The original formulation has a location-dependent conditioning variable, which is repurposed to condition on the random source $\tau$. The modified version, PixelIQN, produces pixel values directly, instead of outputting a discrete distribution over 256 levels for each RGB channel.

PixelIQN Architecture, similar to Gated PixelCNN with τ\tau taking the place of location-dependent conditioning. Credits to 4.

The datasets used are CIFAR-10 and ImageNet 32x32, with metrics including Fréchet Inception Distance (FID, lower is better) and Inception Score (IS, higher is better).

Training and Performance

Training curve.
Training curves. Dotted lines correspond to models trained with class-label conditioning. Credits to 4.
Performance
Inception score and FID for CIFAR-10 and ImageNet. PixelIQN(1) is the small 15-layer version of the model. Models marked * refer to class-conditional training. Credits to 4.

Samples

CIFAR-10 Samples
CIFAR-10: Real example images (left), samples generated by PixelCNN (center), and samples generated by PixelIQN (right). Credits to 4.
ImageNet 32x32 Samples
ImageNet 32x32: Real example images (left), samples generated by PixelCNN (center), and samples generated by PixelIQN (right). Credits to 4.

Inpainting

Inpainting
Small ImageNet inpainting examples. Left image is the input provided to the network at the beginning of sampling, right is the original image, columns in between show different completions. Credits to 4.

Class Conditioning

Conditional Generation
Class-conditional samples from PixelIQN. Credits to 4.

Conclusion

The authors recognized that most current state-of-the-art models are built on top of autoregressive models, VAEs, and GANs, all of which employ the KL-divergence as the measure between two distributions. Instead, we can use the quantile function and the quantile loss to achieve the same tasks, which may be more suitable for applications that care about the low-density regions of the distribution.

Although this new approach does not reduce training or inference time, it offers an important new perspective on density estimation.


  1. van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016.

  2. Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

  3. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

  4. Ostrovski, G., Dabney, W., and Munos, R. Autoregressive quantile networks for generative modeling. In International Conference on Machine Learning, PMLR, 2018.

  5. van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., and Kavukcuoglu, K. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.