The Ultimate Goal

Generative models, which fit a distribution from samples and then generate new examples from it, have recently seen staggering development. Many generated images and audio clips are of amazing quality and realism. Formally, given a random variable $X\in\mathcal{X}$, we would like to fit an approximate distribution $p_\theta(x_i)\propto \exp{(\theta_i)}$, where $x_1,\dots,x_n\in\mathcal{X}$ is some discretization.

Most simply, this problem could be solved by minimizing the Kullback-Leibler (KL) divergence, essentially pulling the approximate and the real distribution close: $\theta^*=\arg\min_\theta D_{KL}(p_X\|p_\theta)$. However, this is only tractable when the space $\mathcal{X}$ is small and low-dimensional.

Therefore, the development and the capacity of the state-of-the-art generative models are largely built upon the fundamental advances in autoregressive density estimation1, variational inference2, and generative adversarial networks3. Let us now look at how they approach this goal, and what their common limitations are.

What is Already There

The core problem is that modeling a high-dimensional joint density requires exponentially many parameters as the dimension grows. The following methods are different approaches to circumvent this issue by making assumptions, simplifications, or viewing the problem from another perspective.

Autoregressive Models (ARs)

Autoregressive models typically factorize the joint distribution by the chain rule of probability, and impose some conditional independence assumptions to reduce the number of conditionals needed. The following formula captures them all, where $\sigma:\mathbb{N}_n\rightarrow\mathbb{N}_n$ is a permutation of the dimensions, included to keep the indices general.

$$p_X(\mathbf{x})=\prod_{i=1}^D p_{X_{\sigma(i)}}\left(x_{\sigma(i)}|x_{\sigma(1)},\dots,x_{\sigma(i-1)}\right)$$

The model is usually straightforward, but the choice of ordering can be an issue. Also, the autoregressive nature tends to make generation slow, since dimensions must be sampled one at a time.

Variational Autoencoders (VAEs)

Another perspective is to represent $p_\theta$ as the marginalization over a latent random variable $Z\in\mathcal{Z}$. Then, with the relation below, maximizing the evidence lower bound pushes the approximate $p_\theta$ close to $p_X$.

$$\log{p_\theta(x)}\geq-D_{KL}\left(q_\theta(z|x)\|p(z)\right)+\mathbb{E}\left[\log{p_\theta(x|z)}\right]$$

VAEs are straightforward to implement and optimize, and efficient at generation and at capturing structure in high-dimensional spaces. However, VAEs often miss fine-grained details.

Generative Adversarial Networks (GANs)

We could also tackle this problem from the perspective of a two-player zero-sum game. We have two players, a generator $G$ and a discriminator $D$. The generator tries to generate fake examples from the distribution $p_\theta$ that mimic the true distribution, and the discriminator tries to distinguish fake examples from real data points. The objective can then be written as follows,

$$\underset{G}{\arg\min}\,\sup_D\left[\underset{X}{\mathbb{E}}\log\left(D(X)\right)+\underset{Z}{\mathbb{E}}\log\left(1-D(G(Z))\right)\right]$$

Intuition behind GAN's two-player game.
Image Credits to Generative Adversarial Networks (GANs) in 50 lines of code (PyTorch)

This task essentially minimizes the Jensen-Shannon divergence, which is itself a function of KL divergences. GANs are infamously unstable to train, and getting training off the ground is also hard.

Hail to KL-divergence

KL-div
Image Credits to Understanding Cross-entropy for Machine Learning

As we can see, all the state-of-the-art methods rely on the KL-divergence one way or another. Even GANs are in effect minimizing a divergence deeply related to it. However, the KL-divergence is known to have trouble capturing the low-probability tails of the density function, because it is essentially an expected deviation, weighted by the density itself.

But is There Another Choice?

Sure. All the above methods try to approximate the density function $p_X$ directly. Why can't we instead approximate another function deeply related to $p_X$, such as the cumulative distribution function (CDF) or the inverse CDF? To achieve this goal, let us look at what tools we already have.

Quantile

Let $X$ be a random variable with CDF $F_X(x)=\mathbb{P}(X\leq x)$. The $\tau$-th quantile of $X$ is given by,
$$Q_X(\tau)=F_X^{-1}(\tau)=\inf\{x: F_X(x)\geq\tau\}$$

This essentially means that we need to find a point $x$ such that a $\tau$ fraction of the data points lies below $x$. To make the example concrete, consider a Gaussian random variable $X\sim\mathcal{N}(5, 3)$: the 0.1-th quantile is $Q_X(0.1)\approx1.155$ and the 0.9-th quantile is $Q_X(0.9)\approx8.845$.
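As a quick numeric sanity check (using Python's standard library; not part of the original derivation), we can verify these quantiles directly:

```python
from statistics import NormalDist

# X ~ N(mu=5, sigma=3); the quantile function is the inverse CDF
X = NormalDist(mu=5, sigma=3)

q10 = X.inv_cdf(0.1)  # 0.1-th quantile
q90 = X.inv_cdf(0.9)  # 0.9-th quantile

print(round(q10, 3), round(q90, 3))  # roughly 1.155 and 8.845

# Feeding a quantile back through the CDF recovers tau
assert abs(X.cdf(q10) - 0.1) < 1e-9
```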

Example PDF

Quantile Regression

What could quantiles do? One step beyond linear regression. Given a dataset $(X,Y)$ and a quantile $\tau\in(0,1)$, approximate the conditional quantile function at $\tau$: $Q_{Y|X}(\tau)=X\beta_\tau$, under the loss function
$$\rho_\tau(u)=\begin{cases} (\tau-1) u & u\leq 0 \\ \tau u & u> 0 \end{cases}$$
where $u=Y-X\beta_\tau$ is the error.

As we can see, for a fixed $\tau$, the formulation is essentially the same as linear regression apart from the special loss function. How is this regression useful? Let us look at an example.

Suppose you ordered UberEats, and you have a dataset of historical delivery distances and times. Now you need to give a time-range estimate, given the distance, that covers 80% of the customers' delivery times. We could fit a 0.1-th and a 0.9-th quantile regression model and report the range between them.
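Here is a minimal sketch of that idea on synthetic data, fit by subgradient descent on the quantile loss (the data-generating process and hyperparameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical delivery data: time grows with distance, and so does the noise
dist = rng.uniform(1.0, 10.0, 500)
time = 5.0 + 2.0 * dist + rng.normal(0.0, 1.0 + 0.3 * dist)

def fit_quantile_line(x, y, tau, lr=0.01, steps=5000):
    """Fit y ~ a*x + b under the tau-quantile (pinball) loss by subgradient descent."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        u = y - (a * x + b)                    # residuals
        g = np.where(u > 0, -tau, 1.0 - tau)   # d rho_tau / d prediction
        a -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return a, b

a_lo, b_lo = fit_quantile_line(dist, time, 0.1)
a_hi, b_hi = fit_quantile_line(dist, time, 0.9)

# The two fitted lines should bracket roughly 80% of the delivery times
inside = np.mean((time >= a_lo * dist + b_lo) & (time <= a_hi * dist + b_hi))
print(f"coverage: {inside:.2f}")
```

At the optimum of the $\tau$-quantile loss, about a $\tau$ fraction of points lies below the fitted line, which is exactly why the 0.1 and 0.9 lines form an 80% band.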

Quantile Regression

Quantile Loss

Take another look at the expression of the quantile loss: the penalty for underestimation and overestimation differs, depending on $\tau$. At $\tau=0.1$, we have
$$\rho_{0.1}(u)=\begin{cases} -0.9 u & u\leq 0 \\ 0.1 u & u> 0 \end{cases}$$
For underestimation ($u>0$), the penalty per unit of error is 0.1, but for overestimation ($u\leq 0$) it is 0.9. If the regressor starts in the middle of the data blob, how should it move to minimize the loss? Since underestimation is penalized much less, the regressor will move down to the red line, which is essentially how quantile regression works.
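We can also verify numerically that minimizing the mean quantile loss over a constant prediction recovers the $\tau$-quantile of the sample (a toy check, with a grid search standing in for a proper optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, 20_000)  # samples from N(5, 3)

def mean_pinball(q, tau):
    # Average rho_tau(x - q) over the sample, for a constant prediction q
    u = x - q
    return np.mean(np.where(u > 0, tau * u, (tau - 1) * u))

# Grid-search the constant q that minimizes the tau = 0.1 pinball loss
grid = np.linspace(-2.0, 6.0, 801)
best = grid[np.argmin([mean_pinball(q, 0.1) for q in grid])]

print(best)  # essentially the empirical 0.1-quantile of the sample
```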

If the quantile loss is small at every $\tau$, though, we can conclude that we have captured almost all the details of the distribution, even where the density is low. So could this be our substitute for the KL-divergence?

Modeling from Another Perspective

So, instead of modeling the density directly, we could approximate the inverse CDF. This is almost equivalent, because we can recover a density estimate from the inverse CDF.

Similar to approximating density functions, we have to decide on a factorization of the Quantile function (inverse CDF) in high-dimensional space to make it tractable.

  1. If the quantile function takes a single scalar $\tau$, we need the comonotonicity property to ensure invertibility (obvious because there is no negative probability: the CDF must be non-decreasing along every dimension, which is essentially what comonotonicity implies). $F_X^{-1}(\tau)=\left(F_{X_1}^{-1}(\tau), F_{X_2}^{-1}(\tau),\dots,F_{X_n}^{-1}(\tau)\right)$ is a very strong assumption, and can hardly be used broadly.
  2. On the other hand, if we use a separate $\tau_i$ for each component, $F_X^{-1}(\vec{\tau})=\left(F_{X_1}^{-1}(\tau_1), F_{X_2}^{-1}(\tau_2),\dots,F_{X_n}^{-1}(\tau_n)\right)$, we are assuming independence between all components, which is unrealistically restrictive for many domains.

So we do the same as in autoregressive models: factorize the CDF, and make some conditional independence assumptions.
$$\begin{align*} F_X(x)&=\mathbb{P}(X_1\leq x_1,\dots,X_n\leq x_n)=\prod_{i=1}^n F_{X_i|X_{i-1},\dots,X_1}(x_i) \\ F_X^{-1}(\tau_\text{joint})&=\left(F_{X_1}^{-1}(\tau_1),\dots,F_{X_n|X_{n-1},\dots}^{-1}(\tau_n)\right) \end{align*}$$

Let’s Reparameterize on Sampled Quantiles

Naturally, since we are approximating the quantile function (inverse CDF), we choose the quantile loss to minimize. However, does this loss really lead to some divergence metric between $p_\theta$ and $p_X$? In other words, are we doing the correct thing, eventually approximating the density function?

Validity

Let us compute the expected quantile loss over the distribution for a candidate quantile value $q$, following these steps:

  1. Expand the definition of Expectation.
  2. Split the first integral, and merge one of them with the second.
  3. Split the first integral again, and evaluate the second according to the definition of Expectation again.
  4. Evaluate the first integral by the definition of the CDF, and integrate the second by parts, with $u=x$ and $dv=f_P(x)dx$.
  5. Cancel the first two terms, and we arrive at the final expression.

$$\begin{align*} g_\tau (q)&=\mathbb{E}_{X\sim P}\left[\rho_\tau\left(X-q\right)\right]\\ &=\int_{-\infty}^{q}(x-q)(\tau-1)f_P(x)dx+\int_{q}^{\infty}(x-q)\tau f_P(x)dx\\ &=\int_{-\infty}^{q}(q-x)f_P(x)dx+\int_{-\infty}^{\infty}(x-q)\tau f_P(x)dx\\ &=q\int_{-\infty}^{q}f_P(x)dx-\int_{-\infty}^{q}xf_P(x)dx+\left(\mathbb{E}_{X\sim P}\left[X\right]-q\right)\tau\\ &=qF_P(q)-\left(\left[xF_P(x)\right]_{-\infty}^q -\int_{-\infty}^q F_P(x)dx\right)+\left(\mathbb{E}_{X\sim P}\left[X\right]-q\right)\tau\\ &=\int_{-\infty}^q F_P(x)dx+\left(\mathbb{E}_{X\sim P}\left[X\right]-q\right)\tau \end{align*}$$
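To double-check the final expression, we can compare a Monte Carlo estimate of the expected quantile loss against the closed form for a standard Gaussian (the choice of $P$, $\tau$, and $q$ here is arbitrary):

```python
import numpy as np
from statistics import NormalDist

P = NormalDist(0.0, 1.0)   # example distribution for the check
tau, q = 0.3, 0.5

# Monte Carlo estimate of E[rho_tau(X - q)]
rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, 200_000) - q
mc = np.mean(np.where(u > 0, tau * u, (tau - 1) * u))

# Final expression: int_{-inf}^{q} F_P(x) dx + (E[X] - q) * tau,
# with the integral evaluated by the trapezoid rule on a fine grid
xs = np.linspace(-10.0, q, 20_001)
F = np.array([P.cdf(v) for v in xs])
dx = xs[1] - xs[0]
closed = (0.5 * (F[0] + F[-1]) + F[1:-1].sum()) * dx + (P.mean - q) * tau

print(mc, closed)  # the two estimates agree closely
```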

Differentiating, $g_\tau'(q)=F_P(q)-\tau$, which vanishes at $q=F_P^{-1}(\tau)$; so the true quantile function minimizes the expected quantile loss over $P$. Let us derive an expression for the relative difference,

$$\begin{align*} g_\tau(q)-g_\tau(F_P^{-1}(\tau))&=\int_{F_P^{-1}(\tau)}^{q}F_P(x)dx+\left(F_P^{-1}(\tau)-q\right)\tau\\ &=\int_{F_P^{-1}(\tau)}^{q}\left(F_P(x)-\tau\right)dx \end{align*}$$

Suppose we have a distribution $Q$ whose quantile function is $F_Q^{-1}(\tau)$; then the expected relative loss over all $\tau$'s is the following. Finally, we observe that a metric on two distributions emerges, called the quantile divergence.

$$\begin{align*} \mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\left[g_\tau\left(F_Q^{-1}(\tau)\right)-g_\tau\left(F_P^{-1}(\tau)\right)\right]&=\int_{0}^{1}\left[\int_{F_P^{-1}(\tau)}^{F_Q^{-1}(\tau)}\left(F_P(x)-\tau\right)dx\right]d\tau\\ \mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\left[g_\tau\left(F_Q^{-1}(\tau)\right)\right]&=\underbrace{\int_{0}^{1}\left[\int_{F_P^{-1}(\tau)}^{F_Q^{-1}(\tau)}\left(F_P(x)-\tau\right)dx\right]d\tau}_{\text{Quantile divergence }q(P,Q)}\\ &\quad+\underbrace{\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\left[g_\tau\left(F_P^{-1}(\tau)\right)\right]}_{\text{does not depend on }Q} \end{align*}$$
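The quantile divergence can be evaluated numerically for simple distributions. The sketch below (a naive midpoint/trapezoid discretization, my own construction) checks that it is positive for two different Gaussians and zero when the distributions coincide:

```python
import numpy as np
from statistics import NormalDist

def quantile_divergence(P, Q, n_tau=200, n_x=201):
    # q(P,Q) = int_0^1 [ int_{F_P^-1(tau)}^{F_Q^-1(tau)} (F_P(x) - tau) dx ] dtau
    taus = (np.arange(n_tau) + 0.5) / n_tau           # midpoint rule on (0, 1)
    total = 0.0
    for tau in taus:
        a, b = P.inv_cdf(tau), Q.inv_cdf(tau)
        xs = np.linspace(a, b, n_x)
        vals = np.array([P.cdf(x) - tau for x in xs])
        dx = (b - a) / (n_x - 1)                      # signed step handles b < a
        total += (0.5 * (vals[0] + vals[-1]) + vals[1:-1].sum()) * dx
    return total / n_tau

P = NormalDist(0.0, 1.0)
Q = NormalDist(1.0, 1.0)

d_pq = quantile_divergence(P, Q)   # strictly positive when Q differs from P
d_pp = quantile_divergence(P, P)   # zero when the distributions match
print(d_pq, d_pp)
```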

Quantile Divergence

This means that modeling the quantile function with quantile loss does lead to an eventual approximation to the true distribution. Let us have a closer look at how quantile divergence measures the difference between two distributions.

Quantile divergence
Correction: the integrand should be $F_P(x)-\tau$, credits to 4.

For a given $\tau$, the inner integral evaluates to the blue area, and we sum these areas over all $\tau$'s. The integral therefore vanishes only if the two quantile functions match exactly at every $\tau$, which supports the claim that the quantile loss will not miss any low-density region of the distribution.

Unbiased Estimate

Finally, if we take the gradient of the expected relative quantile loss, we get an unbiased estimate of the gradient of the quantile divergence. Once again, this confirms that the new scheme works, leading to an approximation of the true distribution.

$$\begin{align*} \nabla_\theta\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\left[g_\tau\left(\bar{Q_\theta}(\tau)\right)\right]&=\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\mathbb{E}_{X\sim P}\left[\nabla_\theta\rho_\tau\left(X-\bar{Q_\theta}(\tau)\right)\right]\\ &=\nabla_\theta q\left(P, \bar{Q_\theta}\right) \end{align*}$$
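As a toy illustration of this unbiased-gradient property, we can fit a one-dimensional quantile model by stochastic gradient descent on the quantile loss with sampled $\tau$. The model family $Q_\theta(\tau)=\theta_0+\theta_1\Phi^{-1}(\tau)$, the iterate averaging, and all hyperparameters are illustrative choices, not from the paper; for Gaussian data this family contains the true quantile function:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
phi_inv = NormalDist().inv_cdf          # standard normal inverse CDF

# Data distribution P = N(5, 3); model Q_theta(tau) = theta0 + theta1 * phi_inv(tau),
# whose true minimizer under the expected quantile loss is theta = (5, 3)
theta = np.zeros(2)
theta_avg = np.zeros(2)
lr, n_steps, burn_in = 0.05, 20_000, 10_000

for step in range(n_steps):
    tau = rng.uniform(0.01, 0.99)       # sampled quantile level
    z = phi_inv(tau)                    # feature of tau
    x = rng.normal(5.0, 3.0)            # one sample from the data distribution
    pred = theta[0] + theta[1] * z
    # Subgradient of rho_tau(x - pred) w.r.t. pred, times d pred / d theta
    g = -tau if x > pred else 1.0 - tau
    theta -= lr * g * np.array([1.0, z])
    if step >= burn_in:                 # average the late iterates to reduce noise
        theta_avg += theta / (n_steps - burn_in)

print(theta_avg)  # approaches (5, 3), i.e. the true (mu, sigma)
```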

Source of Randomness

We know that, specifically for VAEs, there is a reparameterization trick that separates out the source of randomness into a standard Gaussian distribution. Now that we model the quantile function, how do we get samples from it? Where is the source of randomness now?

It is $\tau\sim\mathcal{U}([0,1])$. Since quantile functions are essentially inverse CDFs, drawing a uniformly random $\tau$ and feeding it to the model gives us a sample back. Here is an illustration of how it works.
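A toy two-dimensional sketch of this sampling scheme, using a hand-picked Gaussian model whose conditional quantile functions are known in closed form (my own example, not from the paper):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def sample_once():
    # Toy target: X1 ~ N(0, 1) and X2 | X1 ~ N(0.8 * X1, 0.6) (an assumed model)
    tau1, tau2 = rng.uniform(1e-12, 1.0, 2)       # the only source of randomness
    x1 = NormalDist(0.0, 1.0).inv_cdf(tau1)       # F_{X1}^{-1}(tau1)
    x2 = NormalDist(0.8 * x1, 0.6).inv_cdf(tau2)  # F_{X2|X1}^{-1}(tau2)
    return x1, x2

samples = np.array([sample_once() for _ in range(20_000)])
corr = np.corrcoef(samples.T)[0, 1]
print(round(corr, 2))  # recovers the target correlation of 0.8
```

A trained model replaces the closed-form conditional inverse CDFs with learned ones, but the uniform $\tau$'s play exactly the same role.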

Updated reparameterization technique.

Results

Gated PixelCNN5 is the model the authors modify. The original formulation has a location-dependent conditioning variable, which is repurposed to condition on the random source $\tau$. The modified version, PixelIQN, produces pixel values directly, instead of outputting a discrete distribution over 256 levels for each RGB channel.

PixelIQN Architecture, similar to Gated PixelCNN with τ\tau taking the place of location-dependent conditioning. Credits to 4.

The datasets used are CIFAR-10 and ImageNet 32x32, with metrics including Fréchet Inception Distance (FID, lower is better) and Inception Score (IS, higher is better).

Training and Performance

Training curve.
Training curves. Dotted lines correspond to models trained with class-label conditioning. Credits to 4.
Performance
Inception score and FID for CIFAR-10 and ImageNet. PixelIQN(1) is the small 15-layer version of the model. Models marked * refer to class-conditional training. Credits to 4.

Samples

CIFAR-10 Samples
CIFAR-10: Real example images (left), samples generated by PixelCNN (center), and samples generated by PixelIQN (right). Credits to 4.
ImageNet 32x32 Samples
ImageNet 32x32: Real example images (left), samples generated by PixelCNN (center), and samples generated by PixelIQN (right). Credits to 4.

Inpainting

Inpainting
Small ImageNet inpainting examples. Left image is the input provided to the network at the beginning of sampling, right is the original image, columns in between show different completions. Credits to 4.

Class Conditioning

Conditional Generation
Class-conditional samples from PixelIQN. Credits to 4.

Conclusion

The authors recognized that most current state-of-the-art models are built on top of autoregressive models, VAEs, and GANs, all of which employ the KL-divergence as the measure between two distributions. Instead, we can use the quantile function and the quantile loss to achieve the same tasks, which may be more suitable for applications that care about the low-density regions of the distribution.

Although this new approach does not reduce training or inference time, it offers an important new perspective on density estimation.


  1. van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016.

  2. Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

  3. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

  4. Ostrovski, G., Dabney, W., and Munos, R. Autoregressive quantile networks for generative modeling. In International Conference on Machine Learning, PMLR, 2018.

  5. van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., and Kavukcuoglu, K. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.