Diagnosing and Enhancing VAE Models (ICLR '19) [1]

Introduction

Even though variational autoencoders (VAEs) [2] have a wide variety of applications in deep generative modeling, many aspects of the underlying energy function remain poorly understood. It is commonly believed that the Gaussian encoder/decoder assumptions reduce the effectiveness of VAEs in generating realistic samples.

In this paper, the authors rigorously show that reaching the global optimum of the VAE objective does not guarantee that the model has learned the true data distribution: there can exist alternative solutions that reach the global optimum and yet do not assign the same probability measure as the ground-truth distribution. The paper also proposes a two-stage remedy, a two-stage VAE model, to address this issue and enhance the original VAE so that any globally minimizing solution is uniquely matched to the ground-truth distribution.

Problem Definition:

  • The starting point is the desire to learn a probabilistic generative model of observable variables $x \in \mathcal{X}$, where $\mathcal{X}$ is an $r$-dimensional manifold embedded in $\mathbb{R}^d$.
  • Denote the ground-truth probability measure on $\mathcal{X}$ as $\mu_{gt}$, where $\int_{\mathcal{X}} \mu_{gt}\, d\mathbf{x} = 1$.
  • The canonical VAE attempts to approximate this ground-truth measure using a parameterized density $p_{\theta}(x) = \int p_{\theta}(x | z)\, p(z)\, dz$, with $z \in \mathbb{R}^\kappa$, $\kappa \approx r$, and $p(z) = \mathcal{N}(z | 0, \mathbf{I})$.

We will consider the two situations $r = d$ and $r < d$ to illustrate the aforementioned non-uniqueness issues.

VAE Objective

  • In the vanilla VAE model, the objective to be optimized is the negative evidence lower bound (ELBO): $$\begin{align*} \mathcal{L}_{\theta, \phi}(x) & = -\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)] \\ & = \mathbb{KL}[q_{\phi}(z|x) || p(z)] + \mathbb{E}_{q_{\phi}(z|x)}[-\log p_{\theta}(x | z)] \end{align*}$$
  • Integrating over the ground-truth probability measure $\mu_{gt}$, the full objective can be rewritten as: $$\begin{align*} \mathcal{L}(\theta, \phi) & = \int_{\mathcal{X}}\{-\log p_{\theta}(x) + \mathbb{KL}[q_{\phi}(z|x) || p_{\theta}(z|x)]\}\, \mu_{gt}\, dx \geq \int_{\mathcal{X}} -\log p_{\theta}(x)\, \mu_{gt}\, dx \\ \mathcal{L}(\theta, \phi) & = \int_{\mathcal{X}} \{-\mathbb{E}_{q_{\phi}(z|x)} [\log p_{\theta}(x|z)] + \mathbb{KL}[q_{\phi}(z|x) || p(z)]\}\, \mu_{gt}\, dx \end{align*}$$
  • In principle, $q_{\phi}(z|x)$ and $p_{\theta}(x|z)$ can be arbitrary distributions. In practical implementations, a commonly adopted assumption is that both distributions are Gaussian, which was previously considered a limitation of VAEs; see the sketch below.
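The decomposition above maps directly to code. Below is a minimal sketch of the per-sample negative ELBO under the Gaussian assumptions, assuming a diagonal encoder covariance and a decoder covariance $\gamma \mathbf{I}$; the function and variable names are our own, not from the paper.

```python
import math
import torch

def neg_elbo(x, mu_z, logvar_z, x_recon, gamma):
    """Per-sample negative ELBO with q(z|x) = N(mu_z, diag(exp(logvar_z)))
    and p(x|z) = N(x_recon, gamma * I)."""
    # Closed-form KL[q(z|x) || N(0, I)] for diagonal Gaussians.
    kl = 0.5 * torch.sum(mu_z.pow(2) + logvar_z.exp() - logvar_z - 1.0)
    # -log p(x|z) for a single reparameterized sample z (x_recon = decoder(z)).
    d = x.numel()
    recon = 0.5 * (torch.sum((x - x_recon).pow(2)) / gamma
                   + d * math.log(2.0 * math.pi * gamma))
    return recon + kl

def sample_z(mu_z, logvar_z):
    # Reparameterization trick: z = mu_z + exp(0.5 * logvar_z) * eps, eps ~ N(0, I).
    return mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
```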

Diagnosing the Non-uniqueness

Idea: even with the stated Gaussian assumptions, there exist parameters $\theta, \phi$ that can simultaneously:

  1. Globally optimize the VAE objective
  2. Recover the ground-truth probability measure in a certain sense

Definition 1: A $\kappa$-simple VAE is defined as a VAE model with dim$[\mathbf{z}] = \kappa$ latent dimensions, a Gaussian encoder $q_{\phi}(z|x) = \mathcal{N}(z | \mu_z, \Sigma_z)$, and a Gaussian decoder $p_{\theta}(x|z) = \mathcal{N}(x | \mu_x, \Sigma_x)$, where the moments are parameterized functions of $x$ and $z$ respectively.
With these definitions in place, we can discuss how a $\kappa$-simple VAE with $\kappa \geq r$ can achieve the above optimality criteria, starting from the simpler case $r = d$ and then moving to the extended scenario $r < d$. A sketch of such a model appears below.
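To make Definition 1 concrete, here is a minimal sketch of a $\kappa$-simple VAE, assuming simple MLP encoder/decoder networks and a single learnable scalar $\gamma$ parameterizing the decoder covariance $\Sigma_x = \gamma \mathbf{I}$; the architecture sizes are illustrative choices, not prescribed by the paper.

```python
import torch
import torch.nn as nn

class KappaSimpleVAE(nn.Module):
    def __init__(self, d, kappa, hidden=256):
        super().__init__()
        # Encoder outputs the moments of q(z|x): mu_z and log of a diagonal Sigma_z.
        self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * kappa))
        # Decoder outputs mu_x; its covariance is the isotropic gamma * I below.
        self.decoder = nn.Sequential(nn.Linear(kappa, hidden), nn.ReLU(),
                                     nn.Linear(hidden, d))
        self.log_gamma = nn.Parameter(torch.zeros(()))  # learnable decoder variance

    def forward(self, x):
        mu_z, logvar_z = self.encoder(x).chunk(2, dim=-1)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
        return self.decoder(z), mu_z, logvar_z, self.log_gamma.exp()
```

Training would minimize the `neg_elbo` term from the earlier sketch over the data, with $\gamma$ either fixed or learned.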

When r=d

Assuming $p_{gt}(x) = \mu_{gt}(dx) / dx$ exists everywhere in $\mathbb{R}^d$, the minimal possible value of the objective is attained precisely when
$$\mathbb{KL}[q_{\phi}(z|x) || p_{\theta}(z|x)] = 0 \text{ and } p_{\theta}(x) = p_{gt}(x) \text{ almost everywhere}$$
Naturally we will conclude that

Theorem 2: Suppose that $r = d$ and there exists a density $p_{gt}(x)$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\mathbb{R}^d$. Then for any $\kappa \geq r$, there is a sequence of $\kappa$-simple VAE model parameters $\{\theta_t^\star, \phi_t^\star\}$ such that
$$\lim_{t\to\infty} \mathbb{KL}[q_{\phi_t^\star}(z|x) || p_{\theta_t^\star}(z|x)] = 0 \text{ and } \lim_{t\to\infty} p_{\theta_t^\star}(x) = p_{gt}(x) \text{ almost everywhere}$$
The theorem implies that as long as the latent dimension is sufficiently large (i.e., $\kappa \geq r$), the ground-truth probability measure can be recovered regardless of the Gaussian assumptions on the encoder and decoder, since recovering the ground-truth measure almost everywhere is a necessary condition for reaching the optimal objective value.

When r < d

  • When both $q_\phi(z|x)$ and $p_{\theta}(x|z)$ are arbitrary/unconstrained, i.e., without Gaussian assumptions, then $\inf_{\phi, \theta} \mathcal{L}(\theta, \phi) = -\infty$ by forcing $q_{\phi}(z|x) = p_{\theta}(z|x)$; see the numeric illustration after this list.
  • To show that this divergence is not necessarily harmful, define the manifold density $\tilde p_{gt}(x)$ as the probability density of $\mu_{gt}$ with respect to the volume measure of the manifold $\mathcal{X}$. If $d = r$, this volume measure reduces to the standard Lebesgue measure in $\mathbb{R}^d$ and $\tilde p_{gt}(x) = p_{gt}(x)$; the distinction matters because when $r < d$, a density $p_{gt}(x)$ with respect to Lebesgue measure may not exist everywhere in the ambient space.
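As a concrete (and entirely illustrative) numeric example of why the objective is unbounded below when $r < d$: a Gaussian that concentrates variance $\gamma$ in the $d - r$ off-manifold directions has a log-density at on-manifold points that grows without bound as $\gamma \to 0$.

```python
import math

d, r = 2, 1  # a 1-D manifold (e.g., a line) embedded in R^2
for gamma in [1.0, 1e-2, 1e-4, 1e-6]:
    # Gaussian with variance gamma in each of the d - r off-manifold
    # directions, evaluated at a point lying exactly on the manifold.
    log_p = -0.5 * (d - r) * math.log(2.0 * math.pi * gamma)
    print(f"gamma = {gamma:.0e}:  log p(x) = {log_p:+.2f}")
```

So $-\log p_{\theta}(x) \to -\infty$ on the manifold even though nothing pathological is happening: the probability mass is simply collapsing onto $\mathcal{X}$.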

Theorem 3: Assume $r < d$ and that there exists a manifold density $\tilde p_{gt}(x)$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\mathcal{X}$. Then for any $\kappa \geq r$, there is a sequence of $\kappa$-simple VAE model parameters $\{\theta_t^\star, \phi_t^\star\}$ such that

  • $\lim_{t\to\infty} \mathbb{KL}[q_{\phi_t^\star}(z|x) || p_{\theta_t^\star}(z|x)] = 0$ and $\lim_{t \to \infty} \int_{\mathcal{X}} -\log p_{\theta_t^\star}(x)\, \mu_{gt}\, dx = -\infty$
  • $\lim_{t\to\infty} \int_{A} p_{\theta_t^\star}(x)\, dx = \mu_{gt}(A \cap \mathcal{X})$ for all measurable sets $A \subseteq \mathbb{R}^d$ with $\mu_{gt}(\partial A \cap \mathcal{X}) = 0$, where $\partial A$ is the boundary of $A$.

Implications of this theorem:

  • From (1), the VAE Gaussian assumptions do not prevent the minimum of $\mathcal{L}(\theta, \phi)$ from diverging to $-\infty$.
  • From (2), there exist solutions that assign probability mass to measurable subsets of $\mathbb{R}^d$ in a way that is indistinguishable from the ground-truth measure.
  • In the $r = d$ situation, the theorem necessitates that the ground-truth probability measure is recovered almost everywhere.
  • In the $r < d$ situation, we have not ruled out the possibility that a different set of parameters $\{\theta, \phi\}$ can push the loss to $-\infty$ without achieving (2), i.e., the VAE can reach the lower bound of the negative log-likelihood yet fail to closely approximate $\mu_{gt}$.

Optimal Solutions

Necessary conditions for the VAE optimum are established by the following theorems.

Theorem 4: Let $\{\theta^\star_\gamma, \phi_\gamma^\star\}$ denote an optimal $\kappa$-simple VAE solution (with $\kappa \geq r$) where the decoder variance $\gamma$ is fixed. Moreover, assume that $\mu_{gt}$ is not a Gaussian distribution when $d = r$. Then for any $\gamma > 0$, there exists a $\gamma' < \gamma$ such that $\mathcal{L}(\theta_{\gamma'}^\star, \phi_{\gamma'}^\star) < \mathcal{L}(\theta_{\gamma}^\star, \phi_{\gamma}^\star)$.

The theorem implies that if $\gamma$ is not constrained, we must have $\gamma \to 0$ to minimize the VAE objective. In contrast, in existing practical VAE applications it is standard to fix $\gamma \approx 1$ under the standard Gaussian assumptions during training. The sketch below illustrates the effect.
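A quick way to see Theorem 4's implication (our own illustration, not from the paper): for a Gaussian decoder, the data term given total squared reconstruction error $\mathrm{err}$ is $\frac{d}{2}\log(2\pi\gamma) + \frac{\mathrm{err}}{2\gamma}$, which is minimized at $\gamma = \mathrm{err}/d$ and keeps decreasing as the achievable reconstruction error shrinks.

```python
import math

d = 784  # ambient dimension, e.g., flattened MNIST
for err in [100.0, 1.0, 0.01]:
    gamma_opt = err / d  # argmin over gamma of the Gaussian data term
    data_term = 0.5 * d * math.log(2.0 * math.pi * gamma_opt) + err / (2.0 * gamma_opt)
    print(f"err = {err:7.2f}  ->  gamma* = {gamma_opt:.2e},  data term = {data_term:9.2f}")
```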

Theorem 5: Under the same conditions and definitions as in Theorem 4, for all $x$ drawn from $\mu_{gt}$ we also have
$$\lim_{\gamma \to 0} f_{\mu_x} [f_{\mu_z}(x; \phi_{\gamma}^\star) + f_{S_z}(x; \phi_{\gamma}^\star)\, \epsilon;\, \theta_\gamma^\star] = \lim_{\gamma \to 0} f_{\mu_x}[f_{\mu_z}(x;\phi_\gamma^\star);\, \theta_\gamma^\star] = x, \quad \forall \epsilon \in \mathbb{R}^\kappa$$
where $f_{\mu_z}$ and $f_{S_z}$ are the encoder mean and covariance square-root functions and $f_{\mu_x}$ is the decoder mean function.

  • This theorem indicates that any $\mathbf{x} \in \mathcal{X}$ will be perfectly reconstructed by the VAE model at globally optimal solutions.
  • Adding extra latent dimensions beyond $r$ cannot improve the value of the VAE data term in any meaningful way. During training, $r$ eigenvalues of the encoder covariance $\Sigma_z$ are likely to converge to 0 and the remaining $\kappa - r$ to converge to one. This demonstrates that the VAE has the ability to detect the manifold dimension and select the proper number of latent dimensions in practical environments; see the sketch after this list.
  • If the VAE model parameters have learned a near-optimal mapping onto $\mathcal{X}$ using $\gamma \approx 0$, then the VAE cost will scale as $(d - r) \log \gamma$ regardless of $\mu_{gt}$.
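This eigenvalue behavior suggests a simple diagnostic for reading the estimated manifold dimension off a trained first-stage VAE. A sketch, assuming a batch of encoder log-variances and a threshold of our own choosing:

```python
import torch

def estimate_active_dims(logvar_z_batch, threshold=0.5):
    """Count latent dimensions whose average posterior variance stays
    well below 1 (informative) rather than matching the prior (superfluous)."""
    mean_var = logvar_z_batch.exp().mean(dim=0)  # shape: (kappa,)
    return int((mean_var < threshold).sum().item())
```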

Two-Stage VAE Model

The above analysis suggests the following two-stage remedy:

  1. Given $n$ observed samples $\{x^{(i)}\}^n_{i=1}$, train a $\kappa$-simple VAE, with $\kappa \geq r$, to estimate the unknown $r$-dimensional ground-truth manifold $\mathcal{X}$ embedded in $\mathbb{R}^d$ using a minimal number of active latent dimensions. Generate latent samples $\{z^{(i)}\}^n_{i=1}$ via $z^{(i)} \sim q_{\phi}(z|x^{(i)})$.
  2. Train a second $\kappa$-simple VAE, with independent parameters $\{\theta', \phi'\}$ and latent representation $u$, treating the unknown distribution $q_\phi(z)$ as the new ground-truth distribution and using the samples $\{z^{(i)}\}^n_{i=1}$ to learn it.
  3. Samples approximating the original ground truth $\mu_{gt}$ can then be formed via the extended ancestral process $u \sim \mathcal{N}(u | 0, \mathbf{I})$, $z \sim p_{\theta'}(z | u)$, $x \sim p_{\theta}(x|z)$; see the sketch below.
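A minimal sketch of this sampling pipeline, assuming `vae1` and `vae2` are the trained first- and second-stage models from the `KappaSimpleVAE` sketch above; for simplicity we take the decoder means rather than adding the $\gamma \mathbf{I}$ decoder noise.

```python
import torch

@torch.no_grad()
def sample_two_stage(vae1, vae2, n, kappa):
    # Note: the second-stage VAE operates in the latent space, so its
    # ambient dimension equals kappa (i.e., vae2 = KappaSimpleVAE(kappa, kappa)).
    u = torch.randn(n, kappa)   # u ~ N(0, I)
    z = vae2.decoder(u)         # mean of p_{theta'}(z|u)
    x = vae1.decoder(z)         # mean of p_theta(x|z)
    return x
```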

[Figure: the structure of the first stage of the Two-Stage VAE model]

Analysis:

  • If the first stage was successful, then samples from $q_\phi(z)$ will have nonzero measure across the full ambient space $\mathbb{R}^\kappa$, even though they will not generally resemble $\mathcal{N}(z|0, \mathbf{I})$.
  • If $\kappa > r$, then the extra latent dimensions will be naturally filled in via randomness.
  • Consequently, as long as we set $\kappa \geq r$, the operational regime of the second-stage VAE is effectively equivalent to the situation where the manifold dimension equals the ambient dimension, so reaching a globally optimal solution recovers the ground-truth probability measure almost everywhere.

Experiment Results

The following table reports the results of experiments conducted on four significantly different datasets: MNIST, Fashion-MNIST, CIFAR-10, and CelebA. The evaluation metric is the Fréchet Inception Distance (FID) [3], which assesses the quality of images created by a generative model by comparing the distribution of generated images with the distribution of real images.
[Table: FID scores on MNIST, Fashion-MNIST, CIFAR-10, and CelebA]
Note: the two stages need to be trained separately; concatenating the two stages and training them jointly does not improve performance.

Another set of experiments was conducted on the same datasets with a different evaluation metric. The Kernel Inception Distance (KID) [4] applies a polynomial-kernel Maximum Mean Discrepancy (MMD) measure to estimate the inception distance, since the FID score is believed to exhibit bias in certain circumstances. A sketch of this kernel MMD appears below.
[Table: KID scores on MNIST, Fashion-MNIST, CIFAR-10, and CelebA]
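For reference, a minimal sketch of the polynomial-kernel MMD underlying KID, assuming `f_real` and `f_fake` are matrices of Inception features; the cubic kernel $k(a, b) = (a^\top b / \mathrm{dim} + 1)^3$ and the unbiased estimator follow the KID paper [4].

```python
import numpy as np

def polynomial_mmd2(f_real, f_fake):
    """Unbiased squared MMD with the cubic polynomial kernel used by KID."""
    dim = f_real.shape[1]
    k = lambda a, b: (a @ b.T / dim + 1.0) ** 3
    k_rr, k_ff, k_rf = k(f_real, f_real), k(f_fake, f_fake), k(f_real, f_fake)
    m, n = len(f_real), len(f_fake)
    # Exclude the diagonal terms of the within-set kernel matrices.
    return ((k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
            + (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
            - 2.0 * k_rf.mean())
```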

Analysis of the Results:

  • The second stage of the Two-Stage VAE reduces the gap between $q(z)$ and $p(z)$, resulting in better manifold reconstruction.
  • $\gamma$ converges to zero at any global minimum of the VAE objective, allowing for tighter image reconstructions with a better manifold fit.


Contributions and Conclusions

  1. This paper rigorously proves that the VAE global optimum can in fact learn a mapping to the correct ground-truth manifold when $r < d$, but not necessarily the correct probability measure within this manifold.
  2. The proposed Two-Stage VAE model resolves this issue, better recovering the ground-truth measure by reducing the gap between $q_\phi(z)$ and $p(z)$. This is the first demonstration of a VAE pipeline that can produce stable FID scores comparable to at least some popular GAN models under neutral testing conditions.
  3. The two-stage mechanism improves the reconstruction of the original distribution so that it has comparable performance with GAN models. This work narrows the gap between VAEs and GANs in terms of the realism of generated samples, making VAEs worth considering in a broader range of applications.
  4. The Gaussian assumptions of the canonical VAE model are not an obstacle to achieving the optimal solutions.

References


  1. Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

  2. Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

  3. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

  4. Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv:1801.01401, 2018.