Diagnosing and Enhancing VAE Models
Minhao Jiang (minhaoj2@illinois.edu)
Diagnosing and Enhancing VAE Models (ICLR '19) [1]
Introduction
Even though variational autoencoders (VAEs) [2] have a wide variety of applications in deep generative modeling, many aspects of the underlying energy function remain poorly understood. It is commonly believed that Gaussian encoder/decoder assumptions reduce the effectiveness of VAEs in generating realistic samples.
In this paper, the authors rigorously show that reaching the global optimum does not guarantee that the VAE learns the true data distribution, i.e., there can exist alternative solutions that reach the global optimum and yet do not assign the same probability measure as the ground-truth distribution. The paper also proposes a two-stage remedy, the Two-Stage VAE, to address this issue and enhance the original VAE so that any globally minimizing solution is uniquely matched to the ground-truth distribution.
Problem Definition
- The starting point is the desire to learn a probabilistic generative model of observable variables $x \in \chi$, where $\chi$ is an $r$-dimensional manifold embedded in $\mathbb{R}^d$.
- Denote a ground-truth probability measure on $\chi$ as $\mu_{gt}$, where $\int_\chi d\mu_{gt}(x) = 1$.
- The canonical VAE attempts to approximate this ground-truth measure using a parameterized density $p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$ defined over all of $\mathbb{R}^d$, with prior $p(z) = \mathcal{N}(z; 0, I)$ and latent variable $z \in \mathbb{R}^\kappa$.
We will consider the two situations $r = d$ and $r < d$ to illustrate the aforementioned non-uniqueness issues.
VAE Objective
- In the vanilla VAE model, we normally write the objective function to be minimized as the (negative) evidence lower bound (ELBO):
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + \mathrm{KL}\left[q_\phi(z|x)\,\|\,p(z)\right] \;\geq\; -\log p_\theta(x).$$
- Based on the ground-truth probability measure $\mu_{gt}$, we can rewrite the aggregate objective as:
$$\mathcal{L}(\theta, \phi) = \int_\chi \left\{ \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + \mathrm{KL}\left[q_\phi(z|x)\,\|\,p(z)\right] \right\} d\mu_{gt}(x).$$
- In principle, $q_\phi(z|x)$ and $p_\theta(x|z)$ can be arbitrary distributions. In practical implementations, a commonly adopted assumption is that both distributions are Gaussian, which was previously considered a limitation of VAEs.
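Under these Gaussian assumptions, both terms of the objective have closed forms. Below is a minimal PyTorch sketch (the function and variable names are our own illustrative choices, not the authors' code) of the per-sample negative ELBO with a diagonal-covariance encoder and an isotropic Gaussian decoder with scalar variance gamma, as formalized in Definition 1 below:

```python
import math
import torch

def negative_elbo(x, mu_x, mu_z, logvar_z, log_gamma):
    """Per-sample negative ELBO under Gaussian encoder/decoder assumptions.

    x        : (batch, d) observed data
    mu_x     : (batch, d) decoder mean, computed from z ~ q(z|x)
    mu_z     : (batch, k) encoder mean
    logvar_z : (batch, k) encoder log-variances (diagonal Sigma_z)
    log_gamma: scalar tensor, log of the decoder variance gamma
    """
    d = x.shape[1]
    # -log N(x; mu_x, gamma * I): squared error scaled by gamma, plus normalizer
    recon = 0.5 * (((x - mu_x) ** 2).sum(dim=1) / log_gamma.exp()
                   + d * (log_gamma + math.log(2 * math.pi)))
    # KL[ N(mu_z, diag(exp(logvar_z))) || N(0, I) ] in closed form
    kl = 0.5 * (logvar_z.exp() + mu_z ** 2 - 1.0 - logvar_z).sum(dim=1)
    return recon + kl
```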
Diagnosing the Non-uniqueness
Ideas: Even with the stated Gaussian assumptions, there exist parameters $\{\theta^*, \phi^*\}$ that can simultaneously:
- Globally optimize the VAE objective
- Recover the ground-truth probability measure in a certain sense
Definition 1: A $\kappa$-simple VAE is defined as a VAE model with $\dim[z] = \kappa$ latent dimensions, the Gaussian encoder $q_\phi(z|x) = \mathcal{N}(z;\, \mu_z(x; \phi),\, \Sigma_z(x; \phi))$, and the Gaussian decoder $p_\theta(x|z) = \mathcal{N}(x;\, \mu_x(z; \theta),\, \gamma I)$, where $\gamma > 0$ is a scalar.
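A $\kappa$-simple VAE is straightforward to express in code. The following PyTorch sketch (the MLP architecture and names are our own assumptions, not the paper's) pairs with the `negative_elbo` function above; note the single learnable scalar `log_gamma` implementing the $\gamma I$ decoder covariance:

```python
import torch
import torch.nn as nn

class KappaSimpleVAE(nn.Module):
    """Gaussian encoder/decoder VAE with kappa latent dimensions and a
    single learnable scalar decoder variance gamma (stored as log gamma)."""

    def __init__(self, d, kappa, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * kappa))  # -> mu_z, logvar_z
        self.dec = nn.Sequential(nn.Linear(kappa, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d))          # -> mu_x
        self.log_gamma = nn.Parameter(torch.zeros(()))          # gamma starts at 1

    def forward(self, x):
        mu_z, logvar_z = self.enc(x).chunk(2, dim=1)
        # reparameterization trick: z = mu_z + sigma_z * eps
        z = mu_z + (0.5 * logvar_z).exp() * torch.randn_like(mu_z)
        return self.dec(z), mu_z, logvar_z, self.log_gamma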
With these definitions in place, we can now discuss how a $\kappa$-simple VAE with $\kappa \geq r$ can achieve the above optimality criteria, starting from the simpler case where $r = d$ and then moving to the extended scenario with $r < d$.
When r = d
Assuming a ground-truth density $p_{gt}(x)$ exists everywhere in $\mathbb{R}^d$, the minimal possible value of the expected negative log-likelihood $\int_\chi -\log p_\theta(x)\, d\mu_{gt}(x)$ will necessarily occur if $p_\theta(x) = p_{gt}(x)$ almost everywhere.
Naturally, we conclude that
$$\inf_{\theta, \phi}\, \mathcal{L}(\theta, \phi) \;\geq\; \int_\chi -\log p_{gt}(x)\, d\mu_{gt}(x).$$
Theorem 2: Suppose that $r = d$ and there exists a density $p_{gt}(x)$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\mathbb{R}^d$. Then for any $\kappa \geq r$, there is a sequence of $\kappa$-simple VAE model parameters $\{\theta_t^*, \phi_t^*\}$ such that
$$\lim_{t \to \infty} \mathcal{L}(\theta_t^*, \phi_t^*) = \int_\chi -\log p_{gt}(x)\, d\mu_{gt}(x).$$
The theorem implies that as long as the latent dimension is sufficiently large (i.e., $\kappa \geq r$), the ground-truth probability measure can be recovered regardless of the Gaussian assumptions on the encoder and decoder, since recovering the ground-truth probability measure almost everywhere is a necessary condition for achieving the optimal objective value.
When r < d
- When both $q_\phi(z|x)$ and $p_\theta(x|z)$ are arbitrary/unconstrained, i.e., without Gaussian assumptions, the objective $\mathcal{L}(\theta, \phi)$ can be pushed to $-\infty$ by forcing $p_\theta(x)$ to concentrate its mass ever more tightly around the low-dimensional manifold $\chi$; this divergence by itself says nothing about whether the correct measure on $\chi$ has been learned.
- To show that this degeneracy need not prevent learning the correct measure, define a manifold density $p_{gt}(x)$ as the probability density of $\mu_{gt}$ with respect to the volume measure of the manifold $\chi$. If $r = d$, this volume measure reduces to the standard Lebesgue measure in $\mathbb{R}^d$; but when $r < d$, a density defined everywhere in the ambient space $\mathbb{R}^d$ may not exist.
Theorem 3: Assume $r < d$ and that there exists a manifold density $p_{gt}(x)$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\chi$. Then for any $\kappa \geq r$, there is a sequence of $\kappa$-simple VAE model parameters $\{\theta_t^*, \phi_t^*\}$ such that
(1) $\lim_{t \to \infty} \mathcal{L}(\theta_t^*, \phi_t^*) = -\infty$, and
(2) $\lim_{t \to \infty} \int_A p_{\theta_t^*}(x)\, dx = \mu_{gt}(A)$ for all measurable sets $A$ with $\mu_{gt}(\partial A) = 0$, where $\partial A$ is the boundary of $A$.
Implications of this theorem:
- From (1), the VAE Gaussian assumptions do not prevent minimization of $\mathcal{L}(\theta, \phi)$ from converging to minus infinity.
- From (2), there exist such solutions that assign a probability measure to almost all measurable subsets of $\mathbb{R}^d$ that is indistinguishable from the ground-truth measure.
- In the $r = d$ situation, achieving the optimal objective value necessitates that the ground-truth probability measure has been recovered almost everywhere.
- In the $r < d$ situation, we have not ruled out the possibility that a different sequence of parameters could push the loss to $-\infty$ without achieving (2), i.e., the VAE can reach the lower bound of the negative log-likelihood and yet fail to closely approximate $\mu_{gt}$.
Optimal Solutions
Necessary conditions for optimal VAE solutions follow from the theorems below.
Theorem 4: Let $\{\theta_\gamma^*, \phi_\gamma^*\}$ denote an optimal $\kappa$-simple VAE solution (with $\kappa \geq r$) where the decoder variance $\gamma$ is fixed. Moreover, assume that $\mu_{gt}$ is not a Gaussian distribution when $r = d$. Then for any $\gamma > 0$, there exists a $\gamma' < \gamma$ such that
$$\mathcal{L}(\theta_{\gamma'}^*, \phi_{\gamma'}^*) < \mathcal{L}(\theta_\gamma^*, \phi_\gamma^*).$$
The theorem implies that if $\gamma$ is not constrained, it must be that $\gamma \to 0$ if we wish to minimize the VAE objective. In contrast, in existing practical VAE applications it is standard to fix $\gamma = 1$ as part of the standard Gaussian assumptions during training.
Theorem 5: Applying the same conditions and definitions as in Theorem 4, for all $x$ drawn from $\mu_{gt}$ we also have that
$$\lim_{\gamma \to 0} \mu_x\!\left(\mu_z(x; \phi_\gamma^*);\, \theta_\gamma^*\right) = x.$$
- This theorem indicates that any $x \in \chi$ will be perfectly reconstructed by the VAE model at globally optimal solutions.
- Adding further latent dimensions cannot improve the value of the VAE data term in any meaningful way. During training, $r$ eigenvalues of the encoder covariance $\Sigma_z$ are likely to converge to zero while the remaining $\kappa - r$ converge to one. This demonstrates that the VAE has the ability to detect the manifold dimension and select the proper number of latent dimensions in practical environments.
- If the VAE model parameters have learned a near-optimal mapping onto $\chi$ using $r$ active latent dimensions, then the VAE cost will scale as $\log \gamma$, diverging toward $-\infty$ as $\gamma \to 0$, regardless of $\kappa$.
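The dimension-selection behavior described above suggests a simple diagnostic. The sketch below (our own illustrative helper, reusing the `KappaSimpleVAE` interface from earlier) counts latent dimensions whose average posterior variance stays well below one; under the eigenvalue behavior above, this count estimates the manifold dimension $r$:

```python
import torch

@torch.no_grad()
def count_active_dims(model, x, tol=0.5):
    """Estimate the manifold dimension r by counting latent dimensions whose
    average posterior variance is well below 1. Superfluous dimensions keep
    variance ~1, mimicking the N(0, I) prior, and carry no information."""
    _, mu_z, logvar_z, _ = model(x)           # encode a batch of data
    avg_var = logvar_z.exp().mean(dim=0)      # per-dimension posterior variance
    return int((avg_var < tol).sum().item())
```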
Two-Stage VAE Model
The above analysis suggests the following two-stage remedy:
- Given observed samples $\{x^{(i)}\}_{i=1}^n$, train a $\kappa$-simple VAE, with $\kappa \geq r$, to estimate the unknown $r$-dimensional ground-truth manifold $\chi$ embedded in $\mathbb{R}^d$ using a minimal number of active latent dimensions. Generate latent samples $\{z^{(i)}\}_{i=1}^n$ via $z^{(i)} \sim q_\phi(z|x^{(i)})$.
- Train a second $\kappa$-simple VAE, with independent parameters $\{\theta', \phi'\}$ and latent representation $u$, to learn the unknown aggregated posterior $q_\phi(z)$, treating it as a new ground-truth distribution and using the samples $\{z^{(i)}\}$ to learn it.
- Samples approximating the original ground-truth measure $\mu_{gt}$ can then be formed via the extended ancestral process: $u \sim \mathcal{N}(u; 0, I)$, $z \sim p_{\theta'}(z|u)$, and finally $x \sim p_\theta(x|z)$.
(Figure: the structure of the first stage of the Two-Stage VAE model.)
Analysis:
- If the first stage was successful, then even though samples from $q_\phi(z)$ will not generally resemble the prior $\mathcal{N}(z; 0, I)$, they will have nonzero measure across the full latent space $\mathbb{R}^\kappa$.
- If $\kappa > r$, the extra latent dimensions will be naturally filled in via the randomness injected by the encoder noise.
- Consequently, as long as the second-stage VAE is given enough latent dimensions (i.e., $\dim[u] \geq \kappa$), its operational regime is effectively equivalent to the situation where the manifold dimension equals the ambient dimension, and reaching a globally optimal solution recovers its ground-truth distribution $q_\phi(z)$ almost everywhere. A minimal sketch of the full pipeline follows.
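Here is a minimal sketch of the two-stage pipeline, reusing `KappaSimpleVAE` and `negative_elbo` from the sketches above. The `fit` training loop, the full-batch treatment, and all hyperparameters are illustrative assumptions, not the authors' implementation:

```python
import torch

def fit(model, data, epochs=100, lr=1e-3):
    """Bare-bones full-batch ELBO training loop (illustrative only)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        mu_x, mu_z, logvar_z, log_gamma = model(data)
        loss = negative_elbo(data, mu_x, mu_z, logvar_z, log_gamma).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def train_two_stage(x_data, d, kappa):
    # Stage 1: learn the manifold, then encode the data into latents z.
    vae1 = KappaSimpleVAE(d, kappa)
    fit(vae1, x_data)
    with torch.no_grad():
        _, mu_z, logvar_z, _ = vae1(x_data)
        z_data = mu_z + (0.5 * logvar_z).exp() * torch.randn_like(mu_z)
    # Stage 2: treat z ~ q(z) as the new ground truth; now r = d = kappa.
    vae2 = KappaSimpleVAE(kappa, kappa)
    fit(vae2, z_data)
    return vae1, vae2

@torch.no_grad()
def sample(vae1, vae2, n, kappa):
    """Extended ancestral process: u ~ N(0,I), z ~ p(z|u), x ~ p(x|z)."""
    u = torch.randn(n, kappa)
    z = vae2.dec(u) + vae2.log_gamma.exp().sqrt() * torch.randn(n, kappa)
    return vae1.dec(z)  # decoder mean of p(x|z); noise can be added similarly
```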
Experimental Results
Experiments were conducted on four significantly different datasets: MNIST, Fashion-MNIST, CIFAR-10, and CelebA. The evaluation metric is the Fréchet Inception Distance (FID) [3], which assesses the quality of images created by a generative model by comparing the distribution of generated images with the distribution of real images. (Table: FID score comparison across models and datasets.)
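For reference, FID fits a Gaussian to the Inception features of each image set and computes the Fréchet distance between the two Gaussians. A minimal numpy/scipy sketch (feature extraction omitted; inputs are assumed to be precomputed Inception feature statistics):

```python
import numpy as np
from scipy import linalg

def fid(mu_r, cov_r, mu_g, cov_g):
    """FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)),
    where (mu_r, C_r) and (mu_g, C_g) are the mean and covariance of
    Inception features for real and generated images respectively."""
    covmean = linalg.sqrtm(cov_r @ cov_g)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(cov_r + cov_g - 2.0 * covmean))
```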
Note: The two stages need to be trained separately. Concatenating the two stages and training them jointly does not improve performance.
Another set of experiments was conducted on the same datasets with a different evaluation metric. The Kernel Inception Distance (KID) [4] applies a polynomial-kernel Maximum Mean Discrepancy (MMD) measure to Inception features, since the FID score is believed to exhibit bias in certain circumstances.
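A minimal numpy sketch of the unbiased MMD² estimator with the polynomial kernel $k(x, y) = (x^\top y / \dim + 1)^3$ used by KID (in practice the score is typically averaged over many random feature subsets):

```python
import numpy as np

def kid(feat_r, feat_g):
    """Unbiased MMD^2 with kernel k(x, y) = (x.y / d + 1)^3 between
    Inception features of real (feat_r) and generated (feat_g) images."""
    d = feat_r.shape[1]
    k_rr = (feat_r @ feat_r.T / d + 1.0) ** 3
    k_gg = (feat_g @ feat_g.T / d + 1.0) ** 3
    k_rg = (feat_r @ feat_g.T / d + 1.0) ** 3
    m, n = len(feat_r), len(feat_g)
    # exclude diagonal (self-similarity) terms for the unbiased estimate
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())
```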
Analysis of the Results:
- The second stage of the Two-Stage VAE reduces the gap between the aggregated posterior $q_\phi(z)$ and the prior $p(z) = \mathcal{N}(z; 0, I)$, resulting in samples that better match the ground-truth measure on the learned manifold.
- $\gamma$ will converge to zero at any global minimum of the VAE objective, allowing for tighter image reconstructions with a better manifold fit.
Contributions and Conclusions
- This paper rigorously proved that the VAE global optimum can in fact uniquely learn a mapping to the correct ground-truth manifold when $r < d$, but not necessarily the correct probability measure within this manifold.
- The proposed Two-Stage VAE model resolves this issue, better recovering the ground-truth measure by reducing the gap between $q_\phi(z)$ and $p(z)$. This is the first demonstration of a VAE pipeline that can produce stable FID scores comparable to at least some popular GAN models under neutral testing conditions.
- The two-stage mechanism improves the recovery of the original distribution to the point of comparable performance with GAN models. This work narrows the gap between VAE and GAN models in terms of the realism of generated samples, making VAEs worth considering in a broader range of applications.
- No Gaussian assumption in the canonical VAE model needs to be relaxed to achieve optimal solutions; the Gaussian encoder/decoder is not the culprit.
References
[1] Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, May 6-9, 2019.
[2] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[3] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[4] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv:1801.01401, 2018.