Diagnosing and Enhancing VAE Models (ICLR '19) [1]

Introduction

Even though variational autoencoders (VAEs) [2] have a wide variety of applications in deep generative modeling, many aspects of the underlying energy function remain poorly understood. It is commonly believed that the Gaussian encoder/decoder assumptions reduce the effectiveness of VAEs in generating realistic samples.

In this paper, the authors rigorously show that reaching the global optimum of the VAE objective does not guarantee that the model has learned the true data distribution: there can exist alternative solutions that reach the global optimum and yet do not assign the same probability measure as the ground-truth distribution. The paper also proposes a two-stage remedy, a two-stage VAE model, to address this issue and enhance the original VAE so that any globally minimizing solution is uniquely matched to the ground-truth distribution.

Problem Definition:

  • The starting point is the desire to learn a probabilistic generative model of observable variables $x \in \mathcal{X}$, where $\mathcal{X}$ is an $r$-dimensional manifold embedded in $\mathbb{R}^d$.
  • Denote the ground-truth probability measure on $\mathcal{X}$ as $\mu_{gt}$, where $\int_{\mathcal{X}} \mu_{gt}\, d\mathbf{x} = 1$.
  • The canonical VAE attempts to approximate this ground-truth measure using a parameterized density $p_{\theta}(x) = \int p_{\theta}(x | z)\, p(z)\, dz$, with $z \in \mathbb{R}^\kappa$, $\kappa \approx r$, and $p(z) = \mathcal{N}(z | 0, \mathbf{I})$.

We will consider the two situations $r = d$ and $r < d$ to illustrate the aforementioned non-uniqueness issues.

VAE Objective

  • In the vanilla VAE model, the objective to be optimized is the negative evidence lower bound (ELBO): $$\begin{align*} \mathcal{L}_{\theta, \phi}(x) & = -\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)] \\ & = \mathbb{KL}[q_{\phi}(z|x) || p(z)] + \mathbb{E}_{q_{\phi}(z|x)}[-\log p_{\theta}(x | z)] \end{align*}$$
  • Integrating over the ground-truth probability measure $\mu_{gt}$, the full objective can be rewritten as: $$\begin{align*} \mathcal{L}(\theta, \phi) & = \int_{\mathcal{X}}\{-\log p_{\theta}(x) + \mathbb{KL}[q_{\phi}(z|x) || p_{\theta}(z|x)]\}\, \mu_{gt}\, dx \geq \int_{\mathcal{X}} -\log p_{\theta}(x)\, \mu_{gt}\, dx \\ \mathcal{L}(\theta, \phi) & = \int_{\mathcal{X}} \{-\mathbb{E}_{q_{\phi}(z|x)} [\log p_{\theta}(x|z)] + \mathbb{KL}[q_{\phi}(z|x) || p(z)]\}\, \mu_{gt}\, dx \end{align*}$$
  • In principle, $q_{\phi}(z|x)$ and $p_{\theta}(x|z)$ can be arbitrary distributions. In practical implementations, a commonly adopted assumption is that both distributions are Gaussian, which was previously considered a limitation of VAEs; see the sketch below.
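The decomposition above maps directly to code. Below is a minimal sketch of the per-sample negative ELBO under the Gaussian assumptions, assuming a diagonal encoder covariance and a decoder covariance $\gamma \mathbf{I}$; the function and variable names are our own, not from the paper.

```python
import math
import torch

def neg_elbo(x, mu_z, logvar_z, x_recon, gamma):
    """Per-sample negative ELBO with q(z|x) = N(mu_z, diag(exp(logvar_z)))
    and p(x|z) = N(x_recon, gamma * I)."""
    # Closed-form KL[q(z|x) || N(0, I)] for diagonal Gaussians.
    kl = 0.5 * torch.sum(mu_z.pow(2) + logvar_z.exp() - logvar_z - 1.0)
    # -log p(x|z) for a single reparameterized sample z (x_recon = decoder(z)).
    d = x.numel()
    recon = 0.5 * (torch.sum((x - x_recon).pow(2)) / gamma
                   + d * math.log(2.0 * math.pi * gamma))
    return recon + kl

def sample_z(mu_z, logvar_z):
    # Reparameterization trick: z = mu_z + exp(0.5 * logvar_z) * eps, eps ~ N(0, I).
    return mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
```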

Diagnosing the Non-uniqueness

Idea: even with the stated Gaussian assumptions, there exist parameters $\theta, \phi$ that can simultaneously:

  1. Globally optimize the VAE objective
  2. Recover the ground-truth probability measure in a certain sense

Definition 1: A $\kappa$-simple VAE is defined as a VAE model with dim$[\mathbf{z}] = \kappa$ latent dimensions, a Gaussian encoder $q_{\phi}(z|x) = \mathcal{N}(z | \mu_z, \Sigma_z)$, and a Gaussian decoder $p_{\theta}(x|z) = \mathcal{N}(x | \mu_x, \Sigma_x)$, where the moments are parameterized functions of $x$ and $z$ respectively.
With these definitions in place, we can discuss how a $\kappa$-simple VAE with $\kappa \geq r$ can achieve the above optimality criteria, starting from the simpler case $r = d$ and then moving to the extended scenario $r < d$. A sketch of such a model appears below.
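To make Definition 1 concrete, here is a minimal sketch of a $\kappa$-simple VAE, assuming simple MLP encoder/decoder networks and a single learnable scalar $\gamma$ parameterizing the decoder covariance $\Sigma_x = \gamma \mathbf{I}$; the architecture sizes are illustrative choices, not prescribed by the paper.

```python
import torch
import torch.nn as nn

class KappaSimpleVAE(nn.Module):
    def __init__(self, d, kappa, hidden=256):
        super().__init__()
        # Encoder outputs the moments of q(z|x): mu_z and log of a diagonal Sigma_z.
        self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * kappa))
        # Decoder outputs mu_x; its covariance is the isotropic gamma * I below.
        self.decoder = nn.Sequential(nn.Linear(kappa, hidden), nn.ReLU(),
                                     nn.Linear(hidden, d))
        self.log_gamma = nn.Parameter(torch.zeros(()))  # learnable decoder variance

    def forward(self, x):
        mu_z, logvar_z = self.encoder(x).chunk(2, dim=-1)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
        return self.decoder(z), mu_z, logvar_z, self.log_gamma.exp()
```

Training would minimize the `neg_elbo` term from the earlier sketch over the data, with $\gamma$ either fixed or learned.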

When r=d

Assuming $p_{gt}(x) = \mu_{gt}(dx) / dx$ exists everywhere in $\mathbb{R}^d$, the minimal possible value of the objective is attained precisely when
$$\mathbb{KL}[q_{\phi}(z|x) || p_{\theta}(z|x)] = 0 \text{ and } p_{\theta}(x) = p_{gt}(x) \text{ almost everywhere}$$
Naturally we will conclude that

Theorem 2: Suppose that $r = d$ and there exists a density $p_{gt}(x)$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\mathbb{R}^d$. Then for any $\kappa \geq r$, there is a sequence of $\kappa$-simple VAE model parameters $\{\theta_t^\star, \phi_t^\star\}$ such that
$$\lim_{t\to\infty} \mathbb{KL}[q_{\phi_t^\star}(z|x) || p_{\theta_t^\star}(z|x)] = 0 \text{ and } \lim_{t\to\infty} p_{\theta_t^\star}(x) = p_{gt}(x) \text{ almost everywhere}$$
The theorem implies that as long as the latent dimension is sufficiently large (i.e., $\kappa \geq r$), the ground-truth probability measure can be recovered regardless of the Gaussian assumptions on the encoder and decoder, since recovering the ground-truth measure almost everywhere is a necessary condition for reaching the optimal objective value.

When r < d

  • When both $q_\phi(z|x)$ and $p_{\theta}(x|z)$ are arbitrary/unconstrained, i.e., without Gaussian assumptions, then $\inf_{\phi, \theta} \mathcal{L}(\theta, \phi) = -\infty$ by forcing $q_{\phi}(z|x) = p_{\theta}(z|x)$; see the numeric illustration after this list.
  • To show that this divergence is not necessarily harmful, define the manifold density $\tilde p_{gt}(x)$ as the probability density of $\mu_{gt}$ with respect to the volume measure of the manifold $\mathcal{X}$. If $d = r$, this volume measure reduces to the standard Lebesgue measure in $\mathbb{R}^d$ and $\tilde p_{gt}(x) = p_{gt}(x)$; the distinction matters because when $r < d$, a density $p_{gt}(x)$ with respect to Lebesgue measure may not exist everywhere in the ambient space.
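As a concrete (and entirely illustrative) numeric example of why the objective is unbounded below when $r < d$: a Gaussian that concentrates variance $\gamma$ in the $d - r$ off-manifold directions has a log-density at on-manifold points that grows without bound as $\gamma \to 0$.

```python
import math

d, r = 2, 1  # a 1-D manifold (e.g., a line) embedded in R^2
for gamma in [1.0, 1e-2, 1e-4, 1e-6]:
    # Gaussian with variance gamma in each of the d - r off-manifold
    # directions, evaluated at a point lying exactly on the manifold.
    log_p = -0.5 * (d - r) * math.log(2.0 * math.pi * gamma)
    print(f"gamma = {gamma:.0e}:  log p(x) = {log_p:+.2f}")
```

So $-\log p_{\theta}(x) \to -\infty$ on the manifold even though nothing pathological is happening: the probability mass is simply collapsing onto $\mathcal{X}$.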

Theorem 3: Assume $r < d$ and that there exists a manifold density $\tilde p_{gt}(x)$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\mathcal{X}$. Then for any $\kappa \geq r$, there is a sequence of $\kappa$-simple VAE model parameters $\{\theta_t^\star, \phi_t^\star\}$ such that

  • $\lim_{t\to\infty} \mathbb{KL}[q_{\phi_t^\star}(z|x) || p_{\theta_t^\star}(z|x)] = 0$ and $\lim_{t \to \infty} \int_{\mathcal{X}} -\log p_{\theta_t^\star}(x)\, \mu_{gt}\, dx = -\infty$
  • $\lim_{t\to\infty} \int_{A} p_{\theta_t^\star}(x)\, dx = \mu_{gt}(A \cap \mathcal{X})$ for all measurable sets $A \subseteq \mathbb{R}^d$ with $\mu_{gt}(\partial A \cap \mathcal{X}) = 0$, where $\partial A$ is the boundary of $A$.

Implications of this theorem:

  • From (1), the VAE Gaussian assumptions do not prevent the minimum of $\mathcal{L}(\theta, \phi)$ from diverging to $-\infty$.
  • From (2), there exist solutions that assign probability mass to measurable subsets of $\mathbb{R}^d$ in a way that is indistinguishable from the ground-truth measure.
  • In the $r = d$ situation, the theorem necessitates that the ground-truth probability measure is recovered almost everywhere.
  • In the $r < d$ situation, we have not ruled out the possibility that a different set of parameters $\{\theta, \phi\}$ can push the loss to $-\infty$ without achieving (2), i.e., the VAE can reach the lower bound of the negative log-likelihood yet fail to closely approximate $\mu_{gt}$.

Optimal Solutions

Necessary conditions for the VAE optimum are established by the following theorems.

Theorem 4: Let $\{\theta^\star_\gamma, \phi_\gamma^\star\}$ denote an optimal $\kappa$-simple VAE solution (with $\kappa \geq r$) where the decoder variance $\gamma$ is fixed. Moreover, assume that $\mu_{gt}$ is not a Gaussian distribution when $d = r$. Then for any $\gamma > 0$, there exists a $\gamma' < \gamma$ such that $\mathcal{L}(\theta_{\gamma'}^\star, \phi_{\gamma'}^\star) < \mathcal{L}(\theta_{\gamma}^\star, \phi_{\gamma}^\star)$.

The theorem implies that if $\gamma$ is not constrained, we must have $\gamma \to 0$ to minimize the VAE objective. In contrast, in existing practical VAE applications it is standard to fix $\gamma \approx 1$ under the standard Gaussian assumptions during training. The sketch below illustrates the effect.
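A quick way to see Theorem 4's implication (our own illustration, not from the paper): for a Gaussian decoder, the data term given total squared reconstruction error $\mathrm{err}$ is $\frac{d}{2}\log(2\pi\gamma) + \frac{\mathrm{err}}{2\gamma}$, which is minimized at $\gamma = \mathrm{err}/d$ and keeps decreasing as the achievable reconstruction error shrinks.

```python
import math

d = 784  # ambient dimension, e.g., flattened MNIST
for err in [100.0, 1.0, 0.01]:
    gamma_opt = err / d  # argmin over gamma of the Gaussian data term
    data_term = 0.5 * d * math.log(2.0 * math.pi * gamma_opt) + err / (2.0 * gamma_opt)
    print(f"err = {err:7.2f}  ->  gamma* = {gamma_opt:.2e},  data term = {data_term:9.2f}")
```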

Theorem 5: Under the same conditions and definitions as in Theorem 4, for all $x$ drawn from $\mu_{gt}$ we also have
$$\lim_{\gamma \to 0} f_{\mu_x} [f_{\mu_z}(x; \phi_{\gamma}^\star) + f_{S_z}(x; \phi_{\gamma}^\star)\, \epsilon;\, \theta_\gamma^\star] = \lim_{\gamma \to 0} f_{\mu_x}[f_{\mu_z}(x;\phi_\gamma^\star);\, \theta_\gamma^\star] = x, \quad \forall \epsilon \in \mathbb{R}^\kappa$$
where $f_{\mu_z}$ and $f_{S_z}$ are the encoder mean and covariance square-root functions and $f_{\mu_x}$ is the decoder mean function.

  • This theorem indicates that any $\mathbf{x} \in \mathcal{X}$ will be perfectly reconstructed by the VAE model at globally optimal solutions.
  • Adding extra latent dimensions beyond $r$ cannot improve the value of the VAE data term in any meaningful way. During training, $r$ eigenvalues of the encoder covariance $\Sigma_z$ are likely to converge to 0 and the remaining $\kappa - r$ to converge to one. This demonstrates that the VAE has the ability to detect the manifold dimension and select the proper number of latent dimensions in practical environments; see the sketch after this list.
  • If the VAE model parameters have learned a near-optimal mapping onto $\mathcal{X}$ using $\gamma \approx 0$, then the VAE cost will scale as $(d - r) \log \gamma$ regardless of $\mu_{gt}$.
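This eigenvalue behavior suggests a simple diagnostic for reading the estimated manifold dimension off a trained first-stage VAE. A sketch, assuming a batch of encoder log-variances and a threshold of our own choosing:

```python
import torch

def estimate_active_dims(logvar_z_batch, threshold=0.5):
    """Count latent dimensions whose average posterior variance stays
    well below 1 (informative) rather than matching the prior (superfluous)."""
    mean_var = logvar_z_batch.exp().mean(dim=0)  # shape: (kappa,)
    return int((mean_var < threshold).sum().item())
```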

Two-Stage VAE Model

The above analysis suggests the following two-stage remedy:

  1. Given $n$ observed samples $\{x^{(i)}\}^n_{i=1}$, train a $\kappa$-simple VAE, with $\kappa \geq r$, to estimate the unknown $r$-dimensional ground-truth manifold $\mathcal{X}$ embedded in $\mathbb{R}^d$ using a minimal number of active latent dimensions. Generate latent samples $\{z^{(i)}\}^n_{i=1}$ via $z^{(i)} \sim q_{\phi}(z|x^{(i)})$.
  2. Train a second $\kappa$-simple VAE, with independent parameters $\{\theta', \phi'\}$ and latent representation $u$, treating the unknown distribution $q_\phi(z)$ as the new ground-truth distribution and using the samples $\{z^{(i)}\}^n_{i=1}$ to learn it.
  3. Samples approximating the original ground truth $\mu_{gt}$ can then be formed via the extended ancestral process $u \sim \mathcal{N}(u | 0, \mathbf{I})$, $z \sim p_{\theta'}(z | u)$, $x \sim p_{\theta}(x|z)$; see the sketch below.
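A minimal sketch of this sampling pipeline, assuming `vae1` and `vae2` are the trained first- and second-stage models from the `KappaSimpleVAE` sketch above; for simplicity we take the decoder means rather than adding the $\gamma \mathbf{I}$ decoder noise.

```python
import torch

@torch.no_grad()
def sample_two_stage(vae1, vae2, n, kappa):
    # Note: the second-stage VAE operates in the latent space, so its
    # ambient dimension equals kappa (i.e., vae2 = KappaSimpleVAE(kappa, kappa)).
    u = torch.randn(n, kappa)   # u ~ N(0, I)
    z = vae2.decoder(u)         # mean of p_{theta'}(z|u)
    x = vae1.decoder(z)         # mean of p_theta(x|z)
    return x
```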

[Figure: the structure of the first stage of the Two-Stage VAE model]

Analysis:

  • If the first stage was successful, then samples from $q_\phi(z)$ will have nonzero measure across the full ambient space $\mathbb{R}^\kappa$, even though they will not generally resemble $\mathcal{N}(z|0, \mathbf{I})$.
  • If $\kappa > r$, then the extra latent dimensions will be naturally filled in via randomness.
  • Consequently, as long as we set $\kappa \geq r$, the operational regime of the second-stage VAE is effectively equivalent to the situation where the manifold dimension equals the ambient dimension, so reaching a globally optimal solution recovers the ground-truth probability measure almost everywhere.

Experiment Results

The following table reports the results of experiments conducted on four significantly different datasets: MNIST, Fashion-MNIST, CIFAR-10, and CelebA. The evaluation metric is the Fréchet Inception Distance (FID) [3], which assesses the quality of images created by a generative model by comparing the distribution of generated images with the distribution of real images.
[Table: FID scores on MNIST, Fashion-MNIST, CIFAR-10, and CelebA]
Note: the two stages need to be trained separately; concatenating the two stages and training them jointly does not improve performance.

Another set of experiments was conducted on the same datasets with a different evaluation metric. The Kernel Inception Distance (KID) [4] applies a polynomial-kernel Maximum Mean Discrepancy (MMD) measure to estimate the inception distance, since the FID score is believed to exhibit bias in certain circumstances. A sketch of this kernel MMD appears below.
[Table: KID scores on MNIST, Fashion-MNIST, CIFAR-10, and CelebA]
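For reference, a minimal sketch of the polynomial-kernel MMD underlying KID, assuming `f_real` and `f_fake` are matrices of Inception features; the cubic kernel $k(a, b) = (a^\top b / \mathrm{dim} + 1)^3$ and the unbiased estimator follow the KID paper [4].

```python
import numpy as np

def polynomial_mmd2(f_real, f_fake):
    """Unbiased squared MMD with the cubic polynomial kernel used by KID."""
    dim = f_real.shape[1]
    k = lambda a, b: (a @ b.T / dim + 1.0) ** 3
    k_rr, k_ff, k_rf = k(f_real, f_real), k(f_fake, f_fake), k(f_real, f_fake)
    m, n = len(f_real), len(f_fake)
    # Exclude the diagonal terms of the within-set kernel matrices.
    return ((k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
            + (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
            - 2.0 * k_rf.mean())
```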

Analysis of the Results:

  • The second stage of the Two-Stage VAE reduces the gap between $q(z)$ and $p(z)$, resulting in better manifold reconstruction.
  • $\gamma$ converges to zero at any global minimum of the VAE objective, allowing for tighter image reconstructions with a better manifold fit.


Contributions and Conclusions

  1. This paper rigorously proves that the VAE global optimum can in fact learn a mapping to the correct ground-truth manifold when $r < d$, but not necessarily the correct probability measure within this manifold.
  2. The proposed Two-Stage VAE model resolves this issue, better recovering the ground-truth measure by reducing the gap between $q_\phi(z)$ and $p(z)$. This is the first demonstration of a VAE pipeline that can produce stable FID scores comparable to at least some popular GAN models under neutral testing conditions.
  3. The two-stage mechanism improves the reconstruction of the original distribution so that it has comparable performance with GAN models. This work narrows the gap between VAEs and GANs in terms of the realism of generated samples, making VAEs worth considering in a broader range of applications.
  4. The Gaussian assumptions of the canonical VAE model are not an obstacle to achieving the optimal solutions.

References


  1. Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

  2. Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

  3. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

  4. Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv:1801.01401, 2018.