Can VAE learn concepts from data unsupervised?
Dachun Sun (dsun18@illinois.edu)
Background on VAEs
Variational Autoencoders (VAEs) are a method of modeling a data distribution by introducing latent random variables. Intuitively, a VAE encodes the input into a compressed representation in the latent space, and by forcing a faithful reconstruction, the model hopefully captures meaningful structure in the data distribution.
To be formal, we propose
$$p_\theta(x) = \int p_\theta(x|z)\, p_\theta(z)\, dz,$$
where $p_\theta(z)$ is the prior distribution, usually assumed to be a standard normal distribution, $p_\theta(x|z)$ is the likelihood (the probabilistic decoder), and $p_\theta(z|x)$ is the posterior.
Computing the posterior $p_\theta(z|x) = p_\theta(x|z)p_\theta(z)/p_\theta(x)$, however, requires this integral, which is intractable when $z$ is high dimensional. So we introduce another approximator $q_\phi(z|x)$, the probabilistic encoder.
The architecture can be summarized as follows:
Image credits to Wikipedia, Variational Autoencoder. |
Naturally, in order to train the model, we want to maximize the log-likelihood of the dataset, $\sum_i \log p_\theta(x^{(i)})$. We achieve this by maximizing a lower bound of it.
Evidence Lower Bound (ELBO)
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\text{KL}}\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) =: \text{ELBO}.$$
In practice, we use gradient descent to minimize the negative ELBO, called the VAE loss:
$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + D_{\text{KL}}\left(q_\phi(z|x)\,\|\,p_\theta(z)\right).$$
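To make this concrete, here is a minimal sketch of the VAE loss in PyTorch, assuming a Gaussian encoder that outputs `mu` and `logvar` and a Bernoulli (binary cross-entropy) decoder; these modeling choices are illustrative, not the only option:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO with q(z|x) = N(mu, diag(exp(logvar))) and prior N(0, I).
    The reconstruction term is a one-sample Monte Carlo estimate of
    -E_q[log p(x|z)]; the KL term has a closed form for diagonal Gaussians."""
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```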
What is Desired? Disentanglement!
Disentanglement = Independence + Semantics
We hope that unsupervised learning can produce results that carry special meaning to human beings. One specific criterion is whether each dimension of the latent space has an atomic meaning capturing some concept from the dataset.
- Unsupervised learning of a disentangled posterior distribution over the underlying generative factors of sensory data is a major challenge in AI research [1][2].
- Motivations include discovering independent components, controllable sample generation, and generalization/robustness.
- Facilitates interpretable decision making and controlled transfer.
The following graph from Ricky Chen’s talk demonstrates clearly what we want. Different sample points in the latent space differ in gender, age, and so on. However, the axis indicated by the arrow controls whether the generated face wears sunglasses. Such disentanglement in the space means we can reliably predict how the generated images will change.
Axis-aligned traversal in the representation space and global interpretability in data space. Image credits to Ricky Chen’s talk at NIPS 2018. |
On the other hand, the vanilla VAE objective focuses only on reconstruction; if we look at the examples on the right, traversing along an axis does not produce a smooth, consistent change.
Traversal of the rotational latent dimension [3]. |
Datasets for Disentanglement
Most of these datasets are specifically constructed so that the intended disentanglement factors are clear. Take dSprites as an example: the factors are shape, scale, orientation, x-position, and y-position.
Common datasets used in the disentanglement task [4]. |
Related Works
DC-IGN
An obvious attempt is to have the designers attach meanings to the latent dimensions. The Deep Convolutional Inverse Graphics Network (DC-IGN) [5] is a model similar to a VAE, with a specially designed training procedure that enforces a designed latent space.
DC-IGN architecture [5]. |
DC-IGN latent structure [5]. |
In short, to enforce the structure, they use a modified training procedure. First, select a latent dimension corresponding to a factor, then form a minibatch in which only that factor changes. The outputs of the other latent dimensions are clamped to their minibatch average, so the mixed gradient signal forces the network to capture the change in the specified dimension. A rough sketch follows the figure below.
DC-IGN training [5]. |
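A hedged sketch of this clamping idea (the function name and tensor layout are illustrative assumptions, not the paper’s code):

```python
import torch

def clamp_inactive_dims(z, active_dim):
    """DC-IGN-style clamping (sketch): given latents z of shape [B, D] from a
    minibatch in which only one generative factor varies, replace every
    dimension except `active_dim` with its minibatch mean. The decoder then
    only sees variation along `active_dim`, so the reconstruction gradient
    pushes that single dimension to explain the change."""
    z_clamped = z.mean(dim=0, keepdim=True).expand_as(z).clone()
    z_clamped[:, active_dim] = z[:, active_dim]
    return z_clamped
```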
InfoGAN
The GAN formulation uses a simple factored continuous input noise vector $z$, but it imposes no restrictions on how the generator may use it. Hence the generator may use it in a highly entangled way.
However, InfoGAN [6]:
- Uses a set of structured latent variables $c_1, \dots, c_L$, assuming $p(c_1, \dots, c_L) = \prod_{i=1}^{L} p(c_i)$.
- The generator becomes $G(z, c)$.
- With no constraints, the generator could ignore $c$: $p_G(x|c) = p_G(x)$.
- There should be high mutual information between the latent code and the generator distribution, meaning $I(c; G(z, c))$ should be high.
An Attempt: $\beta$-VAE
ELBO from Another Perspective
Quick Mention on Karush-Kuhn-Tucker (KKT) Conditions
Suppose we have a non-linear programming problem:
$$\min_x f(x) \quad \text{subject to}\quad g_i(x) \le 0,\quad h_j(x) = 0.$$
Then we can form the Lagrangian function:
$$L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x).$$
If $x^*$ solves the problem, then the Karush-Kuhn-Tucker conditions hold:
- Stationarity: $\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) + \sum_j \mu_j \nabla h_j(x^*) = 0$ for minimization.
- Primal Feasibility: $g_i(x^*) \le 0$ and $h_j(x^*) = 0$.
- Dual Feasibility: $\lambda_i \ge 0$.
- Complementary Slackness: $\lambda_i g_i(x^*) = 0$.
If we take a look at the VAE loss again, we can formulate it as a constrained optimization problem:

Optimization Problem from ELBO
$$\max_{\theta,\phi}\; \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] \quad \text{subject to}\quad D_{\text{KL}}\left(q_\phi(z|x)\,\|\,p(z)\right) < \epsilon.$$

Rewriting it as a Lagrangian under the KKT conditions, we have
$$\mathcal{F}(\theta, \phi, \beta) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta\left(D_{\text{KL}}\left(q_\phi(z|x)\,\|\,p(z)\right) - \epsilon\right).$$
Since $\beta \ge 0$ (dual feasibility) and $\epsilon \ge 0$, we have $\beta\epsilon \ge 0$ according to the complementary slackness, which gives the $\beta$-VAE loss:

$\beta$-VAE Loss
$$\mathcal{L}_{\beta\text{-VAE}} = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \beta\, D_{\text{KL}}\left(q_\phi(z|x)\,\|\,p(z)\right).$$
Observations
- Setting $\beta = 1$ corresponds to the original VAE formulation.
- Setting $\beta > 1$ puts a stronger constraint on the latent bottleneck:
  - Limiting the capacity of $z$ while trying to maximize the log-likelihood should encourage the model to learn a more efficient representation.
  - A higher value of $\beta$ should encourage conditional independence in $q_\phi(z|x)$, because more weight is put on the KL term.
- Disentangled representations emerge when the right balance is found between reconstruction and latent capacity restriction:
  - This creates a trade-off between reconstruction fidelity and the quality of the disentanglement.
- Note: in real implementations, $\beta$ is usually a training-step-dependent variable, annealed from 0 to the set value. The intuition behind this warm-up is to first let the network learn to reconstruct; see the sketch after this list.
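A minimal sketch of such a warm-up schedule (the linear shape and function name are illustrative assumptions; implementations vary):

```python
def beta_schedule(step, warmup_steps, beta_max):
    """Linearly anneal beta from 0 to beta_max over `warmup_steps` training
    steps, so the network first learns to reconstruct before the KL
    constraint is tightened."""
    return beta_max * min(1.0, step / warmup_steps)
```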
Measuring Disentanglement - Higgins’ Metric
The basic idea for measuring the quality of disentanglement is to create pairs of data points where one factor is fixed while the others are sampled randomly. Then we let a classifier act on the difference between their latent representations and see whether the fixed factor can be singled out, reporting the classifier accuracy as the disentanglement score. A sketch is given below the figure.
Image from [3]. |
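Below is a rough sketch of this metric; `sample_fixed_factor_pair` and `encode` are hypothetical helpers standing in for dataset- and model-specific code, and the linear classifier is one reasonable choice, not the only one:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def higgins_metric(sample_fixed_factor_pair, encode, n_factors=5,
                   n_points=800, n_pairs=64):
    """Sketch of Higgins' classifier-based disentanglement metric.
    `sample_fixed_factor_pair(k)` is assumed to return two observations that
    share the value of factor k but differ in the others; `encode(x)` is
    assumed to return the latent mean as a 1-D numpy array."""
    X, y = [], []
    for _ in range(n_points):
        k = np.random.randint(n_factors)
        # Average the absolute latent difference over several pairs to
        # suppress noise from the randomly sampled non-fixed factors.
        diffs = [np.abs(encode(x1) - encode(x2))
                 for x1, x2 in (sample_fixed_factor_pair(k) for _ in range(n_pairs))]
        X.append(np.mean(diffs, axis=0))
        y.append(k)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X, y)  # in practice, score on a held-out split
```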
Results
Results from $\beta$-VAE [3]. |
The Effect of Tuning $\beta$
- $\beta$ is a mixing coefficient that weighs the gradient magnitudes between the reconstruction and the prior-matching terms. So it is natural to analyze a normalized $\beta$, scaled by the latent space dimension $M$ and the input data dimension $N$: $\beta_{\text{norm}} = \beta M / N$.
- If $\beta$ is too low or too high, the model learns an entangled representation, due to either too much or too little capacity in the latent bottleneck.
- Good disentangled representations often lead to blurry reconstructions. Nevertheless, in general, $\beta > 1$ is necessary to achieve good disentanglement.
A positive correlation is present between the size of $z$ and the optimal normalised values of $\beta$ for disentangled factor learning, for a fixed $\beta$-VAE architecture. Orange approximately corresponds to unnormalised $\beta$ [3]. |
How Does It Work?
In short, it is not clear from $\beta$-VAE alone. Thus, we need to investigate why a large $\beta$ penalizing the KL term has such an effect.
Decomposing the ELBO More
Quick Mention on Mutual Information (MI)
Let $(X, Y)$ be a pair of random variables over the space $\mathcal{X} \times \mathcal{Y}$. Then their mutual information is
$$I(X;Y) = \mathbb{E}_X\left[D_{\text{KL}}\left(P_{Y|X}\,\|\,P_Y\right)\right] = D_{\text{KL}}\left(P_{(X,Y)}\,\|\,P_X \otimes P_Y\right).$$
$I(X;Y)$ intuitively measures how much you can infer about one random variable given knowledge of the other. $I(X;Y) = 0$ means independence, because nothing can be inferred (not related at all). A small worked example follows.
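As a quick sanity check of this intuition, here is a small self-contained example computing $I(X;Y)$ for discrete joint distributions:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = KL(P_{XY} || P_X (x) P_Y) for a discrete joint distribution
    given as a 2-D array of probabilities."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0  # skip zero-probability cells (0 * log 0 = 0)
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Perfectly dependent pair: knowing X determines Y, so I(X;Y) = H(X) = log 2.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))  # ~0.693
# Independent pair: the joint factorizes, so I(X;Y) = 0.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
```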
TC-Decomposition
Define a uniform random variable $n$ on $\{1, 2, \dots, N\}$ that indexes the data points. Denote $q(z|n) = q(z|x_n)$ and $q(z, n) = q(z|n)p(n) = q(z|n)\frac{1}{N}$. The marginal $q(z) = \sum_{n=1}^{N} q(z|n)p(n)$ is the *aggregated posterior*. Then we can decompose the (averaged) regularization term in the ELBO as
$$\mathbb{E}_{p(n)}\left[D_{\text{KL}}\left(q(z|n)\,\|\,p(z)\right)\right] = \underbrace{I_q(z;n)}_{\text{index-code MI}} + \underbrace{D_{\text{KL}}\left(q(z)\,\|\,\textstyle\prod_j q(z_j)\right)}_{\text{total correlation}} + \underbrace{\textstyle\sum_j D_{\text{KL}}\left(q(z_j)\,\|\,p(z_j)\right)}_{\text{dimension-wise KL}}.$$
- The index-code MI is the mutual information $I_q(z;n)$ between the data index and the latent variable. It has been argued that higher mutual information can lead to better disentanglement, but recent investigations also claim that penalizing it encourages compact and disentangled representations.
- The total correlation $D_{\text{KL}}\left(q(z)\,\|\,\prod_j q(z_j)\right)$ is one of many generalizations of mutual information. It measures the dependency between the latent dimensions, and is claimed to be the main source of disentanglement.
- The dimension-wise KL divergence $\sum_j D_{\text{KL}}\left(q(z_j)\,\|\,p(z_j)\right)$ mainly prevents individual latent dimensions from deviating too far from their priors. It acts like a complexity penalty.
$\beta$-TCVAE Loss
$$\mathcal{L}_{\beta\text{-TC}} = -\mathbb{E}_{q(z|n)p(n)}\left[\log p(x_n|z)\right] + \alpha\, I_q(z;n) + \beta\, D_{\text{KL}}\left(q(z)\,\|\,\textstyle\prod_j q(z_j)\right) + \gamma\, \textstyle\sum_j D_{\text{KL}}\left(q(z_j)\,\|\,p(z_j)\right).$$
- Ablation studies show that tuning $\beta$ leads to the best results. The proposed model uses $\alpha = \gamma = 1$, which gives the same objective as FactorVAE [7].
- It provides a better trade-off between density estimation and disentanglement. Unlike $\beta$-VAE, a higher value of $\beta$ does not penalize the mutual information term too much.
Ablation study shows that setting $\alpha$ to zero gives no clear improvement [4]. |
Estimating the Density from a Minibatch
The decomposition requires evaluating the density $q(z) = \mathbb{E}_{p(n)}[q(z|n)]$, which depends on the entire dataset. A simple Monte Carlo approximation is not likely to work, so we need weighted sampling. Given a minibatch of samples $\{x_1, \dots, x_M\}$, we use the estimator
$$\mathbb{E}_{q(z)}\left[\log q(z)\right] \approx \frac{1}{M}\sum_{i=1}^{M}\left[\log \frac{1}{NM}\sum_{j=1}^{M} q\left(z(x_i)\,|\,x_j\right)\right],$$
where $z(x_i)$ is a sample from $q(z|x_i)$, $N$ is the dataset size, and $M$ is the minibatch size. A sketch of this estimator follows.
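A sketch of this estimator for diagonal Gaussian posteriors, following the minibatch-weighted-sampling idea (tensor layout and function name are assumptions):

```python
import math
import torch

def log_qz_estimates(z, mu, logvar, dataset_size):
    """Minibatch-weighted-sampling estimates (sketch).
    z, mu, logvar: [M, D] tensors, where z[i] ~ q(z|x_i) and (mu[j], logvar[j])
    parameterize the diagonal Gaussian q(z|x_j) for the same minibatch."""
    M, D = z.shape
    # log q(z_i | x_j) per dimension, for every pair (i, j): shape [M, M, D].
    log_q_pair = -0.5 * (math.log(2 * math.pi) + logvar.unsqueeze(0)
                         + (z.unsqueeze(1) - mu.unsqueeze(0)) ** 2
                         / logvar.exp().unsqueeze(0))
    # log q(z_i) ~= logsumexp_j [ log q(z_i|x_j) ] - log(N * M)
    log_qz = torch.logsumexp(log_q_pair.sum(dim=2), dim=1) - math.log(dataset_size * M)
    # Dimension-wise marginals, needed for the total-correlation term.
    log_qz_dims = (torch.logsumexp(log_q_pair, dim=1)
                   - math.log(dataset_size * M)).sum(dim=1)
    tc_estimate = (log_qz - log_qz_dims).mean()  # ~= KL(q(z) || prod_j q(z_j))
    return log_qz, log_qz_dims, tc_estimate
```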
Measuring Disentanglement - Mutual Information Gap (MIG)
Higgins’ metric uses an extra classifier, which introduces hyperparameters and more training time. In addition, it cannot measure axis alignment. Is there a metric based only on the distribution of factors and latent variables?
Mutual Information Gap (MIG) is introduced to solve these problems. Estimate the mutual information $I(z_j; v_k)$ between a latent variable $z_j$ and a ground-truth factor $v_k$, and use it as follows. Higher mutual information implies that $z_j$ contains a lot of information about $v_k$, and the MI is maximal if there exists a deterministic, invertible relationship between them.
- For each $v_k$, take the $z_j$ with the highest and the second highest mutual information with $v_k$; their gap measures how exclusively a single latent axis captures the factor:
$$\text{MIG} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{H(v_k)}\left(I(z_{j^{(k)}}; v_k) - \max_{j \ne j^{(k)}} I(z_j; v_k)\right), \quad j^{(k)} = \arg\max_j I(z_j; v_k).$$
Averaging over the $K$ factors and normalizing by the entropy $H(v_k)$ gives a value between 0 and 1; a MIG close to 1 implies good disentanglement. A sketch is given below.
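A minimal sketch of the MIG computation, assuming the pairwise mutual information matrix and the factor entropies have already been estimated:

```python
import numpy as np

def mig(mi_matrix, factor_entropies):
    """Mutual Information Gap (sketch). `mi_matrix[j, k]` holds the estimated
    I(z_j; v_k) between latent j and ground-truth factor k, and
    `factor_entropies[k]` = H(v_k); both are assumed to be precomputed."""
    sorted_mi = np.sort(mi_matrix, axis=0)[::-1]  # descending over latents j
    gaps = (sorted_mi[0] - sorted_mi[1]) / factor_entropies
    return gaps.mean()
```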
Joint distribution between latent variables and ground truth factors. Image Credits to Ricky Chen’s talk at NIPS 2018 |
Mutual information between latent variables and ground truth factors. Image Credits to Ricky Chen’s talk at NIPS 2018 |
Results
Results from $\beta$-TCVAE [4]. |
Conclusion
There have been many efforts across machine learning communities to produce interpretable artificial intelligence systems. Unsupervised learning is a particularly hard setting in which to enforce interpretability and independence between representations. However, through these explorations and attempts, we have gained more understanding of the objective (the ELBO) and the optimization process, and we have many impressive results where the underlying factors are disentangled.
1. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
2. Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
3. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.
4. Ricky T. Q. Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
5. Tejas D. Kulkarni, Will Whitney, Pushmeet Kohli, and Joshua B. Tenenbaum. Deep convolutional inverse graphics network. arXiv preprint arXiv:1503.03167, 2015.
6. Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2180–2188, 2016.
7. Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, pages 2649–2658. PMLR, 2018.