GAN4 Towards a Better Global Loss Landscape of GANs
Nengyu Wang (nengyuw2@illinois.edu)
Generative Adversarial Nets (GANs) (Goodfellow et al., 2014) are a successful method for various practical applications. Meanwhile, current theoretical studies of GANs dig into the underlying mechanisms from the perspectives of statistics and optimization.
On the statistics side, Goodfellow et al. (2014) link the min-max formulation to the JS (Jensen-Shannon) divergence, and Wasserstein Generative Adversarial Nets (WGANs) (Arjovsky et al., 2017) adopt the Wasserstein distance as the loss function. The generalization properties of GANs have also been investigated to see how broadly GAN methods apply. On the optimization side, works on cyclic behavior (Balduzzi et al., 2018) study the issue that the optimization algorithm may cycle around a stable point, converge slowly, or even diverge. Another optimization challenge is to avoid sub-optimal local minima. For GANs, existing works (Mescheder et al., 2018) either analyze convex-concave games or perform only local analysis without a global analysis. Even the works that do conduct global analysis only handle simple settings that do not generalize further.
Therefore, to fill this gap in the theoretical analysis of GANs, the main goal of this work is to perform a global analysis of the GAN landscape for general data distributions. In the paper, the work is placed in a table that compares it with other theoretical works:
Specifically, this work performs a global analysis of the GAN landscape by comparing separable GANs (SepGAN, which includes the standard JS-GAN) with relativistic paired GANs (RpGAN).
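As a rough sketch of the two families (notation simplified from the paper; $D$ denotes the raw discriminator output, and $h$, $h_1$, $h_2$ are scalar link functions, e.g. the log-sigmoid $h(t) = -\log(1 + e^{-t})$ for the JS-GAN / RS-GAN instances):

$$\text{SepGAN:}\;\; \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[h_1\big(D(x)\big)\big] + \mathbb{E}_{z \sim p_z}\big[h_2\big(-D(G(z))\big)\big]$$

$$\text{RpGAN:}\;\; \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}},\, z \sim p_z}\big[h\big(D(x) - D(G(z))\big)\big]$$

The key structural difference is that the separable loss scores real and fake samples independently, while the relativistic paired loss only ever sees the difference of the two scores.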
Relativistic GANs
Before further discussion, let us first introduce the work on Relativistic GANs (Jolicoeur-Martineau, 2019). As we know, the original GANs suffer from unstable training and mode collapse, so some works try to solve these problems either by imposing regularization or by changing the loss. In this paper, the concept of relativity is emphasized, suggesting that the discriminator needs the relative probability of real images being real in order to help training.
The table above compares how the probability of the real images being real changes for the standard GAN and the relativistic GAN. Taking images of bread as the real images, we have three situations:
- The real images are bread and the fake images are dogs. Then, both the absolute and the relative probability of the real images being bread are one.
- The real images are bread and the fake images are dogs that look similar to bread. Then, the absolute probability of the real images being bread is still one, while the relative probability decreases (see the worked numbers after this list).
- The real images are bread in the shape of dogs and the fake images are dogs. Then, the absolute probability of the real images being bread is low, while the relative probability is higher, since the real images still look more like bread than the fakes do.
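To put hypothetical numbers on the second situation (the scores below are made up for illustration, with $\sigma$ the sigmoid and $D$ the raw discriminator score): suppose a real bread image gets $D(x_r) = 2.0$, so its absolute probability of being real is $\sigma(2.0) \approx 0.88$. If the fake images are obvious dogs with $D(x_f) = -3.0$, the relative probability $\sigma(D(x_r) - D(x_f)) = \sigma(5.0) \approx 0.99$ is also high; but if the fakes become bread-like with $D(x_f) = 1.5$, the absolute probability is unchanged at $\approx 0.88$ while the relative probability drops to $\sigma(0.5) \approx 0.62$.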
In summary, when the discriminator scores how realistic a sample is, both real data and fake data should be taken into account, so the absolute judgement of real or fake is replaced by the probability that a real sample is more realistic than a fake one. This point is illustrated from two aspects:
- Using the prior knowledge that the input batch consists of half real and half fake images, the discriminator should lower its scores for real samples when the fake samples look as realistic as the real ones, instead of simply judging every sample as real.
- As the following figure shows, standard training only encourages the generator to push the discriminator's scores for fake images up, while the scores of the real samples are ignored. Instead, ideal training should also decrease the score of real samples as the fake samples become more and more realistic.
Therefore, to make the discriminator judge the samples relatively, the loss objective is modified so that the score of a real sample is always compared against the score of a fake sample. For the relativistic standard GAN (RS-GAN), the discriminator and generator losses become

$$L_D = -\mathbb{E}_{(x_r, x_f)}\big[\log \sigma\big(D(x_r) - D(x_f)\big)\big], \qquad L_G = -\mathbb{E}_{(x_r, x_f)}\big[\log \sigma\big(D(x_f) - D(x_r)\big)\big],$$

where pairs of real and fake samples $(x_r, x_f)$ are used for computing the loss and $D(\cdot)$ denotes the raw (pre-sigmoid) discriminator output.
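A minimal PyTorch-style sketch of these paired losses next to the standard separable loss (illustrative only; `d_real` and `d_fake` are assumed to be raw discriminator logits for a batch of real and generated samples, paired elementwise):

```python
import torch
import torch.nn.functional as F

def sep_gan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Standard (separable) GAN discriminator loss: real and fake scored independently."""
    # softplus(-t) = -log(sigmoid(t)), softplus(t) = -log(1 - sigmoid(t))
    return F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

def rs_gan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """RS-GAN discriminator loss: each real logit is scored relative to a paired fake logit."""
    return F.softplus(-(d_real - d_fake)).mean()  # -log sigmoid(D(x_r) - D(x_f))

def rs_gan_g_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """RS-GAN generator loss: push the fake logit above the paired real logit."""
    return F.softplus(-(d_fake - d_real)).mean()  # -log sigmoid(D(x_f) - D(x_r))

# Toy usage with random logits standing in for discriminator outputs.
d_real, d_fake = torch.randn(8), torch.randn(8)
print(rs_gan_d_loss(d_real, d_fake).item(), rs_gan_g_loss(d_real, d_fake).item())
```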
Landscape Analysis of GANs
The paper uses a simple case to illustrate the phenomenon of bad local minima in GANs.
Given real samples $x_1$ and $x_2$, we need to generate fake samples $y_1$ and $y_2$ that match the real samples as closely as possible. As the discriminator updates, a decision boundary separating real from fake samples is placed in the margin between the two sample sets. Then, by updating the generator, the generated samples are pushed across this boundary toward the real samples. As this process repeats, the generated samples can end up trapped in a cluster around one real sample, which is the so-called mode collapse.
For the relativistic GAN (RpGAN), since samples are compared within each pair instead of between the two whole sets, each generated sample can be pushed toward a different real sample, thereby relieving the mode collapse problem of standard GANs.
Bad Local Minima
To further investigate how this pairing influences local minima, the paper considers a two-point case:
Given two real samples $x_1$ and $x_2$ and two fake samples $y_1$ and $y_2$, we consider four states that represent different placements of the fake samples, ranging from both fake samples collapsing onto one real sample to a perfect match with the two real samples.
Evaluating the divergence function at these states, we have the following:
where the divergence is the one induced by the JS-GAN loss. This discrete representation can be extended to a continuous curve:
We can see that the landscape of JS-GAN has a local minimum at the state where both fake samples collapse onto one real sample and a global minimum at the state where the fake samples exactly match the real samples. In other words, once all fake samples sit in the cluster of one real sample, they are trapped in the local minimum, which corresponds to the previous illustration.
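As a quick numerical sanity check of this picture (a sketch only, under the assumption of an optimal function-space discriminator, in which case the JS-GAN objective reduces to the Jensen-Shannon divergence between the empirical real and fake distributions; the point locations and states below are illustrative, not taken from the paper):

```python
import numpy as np
from collections import Counter

def js_divergence(p_points, q_points):
    """Jensen-Shannon divergence between two empirical point distributions (in nats)."""
    p_counts, q_counts = Counter(p_points), Counter(q_points)
    support = sorted(set(p_counts) | set(q_counts))
    p = np.array([p_counts[s] / len(p_points) for s in support])
    q = np.array([q_counts[s] / len(q_points) for s in support])
    m = 0.5 * (p + q)

    def kl(a, b):
        return sum(ai * np.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = (0.0, 1.0)  # two real samples x1 = 0 and x2 = 1
states = {
    "collapse (y1 = y2 = x1)": (0.0, 0.0),
    "halfway  (y2 between x1 and x2)": (0.0, 0.5),
    "matched  (y1 = x1, y2 = x2)": (0.0, 1.0),
}
for name, fake in states.items():
    print(f"{name}: {js_divergence(real, fake):.4f}")
# collapse: 0.2158   halfway: 0.3466   matched: 0.0000
```

Moving one fake sample off $x_1$ toward $x_2$ first increases the divergence (0.22 to 0.35) before it can drop to zero at the perfect match, which is exactly why the collapsed state is a strict local minimum.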
Then, for the RS-GAN, we have:
and
Therefore, we can see that the RS-GAN only has a global minimum, at the state where the fake samples match the real samples, with no sub-optimal local minima.
Landscape Results in Function Space
Theorem 1
Suppose the real samples $x_1, \dots, x_n$ are distinct. Suppose $h_1, h_2$ satisfy Assumptions 4.1 and 4.3. Then for the separable-GAN loss defined in Eq. (5), we have: (i) The global minimal value is achieved iff the generated samples $y_1, \dots, y_n$ exactly match the real samples $x_1, \dots, x_n$. (ii) If every generated sample coincides with some real sample but at least one real sample is not covered by any generated sample, then this configuration is a sub-optimal strict local minimum. Therefore, the separable-GAN loss has sub-optimal strict local minima.
This theorem generalizes the results of the two-point case to any $n$ and shows two things: first, for the standard GAN, the global minimum is achieved when the two sets of points perfectly match; second, sub-optimal local minima exist whenever the assumptions and the condition in part (ii) of the theorem are satisfied. Since the condition in part (ii) is easy to satisfy, such local minima can hardly be avoided.
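As an illustrative reading of condition (ii) (this concrete configuration is my own example, not from the paper): with $n = 3$ distinct real samples, the assignment

$$(y_1, y_2, y_3) = (x_1, x_1, x_2)$$

places every generated sample on some real sample while leaving $x_3$ uncovered, so under the theorem's assumptions it is a sub-optimal strict local minimum: the generator has dropped the mode at $x_3$ and cannot recover it by any small perturbation.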
Definition (global-min-reachable)
We say a point $y$ is global-min-reachable for a function $f$ if there exists a continuous path from $y$ to one global minimum of $f$ along which the value of $f$ is non-increasing.
In other words, a point is global-min-reachable if it lies on a non-increasing path that ends at a global minimum. With this definition, we have Theorem 2.
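For intuition, two simple one-dimensional examples (mine, not from the paper): for $f(x) = (x^2 - 1)^2$, every point is global-min-reachable, since from any $x$ one can slide monotonically downhill to one of the two global minimizers $x = \pm 1$. For $g(x) = (x^2 - 1)^2 + x/2$, the tilt turns the minimum on the positive side into a strict sub-optimal local minimum; a point sitting there is not global-min-reachable, because any continuous path toward the lower minimum on the negative side must first go uphill.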
Theorem 2
Suppose the real samples $x_1, \dots, x_n$ are distinct. Suppose $h$ satisfies the corresponding assumptions. Then for the RpGAN loss defined in Eq. (6): (i) The global minimal value is achieved iff the generated samples exactly match the real samples. (ii) Any configuration of generated samples is global-min-reachable for the RpGAN loss.
The main point of this theorem is that from any configuration there exists a non-increasing path to a global minimum, which rules out sub-optimal strict local minima that could trap the optimization of the RpGAN loss.
Hence, these two theorems generalize the conclusion of the two-point case to the general case and characterize the global landscape: for standard (separable) GANs, sub-optimal local minima and their consequences, such as mode collapse, are hard to avoid, while RpGANs have no such bad local minima and therefore avoid mode collapse and related issues.
Results
This figure shows how the distribution of generated samples moves away from an initial state in which the fake samples are trapped around one cluster of real samples. With red points denoting generated samples and blue points denoting real samples, it is clear that for RS-GAN the generated samples escape the trap of the real-sample cluster, i.e. the local minimum, much faster. These results support the theoretical analysis.
The figure above shows how the discriminator loss changes as training progresses. Compared to the JS-GAN loss, the RS-GAN loss converges much faster. Another observation is that the JS-GAN loss gets stuck around 0.48 for a while, which is close to the theoretical loss value at one of the bad states discussed earlier, so, again, it supports the previous analysis.
References
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In ICML, 2017.
D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, 2018.
A. Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. In ICLR, 2019.