Introduction

Background

Generative Adversarial Networks (GANs) have achieved great success at generating realistic and sharp-looking images. However, they remain remarkably difficult to train, and most papers at the time were dedicated to heuristically finding stable architectures, with little to no theory explaining the unstable behaviour of GAN training.

GAN Formulation

A GAN, in its original formulation, plays the following minimax game:
$\min_G\max_D V(D,G) = \mathbb E_{x\sim p_{data}(x)}[\log D(x)]+\mathbb E_{z\sim p_z(z)}[\log(1-D(G(z)))]$
It can be reformulated as:
$C(G) = \max_D V(G,D)$
$\qquad = \mathbb E_{x\sim p_{data}}[\log D^G(x)]+\mathbb E_{z\sim p_z}[\log(1-D^G(G(z)))]$
$\qquad = \mathbb E_{x\sim p_{data}}[\log D^G(x)]+\mathbb E_{x\sim p_g}[\log(1-D^G(x))]$
$\qquad = \mathbb E_{x\sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right]+\mathbb E_{x\sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x)+p_g(x)}\right]$
and this is called the virtual training criterion.

It can be shown that C(G) is equivalent to the following:
$C(G) = -\log(4)+2\cdot JSD(p_{data}||p_g)$
where JSD is the Jensen-Shannon divergence:
$JSD(\mathbb P_r||\mathbb P_g)=\frac{1}{2}KL(\mathbb P_r||\mathbb P_A)+\frac{1}{2}KL(\mathbb P_g||\mathbb P_A)$
where $\mathbb P_A$ is the ‘average’ distribution, with density $\frac{P_r+P_g}{2}$.
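As a quick sanity check of this identity (a sketch with hypothetical discrete distributions standing in for $p_{data}$ and $p_g$, not an experiment from the paper), we can evaluate $C(G)$ at the optimal discriminator and compare it with $-\log(4)+2\cdot JSD(p_{data}||p_g)$:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q) in nats

# Hypothetical discrete stand-ins for p_data and p_g.
p_data = np.array([0.1, 0.4, 0.3, 0.2])
p_g = np.array([0.25, 0.25, 0.25, 0.25])

# C(G) evaluated at the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)).
d_star = p_data / (p_data + p_g)
c_g = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# -log(4) + 2 * JSD(p_data || p_g), with the JSD built from its two KL terms.
p_avg = 0.5 * (p_data + p_g)
jsd = 0.5 * entropy(p_data, p_avg) + 0.5 * entropy(p_g, p_avg)

print(c_g, -np.log(4) + 2.0 * jsd)  # the two values coincide
```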

To train a GAN, we first train the discriminator to optimality, i.e. we maximize:
$L(D,g_\theta) = \mathbb E_{x\sim \mathbb P_r}[\log D(x)]+\mathbb E_{x\sim \mathbb P_g}[\log(1-D(x))]$
and the optimal discriminator is obtained as:
$D^*(x) = \frac{P_r(x)}{P_r(x)+P_g(x)}$

Plugging the optimal discriminator into $L(D,g_\theta)$, we get:
$L(D^*, g_\theta) = 2JSD(\mathbb P_r||\mathbb P_g)-2\log2$
So, when the discriminator is optimal, minimizing $L(D^*, g_\theta)$ as a function of $\theta$ amounts to minimizing the Jensen-Shannon divergence between $\mathbb P_r$ and $\mathbb P_g$.

In theory, we would first train the discriminator as close to optimality as we can, then take gradient steps on $\theta$, and alternate these two steps to obtain our generator. **But this doesn't work.** In practice, as the discriminator gets better, the updates to the generator get consistently worse.
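Below is a minimal, self-contained sketch of that alternating procedure (assuming a PyTorch-style API; the 1-D toy data, tiny networks, learning rates, and the `n_critic` inner-loop count are all illustrative choices, not the paper's setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z_dim, n_critic = 4, 5

G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 3.0          # toy "real" data: N(3, 0.25)

    # Inner loop: push D towards the optimal discriminator for the current G.
    for _ in range(n_critic):
        fake = G(torch.randn(64, z_dim)).detach()
        loss_d = -(torch.log(D(real) + 1e-8).mean()
                   + torch.log(1 - D(fake) + 1e-8).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # One gradient step on theta with the original cost E_z[log(1 - D(G(z)))].
    loss_g = torch.log(1 - D(G(torch.randn(64, z_dim))) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```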

Main Contribution

Based on these problems with GAN training, the authors pose four questions to be answered:

  1. Why do generator updates get worse as the discriminator gets better, both with the original and with the alternative cost function?
  2. Why is GAN training massively unstable?
  3. Is the new cost function following a similar divergence to the JSD? If so, what are its properties?
  4. Is there a way to avoid some of these issues?

We first take a look at the sources of instability.

Sources of Instability

Discontinuity Conjecture

From the cost function and the optimal discriminator, we ‘know’ that the trained discriminator will have cost of at most $2\log2-2JSD(\mathbb P_r||\mathbb P_g)$. However, in practice, if we train $D$ till convergence, its error goes to 0, pointing to the fact that the JSD between the two distributions is maxed out. The only way this can happen is if the distributions are not continuous, or if they have disjoint supports. One possible cause for a distribution to be discontinuous is that its support lies on a low dimensional manifold. Previous work on GANs provides strong empirical and theoretical evidence that $\mathbb P_r$ is indeed extremely concentrated on a low dimensional manifold, and the authors prove that this is the case for $\mathbb P_g$ as well.

The figure below illustrates this: a DCGAN is trained for 1, 10, and 25 epochs; then, with the generator fixed, a discriminator is trained from scratch, and its error rapidly approaches 0.

Lemma 1

Let $g:\mathcal Z\rightarrow\mathcal X$ be a function composed by affine transformations and pointwise nonlinearities, which can either be rectifiers, leaky rectifiers, or smooth strictly increasing functions (such as the sigmoid, tanh, softplus, etc.). Then, $g(\mathcal Z)$ is contained in a countable union of manifolds of dimension at most $\dim \mathcal Z$. Therefore, if the dimension of $\mathcal Z$ is less than that of $\mathcal X$, $g(\mathcal Z)$ will be a set of measure 0 in $\mathcal X$.

The intuition behind this lemma: $\mathbb P_g$ is defined via sampling from a simple prior $z\sim p(z)$ and then applying a function $g:\mathcal Z\rightarrow\mathcal X$, so the support of $\mathbb P_g$ has to be contained in $g(\mathcal Z)$. If the dimensionality of $\mathcal Z$ is less than that of $\mathcal X$, then it is impossible for $\mathbb P_g$ to be continuous in $\mathcal X$.
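A tiny numerical illustration of the measure-zero claim (a sketch; it uses a purely affine $g$ for simplicity, which is a special case of the lemma): a 2-D code mapped into $\mathbb R^3$ produces samples that all lie on a plane.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)   # affine generator g(z) = Wz + b

samples = rng.normal(size=(5000, 2)) @ W.T + b       # 5000 samples of g(z), z in R^2
sv = np.linalg.svd(samples - samples.mean(0), compute_uv=False)
# Two large singular values, the third is numerically zero: the support of P_g is a
# 2-D plane, hence a set of Lebesgue measure 0 in R^3.
print(sv)
```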

Perfect Discrimination Theorem

Theorem 2.1

If two distributions $\mathbb P_r$ and $\mathbb P_g$ have support contained in two disjoint compact subsets $\mathcal M$ and $\mathcal P$ respectively, then there is a smooth optimal discriminator $D^*:\mathcal X\rightarrow [0,1]$ that has accuracy 1 and $\nabla_xD^*(x)=0$ for all $x\in \mathcal M\cup\mathcal P$.

Definition (Transversal Intersection)

Let $\mathcal M$ and $\mathcal P$ be two boundary free regular submanifolds of $\mathcal F$, which in our case will simply be $\mathcal F = \mathbb R^d$. Let $x\in \mathcal M\cap \mathcal P$ be an intersection point of the two manifolds. We say that $\mathcal M$ and $\mathcal P$ intersect transversally in $x$ if $\mathcal T_x\mathcal M+\mathcal T_x\mathcal P = \mathcal T_x\mathcal F$, where $\mathcal T_x\mathcal M$ denotes the tangent space of $\mathcal M$ around $x$.

Definition (Perfect Alignment)

We say that two manifolds without boundary $\mathcal M$ and $\mathcal P$ **perfectly align** if there exists $x\in \mathcal M \cap \mathcal P$ such that $\mathcal M$ and $\mathcal P$ don't intersect transversally in $x$. If $\mathcal M$ and $\mathcal P$ have boundaries $\partial\mathcal M$ and $\partial\mathcal P$, we say that $\mathcal M$ and $\mathcal P$ perfectly align if any of the boundary free manifold pairs $(\mathcal M, \mathcal P)$, $(\mathcal M, \partial\mathcal P)$, $(\partial\mathcal M, \mathcal P)$, $(\partial\mathcal M, \partial\mathcal P)$ perfectly align.

Lemma 2

Let $\mathcal M$ and $\mathcal P$ be two regular submanifolds of $\mathbb R^d$ that don't have full dimension. Let $\eta, \eta'$ be arbitrary independent continuous random variables. Define the perturbed manifolds $\tilde{\mathcal M}=\mathcal M+\eta$ and $\tilde{\mathcal P}=\mathcal{P}+\eta'$. Then
$\mathbb P_{\eta, \eta'}(\tilde{\mathcal M} \text{ does not perfectly align with } \tilde{\mathcal P}) = 1$

This implies that, in practice, we can safely assume that any two manifolds never perfectly align, since arbitrarily small perturbations of the two manifolds will lead them to either intersect transversally or not intersect at all.

Lemma 3

Let $\mathcal M$ and $\mathcal P$ be two regular submanifolds of $\mathbb R^d$ that don't perfectly align and don't have full dimension. Let $\mathcal L=\mathcal M\cap\mathcal P$. If $\mathcal M$ and $\mathcal P$ don't have boundary, then $\mathcal L$ is also a manifold, and has strictly lower dimension than both $\mathcal M$ and $\mathcal P$. If they have boundary, $\mathcal L$ is a union of at most 4 strictly lower dimensional manifolds. In both cases, $\mathcal L$ has measure 0 in both $\mathcal M$ and $\mathcal P$.

If two manifolds don’t perfectly align, their intersection $\mathcal L=\mathcal M\cap\mathcal P$ will be a finite union of manifolds with dimension strictly lower than both the dimension of $\mathcal M$ and that of $\mathcal P$.

Theorem 2.2 (The Perfect Discrimination Theorem)

Let $\mathbb P_r$ and $\mathbb P_g$ be two distributions that have support contained in two closed manifolds $\mathcal M$ and $\mathcal P$ that don't perfectly align and don't have full dimension. We further assume that $\mathbb P_r$ and $\mathbb P_g$ are continuous in their respective manifolds, meaning that if there is a set $A$ with measure 0 in $\mathcal M$, then $\mathbb P_r(A)=0$ (and analogously for $\mathbb P_g$). **Then, there exists an optimal discriminator $D^*:\mathcal X\rightarrow[0,1]$ that has accuracy 1 and, for almost any $x$ in $\mathcal M$ or $\mathcal P$, $D^*$ is smooth in a neighbourhood of $x$ and $\nabla_xD^*(x)=0$.**

Taken together, Theorems 2.1 and 2.2 tell us that there are perfect discriminators which are smooth and constant almost everywhere on $\mathcal M$ and $\mathcal P$. Since the discriminator is constant on both manifolds, the generator won't be able to learn anything by backpropagating through it.

Theorem 2.3

Let $\mathbb P_r$ and $\mathbb P_g$ be two distributions whose supports lie in two manifolds $\mathcal M$ and $\mathcal P$ that don't have full dimension and don't perfectly align.
We further assume that $\mathbb P_r$ and $\mathbb P_g$ are continuous in their respective manifolds. Then,
$JSD(\mathbb P_r||\mathbb P_g)=\log2$
$KL(\mathbb P_r||\mathbb P_g)=+\infty$
$KL(\mathbb P_g||\mathbb P_r)=+\infty$

This implies:

  1. The divergences are maxed out even if the two manifolds lie arbitrarily close to each other, so it is impossible to apply gradient descent on them (see the toy check after this list).
  2. Samples from our generator might look impressively good, yet both KL divergences will be infinite.
  3. Attempting to use these divergences out of the box to test similarity between the distributions we typically consider might be a terrible idea.
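A toy check of the first point (a sketch built on a hypothetical example: two distributions supported on the parallel segments $\{0\}\times[0,1]$ and $\{\theta\}\times[0,1]$, reduced to two point masses on the $x$-coordinate): no matter how small $\theta$ is, the JSD sits at $\log 2$ and both KLs are infinite.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) = KL(p || q)

# P_r puts all its mass at x = 0, P_g puts all its mass at x = theta.
# The value of theta never enters the computation: only the disjointness matters.
p_r = np.array([1.0, 0.0])
p_g = np.array([0.0, 1.0])

m = 0.5 * (p_r + p_g)
print(0.5 * entropy(p_r, m) + 0.5 * entropy(p_g, m), np.log(2))  # JSD = log 2
print(entropy(p_r, p_g), entropy(p_g, p_r))                      # both KLs are inf
```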

Vanishing Gradients on the Generator

Denote $||D||=\sup_{x\in\mathcal X}|D(x)|+||\nabla_xD(x)||_2$.
(Note: the authors state that this norm is used to make the proofs simpler; it can be replaced with the Sobolev norm $||\cdot||_{1,p}$ for $p<\infty$, which is covered by the universal approximation theorem in the sense that we can guarantee a neural network approximation in this norm.)

Theorem 2.4

Let $g_\theta:\mathcal Z\rightarrow\mathcal X$ be a differentiable function that induces a distribution $\mathbb P_g$. Let $\mathbb P_r$ be the real data distribution. Let $D$ be a differentiable discriminator. If the conditions of Theorems 2.1 and 2.2 are satisfied, $||D-D^*||<\epsilon$, and $\mathbb E_{z\sim p(z)}[||J_\theta g_\theta(z)||^2_2]\le M^2$, then
$||\nabla_\theta\mathbb E_{z\sim p(z)}[\log(1-D(g_\theta(z)))]||_2<M\frac{\epsilon}{1-\epsilon}$

Theorem 2.4 shows that as our discriminator gets better, the gradient of the generator vanishes.
And the figure below shows an experimental verification of this statement: first train a DCGAN for 1, 10 and 25 epochs; then, with the generator fixed, train a discriminator from scratch and measure the gradients passed to the generator under the original cost function.
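A miniature version of that experiment (a hedged sketch assuming a PyTorch-style API, not the paper's DCGAN setup): fix an untrained generator on 1-D toy data, train the discriminator for progressively more steps, and watch the norm of the generator's gradient under the original cost shrink.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))   # fixed, untrained

for d_steps in [1, 100, 2000]:
    D = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    for _ in range(d_steps):                         # train D against the fixed G
        real = torch.randn(256, 1) * 0.3 + 4.0       # real data far from G's samples
        fake = G(torch.randn(256, 4)).detach()
        loss_d = -(torch.log(D(real) + 1e-8).mean()
                   + torch.log(1 - D(fake) + 1e-8).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Norm of grad_theta E_z[log(1 - D(G(z)))]; it typically shrinks as D improves.
    loss_g = torch.log(1 - D(G(torch.randn(256, 4))) + 1e-8).mean()
    grads = torch.autograd.grad(loss_g, list(G.parameters()))
    print(d_steps, sum(g.norm() ** 2 for g in grads).sqrt().item())
```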

The $-\log D$ Alternative

In the original GAN paper, the authors conjectured that the unstable behaviour is due to the choice of loss function, so they proposed an alternative to “fix” it. But even with the alternative loss function the problem remains, and little is known about the nature of this alternative loss.

To avoid vanishing gradients when the discriminator is very confident, an alternative gradient step for the generator is used:
$\Delta\theta=\nabla_\theta\mathbb E_{z\sim p(z)}[-\log D(g_\theta(z))]$
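The difference between the two generator losses is easy to see by differentiating them with respect to the discriminator output $d = D(g_\theta(z))$ (a quick numerical illustration, not from the paper): when $D$ confidently rejects a sample ($d$ near 0), $\log(1-d)$ yields an almost-flat gradient while $-\log d$ yields a huge one.

```python
import numpy as np

d = np.array([0.5, 0.1, 0.01, 0.001])   # discriminator outputs on fake samples
grad_original = -1.0 / (1.0 - d)        # d/dd [ log(1 - d) ]
grad_alternative = -1.0 / d             # d/dd [ -log d ]
for di, g_orig, g_alt in zip(d, grad_original, grad_alternative):
    print(f"D(G(z)) = {di:<6}  log(1-D): {g_orig:8.3f}   -log D: {g_alt:10.1f}")
```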

Theorem 2.5

Let $\mathbb P_r$ and $\mathbb P_{g_\theta}$ be two continuous distributions, with densities $P_r$ and $P_{g_\theta}$ respectively. Let $D^*(x) = \frac{P_r(x)}{P_r(x)+P_{g_\theta}(x)}$ be the optimal discriminator, fixed for a value $\theta_0$. Therefore,
$\mathbb E_{z\sim p(z)}[-\nabla_\theta\log D^*(g_\theta(z))|_{\theta=\theta_0}] =\nabla_\theta[KL(\mathbb P_{g_\theta}||\mathbb P_r)-2JSD(\mathbb P_{g_\theta}||\mathbb P_r)]|_{\theta=\theta_0}$
Note the sign of the JSD term: this update minimizes the reverse KL while simultaneously *maximizing* the JSD between the two distributions, which is already a strange objective to be optimizing.

Theorem 2.6

Let $g_\theta:\mathcal Z\rightarrow \mathcal X$ be a differentiable function that induces a distribution $\mathbb P_g$. Let $\mathbb P_r$ be the real data distribution, with either the conditions of Theorem 2.1 or those of Theorem 2.2 satisfied. Let $D$ be a discriminator such that $D^*-D=\epsilon$ is a centered Gaussian process indexed by $x$ and independent for every $x$, and $\nabla_xD^*-\nabla_xD=r$ is another centered Gaussian process indexed by $x$, independent for every $x$ and independent of $\epsilon$. Then, each coordinate of
$\mathbb E_{z\sim p(z)}[-\nabla_\theta\log D(g_\theta(z))]$
is a centered Cauchy distribution with infinite expectation and variance.

Theorem 2.6 implies that the gradients would have massive variance, and these unstable updates would actually lower sample quality. Moreover, the fact that the distribution of the updates is centered means that, if we bound the updates, their expectation would be 0, providing no feedback to the gradient.
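The heavy-tail mechanism behind Theorem 2.6 can be seen in miniature (a sketch): the update involves a ratio of two independent centered Gaussians, which is Cauchy distributed, so running averages of such updates never settle down.

```python
import numpy as np

rng = np.random.default_rng(0)
# Ratio of two independent centered Gaussians: a standard Cauchy sample.
ratios = rng.normal(size=1_000_000) / rng.normal(size=1_000_000)
for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, ratios[:n].mean())   # the running mean keeps jumping around, never converges
```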

Towards Softer Metrics and Distributions

Break Assumptions

To fix the instability and vanishing-gradient issues, we need to break the assumptions of the previous theorems. The authors choose to add continuous noise to the inputs of the discriminator, thereby smoothing the distribution of the probability mass.
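A sketch of this fix (assuming a PyTorch-style API; `sigma` and the objects referenced in the comment are hypothetical, following the training-loop sketch earlier in these notes):

```python
import torch

def noisy(x, sigma=0.1):
    """Add isotropic Gaussian noise to a batch of discriminator inputs."""
    return x + sigma * torch.randn_like(x)

# Inside the discriminator update of the earlier training-loop sketch, both real
# and generated batches would be noised before being passed to D, e.g.:
#   loss_d = -(torch.log(D(noisy(real)) + 1e-8).mean()
#              + torch.log(1 - D(noisy(G(z).detach())) + 1e-8).mean())
```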

Theorem 3.1

If $X$ has distribution $\mathbb P_X$ with support on $\mathcal M$ and $\epsilon$ is an absolutely continuous random variable with density $P_\epsilon$, then $\mathbb P_{X+\epsilon}$ is absolutely continuous with density
$P_{X+\epsilon}(x)=\mathbb E_{y\sim\mathbb P_X}[P_\epsilon(x-y)] = \int_{\mathcal M}P_\epsilon(x-y)\,d\mathbb P_X(y)$

Theorem 3.2

Let $\mathbb P_r$ and $\mathbb P_g$ be two distributions with support on $\mathcal M$ and $\mathcal P$ respectively, with $\epsilon\sim\mathcal N(0,\sigma^2I)$. Then, the gradient passed to the generator has the form
$\mathbb E_{z\sim p(z)}[\nabla_\theta\log(1-D^*(g_\theta(z)))]$
$= \mathbb E_{z\sim p(z)}\Big[a(z)\int_{\mathcal M}P_\epsilon(g_\theta(z)-y)\nabla_\theta||g_\theta(z)-y||^2\,d\mathbb P_r(y) - b(z)\int_{\mathcal P}P_\epsilon(g_\theta(z)-y)\nabla_\theta||g_\theta(z)-y||^2\,d\mathbb P_g(y)\Big]$
where $a(z)$ and $b(z)$ are positive functions.
This theorem proves that we will drive our samples gθ(z)g_\theta(z) towards points along the data manifold, weighted by their probability and the distance from our samples.

To protect the discriminator from measure 0 adversarial examples, it is important to backprop through noisy samples in the generator as well.

Corollary

Let $\epsilon, \epsilon'\sim\mathcal N(0,\sigma^2I)$ and $\tilde g_\theta(z) = g_\theta(z) + \epsilon'$. Then
$\mathbb E_{z\sim p(z)}[\nabla_\theta\log(1-D^*(\tilde g_\theta(z)))]$
$= \mathbb E_{z\sim p(z)}\Big[a(z)\int_{\mathcal M}P_\epsilon(\tilde g_\theta(z)-y)\nabla_\theta||\tilde g_\theta(z)-y||^2\,d\mathbb P_r(y)-b(z)\int_{\mathcal P}P_\epsilon(\tilde g_\theta(z)-y)\nabla_\theta||\tilde g_\theta(z)-y||^2\,d\mathbb P_g(y)\Big]$
$= 2\nabla_\theta JSD(\mathbb P_{r+\epsilon}||\mathbb P_{g+\epsilon})$

Wasserstein Metric

The other way to go is to use the Wasserstein metric in place of the JSD.

Definition

$W(P,Q)=\inf_{\gamma\in\Gamma}\int_{\mathcal X\times\mathcal X}||x-y||_2\,d\gamma(x,y)$
where $\Gamma$ is the set of all possible joint distributions on $\mathcal X\times\mathcal X$ that have marginals $P$ and $Q$.

The Wasserstein distance is also called the Earth Mover's Distance (EMD): it is the minimum cost of transporting the whole probability mass of $P$ from its support so that it matches the probability mass of $Q$ on $Q$'s support.

Intuitively, given two distributions, one can be seen as a mass of earth properly spread in space, the other as a collection of holes in that same space. Then, the EMD measures the least amount of work needed to fill the holes with earth. Here, a unit of work corresponds to transporting a unit of earth by a unit of ground distance.
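A quick 1-D illustration (using SciPy's one-dimensional `wasserstein_distance`; the toy distributions are hypothetical): unlike the JSD, the EMD keeps reporting how far apart two nearly-disjoint distributions are, and it decreases smoothly as they move closer.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=0.01, size=10_000)
for shift in [4.0, 2.0, 1.0, 0.1]:
    fake = rng.normal(loc=shift, scale=0.01, size=10_000)
    # EMD ~ the shift itself, decreasing smoothly, whereas the JSD would stay
    # essentially pinned at log 2 for every one of these shifts.
    print(shift, wasserstein_distance(real, fake))
```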

Lemma 4

Let $\epsilon$ be a random vector with mean 0. Then we have
$W(\mathbb P_X,\mathbb P_{X+\epsilon})\le V^{\frac{1}{2}}$
where $V=\mathbb E[||\epsilon||^2_2]$ is the variance of $\epsilon$.
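A numerical sanity check of Lemma 4 in one dimension (a sketch using SciPy's 1-D Wasserstein distance and Gaussian data; not from the paper):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                     # samples of X
for sigma in [0.1, 0.5, 1.0]:
    eps = rng.normal(scale=sigma, size=x.shape)  # noise with V = sigma^2
    # Empirical W(P_X, P_{X+eps}) stays below V^(1/2) = sigma, as Lemma 4 predicts.
    print(sigma, wasserstein_distance(x, x + eps))
```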

Theorem 3.3

Let $\mathbb P_r$ and $\mathbb P_g$ be any two distributions, and $\epsilon$ be a random vector with mean 0 and variance $V$. If $\mathbb P_{r+\epsilon}$ and $\mathbb P_{g+\epsilon}$ have support contained in a ball of diameter $C$, then
$W(\mathbb P_r,\mathbb P_g)\le 2V^{\frac{1}{2}}+2C\sqrt{JSD(\mathbb P_{r+\epsilon}||\mathbb P_{g+\epsilon})}$

This theorem implies that the Wasserstein distance between $\mathbb P_r$ and $\mathbb P_g$ can be controlled: the first term of the upper bound shrinks as we anneal the noise variance $V$, and the second term is minimized by training the GAN on the noised distributions, i.e. by driving down the JSD between $\mathbb P_{r+\epsilon}$ and $\mathbb P_{g+\epsilon}$.