Generative Adversarial Networks (GANs) are notoriously hard to train. In a recent paper, we presented an idea that might help remedy this.
Our intern Casper spent the summer working with GANs, resulting in a paper which appeared on arXiv this week. One particular technique did us great service: instance noise. It's not the main focus of Casper's paper, so the details have been relegated to an Appendix; we thought it would be a good idea to summarise the technique here and give a few more details. Naturally, I think the full paper is also worth a read; there are a few more interesting things in there.
There are different ways to think about GANs: you can approach them from a game-theoretic view of seeking a Nash equilibrium (Salimans et al., 2016), or you can treat them as an EM-like iterative algorithm in which the discriminator's job is likelihood-ratio estimation (Mohamed et al., 2016; Uehara et al., 2016; Nowozin et al., 2016). If you've read my earlier posts, it should come as no surprise that I subscribe to the latter view.
Consider the following idealised GAN algorithm, each iteration consisting of the following steps:

1. train the discriminator to convergence, via logistic regression, to distinguish real data drawn from P from synthetic data drawn from the generator's distribution Q_θ;
2. use the trained discriminator to form a variational lower bound on the Jensen-Shannon divergence JS(P‖Q_θ), written out below;
3. update the generator parameters θ by taking a gradient step on this bound, decreasing the estimated divergence.
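To make step 2 concrete: for any discriminator D(x) estimating the probability that x is real, the logistic regression loss yields the standard variational lower bound (cf. Nowozin et al., 2016)

$$
\mathrm{JS}(P \| Q_\theta) \;\geq\; \log 2 \;+\; \tfrac{1}{2}\,\mathbb{E}_{x\sim P}\left[\log D(x)\right] \;+\; \tfrac{1}{2}\,\mathbb{E}_{x\sim Q_\theta}\left[\log\left(1 - D(x)\right)\right],
$$

which holds with equality exactly when D is the Bayes-optimal classifier.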
In theory, this procedure should drive Q_θ towards the data distribution P, and GANs should just work. So why don't they?
Crucially, the convergence of this algorithm relies on a few assumptions that are never really made explicit and that don't always hold:

1. that the log-likelihood-ratio between P and Q_θ is finite, which requires the two distributions to have overlapping support;
2. that the JS divergence is a well-behaved, non-constant function of θ, so that its gradients are informative;
3. that the Bayes-optimal solution to the logistic regression problem is unique: there is a single optimal discriminator that does a much better job than any other classifier.

In practice, real data and the generator's samples tend to concentrate on low-dimensional manifolds, which typically do not overlap. When the supports are disjoint:

1. the log-likelihood-ratio, and therefore the KL divergence, is infinite and not well defined;
2. the JS divergence is saturated at its maximum value of log 2 and is locally constant in θ, so its gradient vanishes;
3. there are many classifiers that separate the two distributions near-perfectly, so the Bayes-optimal discriminator is far from unique.
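A one-line example (my own, not from the paper) shows how badly things break when the supports are disjoint. Take both distributions to be point masses:

$$
P = \delta_0, \qquad Q_\theta = \delta_\theta \quad\Longrightarrow\quad \mathrm{JS}(P \| Q_\theta) = \log 2 \;\text{ for all } \theta \neq 0.
$$

The divergence is maximal and flat everywhere except at θ = 0 itself, so its gradient tells the generator nothing about which way to move θ, no matter how close θ already is to the target.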
The main ways to avoid these pathologies involve making the discriminator's job harder. Why? The JS divergence is locally constant in θ, but that doesn't mean the variational lower bound also has to be constant. Indeed, if you cripple the discriminator so that the lower bound is not tight, you may end up with a non-constant function of θ that roughly guides you in the right direction.
An example of this crippling is that in most GAN implementations the discriminator is only partially updated in each iteration, rather than trained until convergence. This extreme form of early stopping is a form of regularisation that prevents the discriminator from overfitting.
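Here is a minimal PyTorch sketch of this pattern; the toy architecture, the Gaussian stand-in for real data, and the choice d_steps = 1 are all illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

latent_dim, data_dim, d_steps = 16, 2, 1  # small d_steps = extreme early stopping

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def sample_real(n):
    # Stand-in for the real data distribution P: a shifted Gaussian.
    return torch.randn(n, data_dim) + 2.0

for step in range(1000):
    # Discriminator: only d_steps gradient steps per iteration,
    # deliberately NOT trained to convergence.
    for _ in range(d_steps):
        real = sample_real(64)
        fake = G(torch.randn(64, latent_dim)).detach()
        loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # Generator step against the partially trained discriminator.
    fake = G(torch.randn(64, latent_dim))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```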
Another way to cripple the discriminator is adding label noise, or equivalently, one-sided label smoothing as introduced by Salimans et al. (2016). In this technique the labels in the discriminator's training data are randomly flipped. Let's illustrate this technique in two figures.
The classification view:
While label noise makes the discriminators' job harder, they are all punished evenly: there is no way for the discriminator to be smart about handling label noise. Adding label noise doesn't change the structure of the logistic regression loss landscape dramatically; it mainly just pushes everything up. Hence, there are still a large number of near-optimal discriminators, and adding label noise still does not allow us to pinpoint a single unique Bayes-optimal classifier. The JS divergence is no longer saturated at its maximum level, but it is still locally constant in θ.
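For concreteness, here is a small sketch (mine, not the paper's) of how label noise and one-sided label smoothing enter the discriminator's targets; flip_prob and the 0.9 smoothing value are illustrative choices:

```python
import torch

def flipped_labels(n, real, flip_prob=0.1):
    # Label noise: each target is flipped with probability flip_prob.
    base = torch.ones(n, 1) if real else torch.zeros(n, 1)
    flip = (torch.rand(n, 1) < flip_prob).float()
    return flip * (1.0 - base) + (1.0 - flip) * base

def smoothed_labels(n, real, smooth=0.9):
    # One-sided label smoothing (Salimans et al., 2016): only the targets
    # for real examples are softened (1 -> smooth); fake targets stay 0.
    return torch.full((n, 1), smooth) if real else torch.zeros(n, 1)

# Usage inside a discriminator update, with D producing logits:
# bce = torch.nn.BCEWithLogitsLoss()
# loss_d = bce(D(real_batch), flipped_labels(len(real_batch), real=True)) + \
#          bce(D(fake_batch), flipped_labels(len(fake_batch), real=False))
```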
An alternative way to think about instance noise vs. label noise is via graphical models. The following three graphical models define joint distributions, parametrised by θ. The GAN algorithm tries to adjust θ so as to minimise the mutual information between the highlighted nodes in these graphical models:
Here's what the variables are:
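To sketch the first model (my reading, since the identity it rests on is standard): y is a fair coin deciding whether a sample is real or synthetic, and x is the sample itself, so that

$$
I(y; x) = \mathrm{JS}(P \| Q_\theta), \qquad y \sim \mathrm{Bernoulli}(\tfrac{1}{2}), \quad x \mid y{=}1 \sim P, \quad x \mid y{=}0 \sim Q_\theta.
$$

Driving this mutual information to zero is therefore exactly driving Q_θ towards P. In the label-noise and instance-noise models, a corrupted copy of y or of x, respectively, replaces the corresponding node.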