Generative Adversarial Networks (GANs) are notoriously hard to train. In a recent paper, we presented an idea that might help remedy this.
Our intern Casper spent the summer working with GANs, resulting in a paper which appeared on arXiv this week. One particular technique did us great service: instance noise. It's not the main focus of Casper's paper, so the details have been relegated to an Appendix; we thought it would be a good idea to summarise the technique here and give a few more details. Naturally, I think the full paper is also worth a read; there are a few more interesting things in there.
There are different ways to think about GANs: you can approach them from a game-theoretic view of seeking a Nash equilibrium (Salimans et al., 2016), or you can treat them as an EM-like iterative algorithm in which the discriminator's job is likelihood-ratio estimation (Mohamed et al., 2016; Uehara et al., 2016; Nowozin et al., 2016). If you've read my earlier posts, it should come as no surprise that I subscribe to the latter view.
Consider the following idealised GAN algorithm, each iteration consisting of the following steps:

1. train the discriminator to convergence, via logistic regression, to distinguish real data drawn from P from synthetic data drawn from the generator's distribution Q_θ;
2. use the trained discriminator to form a variational lower bound on the Jensen-Shannon divergence JS(P‖Q_θ), written out below;
3. update the generator parameters θ by taking a gradient step on this bound, decreasing the estimated divergence.
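To make step 2 concrete: for any discriminator D(x) estimating the probability that x is real, the logistic regression loss yields the standard variational lower bound (cf. Nowozin et al., 2016)

$$
\mathrm{JS}(P \| Q_\theta) \;\geq\; \log 2 \;+\; \tfrac{1}{2}\,\mathbb{E}_{x\sim P}\left[\log D(x)\right] \;+\; \tfrac{1}{2}\,\mathbb{E}_{x\sim Q_\theta}\left[\log\left(1 - D(x)\right)\right],
$$

which holds with equality exactly when D is the Bayes-optimal classifier.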
In theory, this procedure should drive Q_θ towards the data distribution P, and GANs should just work. So why don't they?
Crucially, the convergence of this algorithm relies on a few assumptions that are never really made explicit and that don't always hold:

1. that the log-likelihood-ratio between P and Q_θ is finite, which requires the two distributions to have overlapping support;
2. that the JS divergence is a well-behaved, non-constant function of θ, so that its gradients are informative;
3. that the Bayes-optimal solution to the logistic regression problem is unique: there is a single optimal discriminator that does a much better job than any other classifier.

In practice, real data and the generator's samples tend to concentrate on low-dimensional manifolds, which typically do not overlap. When the supports are disjoint:

1. the log-likelihood-ratio, and therefore the KL divergence, is infinite and not well defined;
2. the JS divergence is saturated at its maximum value of log 2 and is locally constant in θ, so its gradient vanishes;
3. there are many classifiers that separate the two distributions near-perfectly, so the Bayes-optimal discriminator is far from unique.
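A one-line example (my own, not from the paper) shows how badly things break when the supports are disjoint. Take both distributions to be point masses:

$$
P = \delta_0, \qquad Q_\theta = \delta_\theta \quad\Longrightarrow\quad \mathrm{JS}(P \| Q_\theta) = \log 2 \;\text{ for all } \theta \neq 0.
$$

The divergence is maximal and flat everywhere except at θ = 0 itself, so its gradient tells the generator nothing about which way to move θ, no matter how close θ already is to the target.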
The main ways to avoid these pathologies involve making the discriminator's job harder. Why? The JS divergence is locally constant in θ, but that doesn't mean the variational lower bound also has to be constant. Indeed, if you cripple the discriminator so that the lower bound is not tight, you may end up with a non-constant function of θ that roughly guides you in the right direction.
An example of this crippling is that in most GAN implementations the discriminator is only partially updated in each iteration, rather than trained until convergence. This extreme form of early stopping is a form of regularisation that prevents the discriminator from overfitting.
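Here is a minimal PyTorch sketch of this pattern; the toy architecture, the Gaussian stand-in for real data, and the choice d_steps = 1 are all illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

latent_dim, data_dim, d_steps = 16, 2, 1  # small d_steps = extreme early stopping

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def sample_real(n):
    # Stand-in for the real data distribution P: a shifted Gaussian.
    return torch.randn(n, data_dim) + 2.0

for step in range(1000):
    # Discriminator: only d_steps gradient steps per iteration,
    # deliberately NOT trained to convergence.
    for _ in range(d_steps):
        real = sample_real(64)
        fake = G(torch.randn(64, latent_dim)).detach()
        loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # Generator step against the partially trained discriminator.
    fake = G(torch.randn(64, latent_dim))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```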
Another way to cripple the discriminator is adding label noise, or equivalently, one-sided label smoothing as introduced by Salimans et al. (2016). In this technique the labels in the discriminator's training data are randomly flipped. Let's illustrate this technique in two figures.
The classification view:
While label noise makes the discriminators' job harder, they are all punished evenly: there is no way for the discriminator to be smart about handling label noise. Adding label noise doesn't change the structure of the logistic regression loss landscape dramatically; it mainly just pushes everything up. Hence, there are still a large number of near-optimal discriminators, and adding label noise still does not allow us to pinpoint a single unique Bayes-optimal classifier. The JS divergence is no longer saturated at its maximum level, but it is still locally constant in θ.
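For concreteness, here is a small sketch (mine, not the paper's) of how label noise and one-sided label smoothing enter the discriminator's targets; flip_prob and the 0.9 smoothing value are illustrative choices:

```python
import torch

def flipped_labels(n, real, flip_prob=0.1):
    # Label noise: each target is flipped with probability flip_prob.
    base = torch.ones(n, 1) if real else torch.zeros(n, 1)
    flip = (torch.rand(n, 1) < flip_prob).float()
    return flip * (1.0 - base) + (1.0 - flip) * base

def smoothed_labels(n, real, smooth=0.9):
    # One-sided label smoothing (Salimans et al., 2016): only the targets
    # for real examples are softened (1 -> smooth); fake targets stay 0.
    return torch.full((n, 1), smooth) if real else torch.zeros(n, 1)

# Usage inside a discriminator update, with D producing logits:
# bce = torch.nn.BCEWithLogitsLoss()
# loss_d = bce(D(real_batch), flipped_labels(len(real_batch), real=True)) + \
#          bce(D(fake_batch), flipped_labels(len(fake_batch), real=False))
```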
An alternative way to think about instance noise vs. label noise is via graphical models. The following three graphical models define joint distributions, parametrised by θ. The GAN algorithm tries to adjust θ so as to minimise the mutual information between the highlighted nodes in these graphical models:
Here's what the variables are:
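To sketch the first model (my reading, since the identity it rests on is standard): y is a fair coin deciding whether a sample is real or synthetic, and x is the sample itself, so that

$$
I(y; x) = \mathrm{JS}(P \| Q_\theta), \qquad y \sim \mathrm{Bernoulli}(\tfrac{1}{2}), \quad x \mid y{=}1 \sim P, \quad x \mid y{=}0 \sim Q_\theta.
$$

Driving this mutual information to zero is therefore exactly driving Q_θ towards P. In the label-noise and instance-noise models, a corrupted copy of y or of x, respectively, replaces the corresponding node.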