
One of the Most Important Papers: Unsupervised Semantic Feature Learning (Paper Translation and Code)

CreateAMind | Published 2018-07-20 16:17:59

Welcome to follow and pin this public account so we can study deep learning together!


Let us first look at two commentaries on this paper, followed by the translation of the paper itself.

Commentary 2, from: http://www.jianshu.com/p/db87c51de510

InfoGAN (code). Peter Chen and colleagues present InfoGAN, an extension of the GAN that learns disentangled and interpretable representations of images. A regular GAN achieves its goal by reproducing the data distribution with the model, but the layout and organization of the code space is underspecified: there are many possible solutions that map the unit Gaussian onto images, and the one we end up with may be intricate and highly entangled. InfoGAN imposes additional structure on this space by adding a new objective that maximizes the mutual information between a small subset of the representation variables and the observation. This approach gives quite remarkable results. For example, for 3D face images we vary one continuous dimension of the code while keeping the others fixed. It is clear from the five provided examples (along each row) that the resulting dimension of the code captures an interpretable factor of variation, and that the model has probably understood that there are camera angles, facial variations, and so on, without ever being told that these features exist and matter:

We also note that nice, disentangled representations have been achieved before (for example with DC-IGN by Kulkarni et al.), but those approaches relied on additional supervision, whereas this approach is entirely unsupervised.

Commentary 1, from: https://mp.weixin.qq.com/s?__biz=MzI1NTE4NTUwOQ==&mid=2650325352&idx=1&sn=90fb15cee44fa7175a804418259d352e

The last way to improve the training stability of GANs gets closer to the essence of the problem and is also the most recent result: InfoGAN [7], billed as one of OpenAI's five recent breakthroughs. The starting point of InfoGAN [7] is this: since the GAN's excessive freedom comes from having only a single noise vector z, with no control over how the GAN uses that z, we should work on "how z is used." So [7] decomposes z, arguing that the "prior" contained in the GAN's generator (G) comes in two kinds: (1) noise z that cannot be compressed any further, and (2) a set of interpretable, semantically meaningful latent variables c_1, c_2, ..., c_L, abbreviated as c. The main idea is that when we learn to generate images, an image has many controllable, meaningful dimensions, such as stroke thickness or the direction of lighting; these are c, and whatever is left that we do not know how to describe is z. In effect, [7] hopes that by decomposing the prior in this way, the GAN can learn a more disentangled representation of the data, so that we can both control the GAN's learning process and make the learned result more interpretable. To introduce this c, [7] uses mutual information: c should be highly related to the image the generator (G) produces from z and c, that is G(z, c), meaning the mutual information between them should be large. With this finer-grained latent-variable modeling, InfoGAN pushes the development of GANs forward another step. First, the authors show that the c in InfoGAN genuinely helps GAN training, i.e. it lets the generator (G) learn results that better match the real data. Second, by exploiting the natural properties of c and controlling its dimensions, InfoGAN can control the variation of the generated images along a particular semantic dimension.

Paper translation:

(Many of the formulas were garbled when the text was copied; please consult the original paper for the exact formulas.)

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Abstract

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner.

InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation.

We derive a lower bound of the mutual information objective that can be optimized efficiently.

Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset.

It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing supervised methods.

1 Introduction

Unsupervised learning can be described as the general problem of extracting value from unlabelled data which exists in vast quantities.

A popular framework for unsupervised learning is that of representation learning [1, 2], whose goal is to use unlabelled data to learn a representation that exposes important semantic features as easily decodable factors. A method that can learn such representations is likely to exist [2], and to be useful for many downstream tasks which include classification, regression, visualization, and policy learning in reinforcement learning.

While unsupervised learning is ill-posed because the relevant downstream tasks are unknown at training time, a disentangled representation, one which explicitly represents the salient attributes of a data instance, should be helpful for the relevant but unknown tasks.

For example, for a dataset of faces, a useful disentangled representation may allocate a separate set of dimensions for each of the following attributes: facial expression, eye color, hairstyle, presence or absence of eyeglasses, and the identity of the corresponding person.

A disentangled representation can be useful for natural tasks that require knowledge of the salient attributes of the data, which include tasks like face recognition and object recognition.

It is not the case for unnatural supervised tasks, where the goal could be, for example, to determine whether the number of red pixels in an image is even or odd. Thus, to be useful, an unsupervised learning algorithm must in effect correctly guess the likely set of downstream classification tasks without being directly exposed to them.

A significant fraction of unsupervised learning research is driven by generative modelling. It is motivated by the belief that the ability to synthesize, or “create” the observed data entails some form of understanding, and it is hoped that a good generative model will automatically learn a disentangled representation, even though it is easy to construct perfect generative models with arbitrarily bad representations. The most prominent generative models are the variational autoencoder (VAE) [3] and the generative adversarial network (GAN) [4].

In this paper, we present a simple modification to the generative adversarial network objective that encourages it to learn interpretable and meaningful representations.

We do so by maximizing the mutual information between a fixed small subset of the GAN’s noise variables and the observations, which turns out to be relatively straightforward.

Despite its simplicity, we found our method to be surprisingly effective: it was able to discover highly semantic and meaningful hidden representations on a number of image datasets: digits (MNIST), faces (CelebA), and house numbers (SVHN).

The quality of our unsupervised disentangled representation matches previous works that made use of supervised label information [5–9]. These results suggest that generative modelling augmented with a mutual information cost could be a fruitful approach for learning disentangled representations.

In the remainder of the paper, we begin with a review of the related work, noting the supervision that is required by previous methods that learn disentangled representations. Then we review GANs, which is the basis of InfoGAN. We describe how maximizing mutual information results in interpretable representations and derive a simple and efficient algorithm for doing so. Finally, in the experiments section, we first compare InfoGAN with prior approaches on relatively clean datasets and then show that InfoGAN can learn interpretable representations on complex datasets where no previous unsupervised approach is known to learn representations of comparable quality.

2 Related Work (omitted)

Translator's note: the VAE is very important! The previous work was supervised, whereas InfoGAN is unsupervised and highly efficient.

3 Background: Generative Adversarial Networks

Goodfellow et al. [4] introduced the Generative Adversarial Networks (GAN), a framework for training deep generative models using a minimax game. The goal is to learn a generator distribution PG(x) that matches the real data distribution Pdata(x).

Instead of trying to explicitly assign probability to every x in the data distribution, GAN learns a generator network G that generates samples from the generator distribution PG by transforming a noise variable z ~ Pnoise(z) into a sample G(z).

This generator is trained by playing against an adversarial discriminator network D that aims to distinguish between samples from the true data distribution Pdata and the generator’s distribution PG. So for a given generator, the optimal discriminator is D(x) = Pdata(x)/(Pdata(x) + PG(x)). More formally, the minimax game is given by the following expression:

Translator's note: for an introduction to generative models, see our earlier articles.
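The expression itself did not survive the copy; reconstructed from the original paper (not from this article), the standard GAN minimax game is:

min_G max_D V(D, G) = E_{x~Pdata}[log D(x)] + E_{z~Pnoise}[log(1 - D(G(z)))]    (1)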

4 Mutual Information for Inducing Latent Codes

The GAN formulation uses a simple factored continuous input noise vector z, while imposing no restrictions on the manner in which the generator may use this noise. As a result, it is possible that the noise will be used by the generator in a highly entangled way, causing the individual dimensions of z to not correspond to semantic features of the data.

However, many domains naturally decompose into a set of semantically meaningful factors of variation. For instance, when generating images from the MNIST dataset, it would be ideal if the model automatically chose to allocate a discrete random variable to represent the numerical identity of the digit (0-9), and chose to have two additional continuous variables that represent the digit's angle and the thickness of the digit's stroke. It is the case that these attributes are both independent and salient, and it would be useful if we could recover these concepts without any supervision, by simply specifying that an MNIST digit is generated by an independent 1-of-10 variable and two independent continuous variables.

In this paper, rather than using a single unstructured noise vector, we propose to decompose the input noise vector into two parts: (i) z, which is treated as source of incompressible noise; (ii) c, which we will call the latent code and will target the salient structured semantic features of the data distribution.

Mathematically, we denote the set of structured latent variables by c1, c2, ..., cL. In its simplest form, we may assume a factored distribution, given by P(c1, c2, ..., cL) = ∏_{i=1}^{L} P(ci). For ease of notation, we will use latent codes c to denote the concatenation of all latent variables ci.
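As a concrete illustration of such a factored prior, here is a minimal sketch of my own (not code from the paper or its repository); the dimensions are an assumption that roughly follows the MNIST setup described above: one 1-of-10 categorical code, two continuous codes, plus incompressible noise z.

import numpy as np

def sample_latent(batch_size, noise_dim=62, n_categories=10, n_continuous=2):
    # Incompressible noise z.
    z = np.random.uniform(-1.0, 1.0, size=(batch_size, noise_dim))
    # Categorical code: 1-of-10, sampled uniformly and one-hot encoded.
    idx = np.random.randint(0, n_categories, size=batch_size)
    c_cat = np.eye(n_categories)[idx]
    # Continuous codes ~ Uniform(-1, 1), independent of each other (factored prior).
    c_cont = np.random.uniform(-1.0, 1.0, size=(batch_size, n_continuous))
    # The generator receives the concatenation, i.e. G(z, c).
    return np.concatenate([z, c_cat, c_cont], axis=1)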

We now propose a method for discovering these latent factors in an unsupervised way: we provide the generator network with both the incompressible noise z and the latent code c, so the form of the generator becomes G(z, c). However, in standard GAN, the generator is free to ignore the additional latent code c by finding a solution satisfying PG(x|c) = PG(x). To cope with the problem of trivial codes, we propose an information-theoretic regularization: there should be high mutual information between latent codes c and generator distribution G(z, c). Thus I(c; G(z, c)) should be high.

In information theory, mutual information between X and Y, I(X; Y), measures the “amount of information” learned from knowledge of random variable Y about the other random variable X. The mutual information can be expressed as the difference of two entropy terms:

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)    (2)

This definition has an intuitive interpretation: I(X; Y) is the reduction of uncertainty in X when Y is observed. If X and Y are independent, then I(X; Y) = 0, because knowing one variable reveals nothing about the other; by contrast, if X and Y are related by a deterministic, invertible function, then maximal mutual information is attained. This interpretation makes it easy to formulate a cost: given any x ~ PG(x), we want PG(c|x) to have a small entropy. In other words, the information in the latent code c should not be lost in the generation process. Similar mutual information inspired objectives have been considered before in the context of clustering [26–28]. Therefore, we propose to solve the following information-regularized minimax game:
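The game itself was lost in copying; reconstructed from the original paper, with λ the regularization weight, it reads:

min_G max_D V_I(D, G) = V(D, G) - λ I(c; G(z, c))    (3)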

5 Variational Mutual Information Maximization

In practice, the mutual information term I(c; G(z, c)) is hard to maximize directly as it requires access to the posterior P(c|x). Fortunately, we can obtain a lower bound of it by defining an auxiliary distribution Q(c|x) to approximate P(c|x).
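The resulting bound, reconstructed from the original paper because the formula did not survive copying, is:

I(c; G(z, c)) = E_{x~G(z,c)}[ DKL(P(·|x) ‖ Q(·|x)) + E_{c'~P(c|x)}[log Q(c'|x)] ] + H(c)
              ≥ E_{x~G(z,c)}[ E_{c'~P(c|x)}[log Q(c'|x)] ] + H(c)    (4)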

This technique of lower bounding mutual information is known as Variational Information Maximization [29]. We note in addition that the entropy of latent codes H(c) can be optimized over as well since for common distributions it has a simple analytical form. However, in this paper we opt for simplicity by fixing the latent code distribution and we will treat H(c) as a constant. So far we have bypassed the problem of having to compute the posterior P(c|x) explicitly via this lower bound but we still need to be able to sample from the posterior in the inner expectation. Next we state a simple lemma, with its proof deferred to Appendix, that removes the need to sample from the posterior.
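That lemma (Lemma 5.1 in the paper) lets the inner expectation be taken over c ~ P(c) directly, which yields the variational lower bound that is actually optimized (again reconstructed from the paper):

LI(G, Q) = E_{c~P(c), x~G(z,c)}[log Q(c|x)] + H(c) ≤ I(c; G(z, c))    (5)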

We note that LI(G, Q) is easy to approximate with Monte Carlo simulation. In particular, LI can be maximized w.r.t. Q directly and w.r.t. G via the reparametrization trick. Hence LI(G, Q) can be added to GAN’s objectives with no change to GAN’s training procedure and we call the resulting algorithm Information Maximizing Generative Adversarial Networks (InfoGAN).

Eq (4) shows that the lower bound becomes tight as the auxiliary distribution Q approaches the true posterior distribution: Ex[DKL(P(·|x) ‖ Q(·|x))] → 0. In addition, we know that when the variational lower bound attains its maximum LI(G, Q) = H(c) for discrete latent codes, the bound becomes tight and the maximal mutual information is achieved. In Appendix, we note how InfoGAN can be connected to the Wake-Sleep algorithm [30] to provide an alternative interpretation.

Hence, InfoGAN is defined as the following minimax game with a variational regularization of mutual information and a hyperparameter λ:
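The defining expression, reconstructed from the original paper, is:

min_{G,Q} max_D V_InfoGAN(D, G, Q) = V(D, G) - λ LI(G, Q)    (6)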

6 Implementation

In practice, we parametrize the auxiliary distribution Q as a neural network. In most experiments, Q and D share all convolutional layers and there is one final fully connected layer to output parameters for the conditional distribution Q(c|x), which means InfoGAN only adds a negligible computation cost to GAN. We have also observed that LI(G, Q) always converges faster than normal GAN objectives and hence InfoGAN essentially comes for free with GAN.
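A minimal sketch of this layer sharing, written by me in PyTorch as an illustration (not taken from the repository linked at the end); the layer sizes are arbitrary assumptions for 28x28 single-channel inputs:

import torch
import torch.nn as nn

class DQ(nn.Module):
    """Discriminator head D and recognition head Q sharing all convolutional layers."""
    def __init__(self, n_categories=10, n_continuous=2):
        super().__init__()
        # Shared trunk used by both D and Q (sizes assume 1x28x28 inputs).
        self.shared = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 1024), nn.LeakyReLU(0.1),
        )
        # D: a single real/fake logit.
        self.d_head = nn.Linear(1024, 1)
        # Q: one extra fully connected layer outputting the parameters of Q(c|x):
        # categorical logits plus mean and log-variance of a factored Gaussian
        # over the continuous codes.
        self.q_head = nn.Linear(1024, n_categories + 2 * n_continuous)

    def forward(self, x):
        h = self.shared(x)
        return self.d_head(h), self.q_head(h)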

For categorical latent code ci, we use the natural choice of softmax nonlinearity to represent Q(ci|x). For continuous latent code cj, there are more options depending on what is the true posterior P(cj|x). In our experiments, we have found that simply treating Q(cj|x) as a factored Gaussian is sufficient.
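A sketch of the corresponding Monte Carlo estimate of LI, again my own illustration that assumes the hypothetical DQ module above; the constant H(c) and the Gaussian normalization constant are dropped:

import torch.nn.functional as F

def mutual_info_lower_bound(q_out, c_cat_idx, c_cont, n_categories=10):
    """Monte Carlo estimate of E[log Q(c|x)] for one generated batch.

    q_out     : Q-head output for images generated from (z, c)
    c_cat_idx : indices of the categorical codes that were fed to G
    c_cont    : the continuous codes that were fed to G
    """
    logits = q_out[:, :n_categories]
    mu, log_var = q_out[:, n_categories:].chunk(2, dim=1)
    # Categorical part: softmax parametrizes Q(c_cat|x); cross_entropy returns
    # the mean negative log-likelihood of the true codes.
    log_q_cat = -F.cross_entropy(logits, c_cat_idx)
    # Continuous part: log-likelihood under a factored Gaussian (constants dropped).
    log_q_cont = (-0.5 * (log_var + (c_cont - mu) ** 2 / log_var.exp())).sum(dim=1).mean()
    # Maximize this bound, e.g. by adding -lambda * this value to the GAN losses.
    return log_q_cat + log_q_cont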

Even though InfoGAN introduces an extra hyperparameter λ, it is easy to tune and simply setting λ to 1 is sufficient for discrete latent codes. When the latent code contains continuous variables, a smaller λ is typically used to ensure that LI(G, Q), which now involves differential entropy, is on the same scale as the GAN objectives.

Since GAN is known to be difficult to train, we design our experiments based on existing techniques introduced by DC-GAN [19], which are enough to stabilize InfoGAN training, and we did not have to introduce new tricks. The detailed experimental setup is described in the Appendix.

Download the code; it runs successfully out of the box: https://github.com/openai/InfoGAN
