Why does the tf.contrib.layers.instance_norm layer contain a StopGradient op? In other words, why is it needed?

A StopGradient also appears in the simpler op tf.nn.moments (which can serve as a building block of tf.contrib.layers.instance_norm):
x_m, x_v = tf.nn.moments(x, [1, 2], keep_dims=True)

I also found a comment about StopGradient in the tf.nn.moments source code:
# The dynamic range of fp16 is too limited to support the collection of
# sufficient statistics. As a workaround we simply perform the operations
# on 32-bit floats before converting the mean and variance back to fp16
y = math_ops.cast(x, dtypes.float32) if x.dtype == dtypes.float16 else x
# Compute true mean while keeping the dims for proper broadcasting.
mean = math_ops.reduce_mean(y, axes, keepdims=True, name="mean")
# sample variance, not unbiased variance
# Note: stop_gradient does not change the gradient that gets
# backpropagated to the mean from the variance calculation,
# because that gradient is zero
variance = math_ops.reduce_mean(
    math_ops.squared_difference(y, array_ops.stop_gradient(mean)),
    axes,
    keepdims=True,
    name="variance")

So is this just an optimization, since that gradient is always zero?
Posted on 2020-11-11 08:35:30
An attempt at an answer.
This design tells us that when minimizing the second moment, we do not want to propagate the gradient through the first moment. Does that make sense? If we try to minimize E[x^2] - E[x]^2, we would be minimizing E[x^2] while maximizing E[x]^2. The first term reduces the absolute value of each element (pulling them toward the center). The second term's gradient would increase all values uniformly, which does not help minimize the variance but could negatively affect other gradient paths.
So we do not propagate the second moment's gradient through the first moment, because that gradient does not affect the second moment, at least when using plain SGD.
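The shift argument can be checked directly: the gradient component flowing through the E[x]^2 term is the same for every element, i.e. a uniform shift of the whole tensor, and a uniform shift leaves the variance unchanged. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=6)

# The decomposition var(x) = E[x^2] - E[x]^2 holds numerically.
var_decomposed = np.mean(x ** 2) - np.mean(x) ** 2
print(np.isclose(var_decomposed, np.var(x)))  # True

# The gradient of the -E[x]^2 term is the same constant for every
# element, a uniform shift of the whole tensor. Shifting all
# elements by the same constant does not change the variance, so
# this gradient path cannot help minimize it.
shift = 0.3
print(np.isclose(np.var(x + shift), np.var(x)))  # True
```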
https://stackoverflow.com/questions/64776769