The Meaning and Implementation of Dropout (Inverted Dropout)

Steve Wang, published 2019-05-26

Implementing Dropout (Inverted Dropout)

Organized and translated from Andrew Ng's deep learning course video: https://mooc.study.163.com/learn/2001281003?tid=2001391036#/learn/content?type=detail&id=2001702117&cid=2001694284

The most commonly used form of dropout today is inverted dropout: neurons are randomly deactivated, which reduces the complexity of the hidden-layer structure so that the network degenerates into a relatively simpler one, thereby reducing high variance.

We use layer l = 3 to illustrate.

Take keep_prob = 0.8. This means each neuron has an 80% chance of being kept and a 20% chance of being eliminated.

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob generates a mask in which every entry whose random value is less than keep_prob becomes 1 (that neuron is kept) and every other entry becomes 0 (that neuron is dropped).

a3 = np.multiply(a3, d3) multiplies element-wise by the mask, zeroing out the activations of the dropped neurons.

a3 /= keep_prob Dividing by keep_prob roughly corrects for, or compensates, the 20% of activations you dropped, so that the expected value of a3 stays at the same level. This also makes things a bit easier at test time, because you have not added an extra scaling problem.
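Putting the three lines above together, here is a minimal, self-contained sketch of inverted dropout applied to the layer-3 activations during training. The seed, the shape of a3 and its values are made up purely for illustration; only the masking and rescaling steps come from the lines above.

```python
import numpy as np

np.random.seed(0)                 # fixed seed so the illustration is reproducible

keep_prob = 0.8                   # probability that a given neuron is kept
a3 = np.random.randn(5, 10)       # hypothetical activations of layer 3: 5 units, 10 examples

# Mask: entries are True (1) with probability keep_prob, False (0) otherwise
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

a3 = np.multiply(a3, d3)          # zero out the dropped neurons
a3 /= keep_prob                   # inverted dropout: rescale so the expected value of a3 is unchanged
```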

At test time we do not apply dropout; there is no need to randomize your output. If you use dropout at test time, it only adds noise to your predictions. In theory, you could run the randomized dropout many times with different sets of hidden units and average the results, but that would only give you roughly the same result as not using dropout at all, while wasting computation.
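For contrast, here is a minimal sketch of what the same layer looks like at test time. The weights W3, bias b3, previous-layer activations a2 and the relu helper are hypothetical placeholders; the point is only that there is no mask and no division by keep_prob.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hypothetical parameters and previous-layer activations, for illustration only
np.random.seed(1)
W3 = np.random.randn(5, 4)
b3 = np.zeros((5, 1))
a2 = np.random.randn(4, 10)

# Test-time forward pass for layer 3: no dropout mask and no division by keep_prob,
# because the rescaling during training already kept the expected activations consistent.
a3_test = relu(np.dot(W3, a2) + b3)
```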

You can also vary keep_prob by layer: for layers with larger weight matrices, where you are more worried about overfitting, you can set keep_prob to be smaller. It's kind of like cranking up the regularization parameter lambda for L2 regularization, where you try to regularize some layers more than others. Technically, you can also apply dropout to the input layer, where you have some chance of just zeroing out one or more of the input features, although in practice you usually don't do that very often. So a keep_prob of 1.0 is quite common for the input layer. You could also use a very high value, maybe 0.9, but it's much less likely that you want to eliminate half of the input features. So for the input layer, if you apply dropout there at all, keep_prob would usually be a number close to one.

The downside is that this gives you even more hyperparameters to search over using cross-validation. One alternative is to have some layers where you apply dropout and some layers where you don't, and then have just one hyperparameter: the keep_prob for the layers where you do apply dropout.
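As a rough sketch of that alternative, you might keep one keep_prob value per layer and treat 1.0 as "no dropout for this layer". The layer names and values below are assumptions for illustration, not part of the original text.

```python
import numpy as np

# One keep_prob per layer; 1.0 simply disables dropout for that layer.
# Input and output layers keep everything, while the large hidden layer
# in the middle is regularized most aggressively.
keep_probs = {"input": 1.0, "hidden1": 0.8, "hidden2": 0.5, "hidden3": 0.8, "output": 1.0}

def apply_dropout(a, keep_prob):
    """Apply inverted dropout to activations a; a no-op when keep_prob is 1.0."""
    if keep_prob >= 1.0:
        return a
    mask = np.random.rand(*a.shape) < keep_prob
    return (a * mask) / keep_prob
```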

But really the thing to remember is that dropout is a regularization technique; it helps prevent overfitting. So unless my algorithm is overfitting, I wouldn't actually bother to use dropout, and it is used somewhat less often in other application areas. It's mainly in computer vision, where you usually just don't have enough data, so you're almost always overfitting, which is why some computer vision researchers swear by dropout. But that intuition doesn't always generalize, I think, to other disciplines.

One big downside of dropout is that the cost function J is no longer well-defined. On every iteration you are randomly killing off a bunch of nodes, so if you are double-checking the performance of gradient descent, it's actually harder to do that, because the cost function J you are optimizing is less well-defined, or at least harder to calculate, and you lose the debugging tool of plotting J against the iteration number. So what I usually do is turn off dropout by setting keep_prob equal to one, run my code and make sure that J is monotonically decreasing, and then turn dropout back on and hope that I didn't introduce any bugs into my code while adding dropout.
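A minimal sketch of that debugging routine, assuming a hypothetical train_step(keep_prob) function that runs one iteration of gradient descent and returns the cost J:

```python
def check_then_train(train_step, num_iters=100):
    """Debug with dropout disabled first, then train with dropout turned back on."""
    # 1. Turn dropout off (keep_prob = 1.0) and check that J decreases monotonically.
    costs = [train_step(keep_prob=1.0) for _ in range(num_iters)]
    assert all(c1 >= c2 for c1, c2 in zip(costs, costs[1:])), \
        "J is not monotonically decreasing; fix that before enabling dropout"

    # 2. Turn dropout back on for the real training run.
    for _ in range(num_iters):
        train_step(keep_prob=0.8)
```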
