# [MachineLearning] Hyperparameters: Learning Rate

### Gradient Descent

• := denotes assignment.
• $J(\theta_{0},\theta_{1})$ is the cost function.
• $\alpha$ is the learning rate; it controls how large a step we take when updating the parameter $\theta_{j}$. When the partial-derivative term is 0, we have reached a (local) minimum and gradient descent stops moving. This is also why gradient descent can converge to a local minimum even when $\alpha$ is held fixed.
• Gradient descent updates $\theta_{0}$ and $\theta_{1}$ simultaneously.

SGD and minibatch SGD: Stochastic Gradient Descent (SGD) uses a single randomly chosen sample for each gradient computation, while minibatch SGD computes each gradient over a batch of batch_size samples.
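The difference can be sketched in plain Python on a toy 1-D least-squares problem (all names here are illustrative, not from any library; batch_size=1 reduces to pure SGD):

```python
import random

random.seed(0)  # make the shuffle reproducible

def sgd_epoch(w, xs, ys, lr, batch_size):
    """One epoch of minibatch SGD on loss = mean((w*x - y)^2)."""
    idx = list(range(len(xs)))
    random.shuffle(idx)  # visit samples in random order
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        # gradient of the mean squared error over this minibatch
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
        w -= lr * grad  # one parameter update per minibatch
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the true parameter w = 2
w = 0.0
for _ in range(50):
    w = sgd_epoch(w, xs, ys, lr=0.02, batch_size=2)
```

After enough epochs, w approaches the true value 2.0 even though each update only sees two of the four samples.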

### Learning rate

• When the LR is small, training is reliable: the parameters move step by step toward the local/global minimum, and the loss steadily decreases. The cost is slow progress and long training time.
• When the LR is large, updates can overshoot the minimum, and the loss oscillates up and down. In the worst case, training never reaches the minimum at all, or even jumps out of the current basin into a different descent region.

Andrew Ng suggests some practical LR values to try, such as 0.001, 0.003, 0.01, 0.03, 0.1, and so on.

1. A small LR gives a more precise result.
2. A large LR makes the loss fall faster.

Andrew Ng: In gradient descent, as we approach a local minimum, the algorithm automatically takes smaller steps. This is because the derivative is zero at a local minimum, so as we approach it the derivative automatically becomes smaller and smaller, and gradient descent takes correspondingly smaller steps. So there is actually no need to decrease α separately. That is the gradient descent algorithm, and you can use it to minimize any cost function, not just the cost function J of linear regression.
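This point, and the oscillation problem above, can be checked numerically on $J(\theta)=\theta^2$. A minimal sketch (illustrative code, not from the course):

```python
def gradient_descent(theta, alpha, iterations):
    """Run plain gradient descent on J(theta) = theta^2 with a FIXED alpha."""
    steps = []
    for _ in range(iterations):
        grad = 2 * theta        # J'(theta) for J(theta) = theta^2
        step = alpha * grad     # the update; shrinks as the gradient shrinks
        theta = theta - step
        steps.append(abs(step))
    return theta, steps

# Fixed alpha = 0.1: the step sizes shrink on their own near the minimum.
theta, steps = gradient_descent(theta=5.0, alpha=0.1, iterations=30)

# Too-large alpha = 1.1: every step overshoots and |theta| grows instead.
theta_big, _ = gradient_descent(theta=5.0, alpha=1.1, iterations=30)
```

With alpha = 0.1, every step is smaller than the previous one even though alpha never changes; with alpha = 1.1, the iterates diverge.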

### Decay Learning Rate

#### exponential_decay

Exponential decay is the most commonly used LR decay schedule.

```python
tf.train.exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
```

• learning_rate: the initial LR.
• global_step: the current step, used to compute the decay.
• decay_steps: the decay period; with staircase=True, the LR decays once every decay_steps steps.
• decay_rate: the decay factor; the LR is multiplied by decay_rate^(global_step / decay_steps).
• staircase: if True, decay in discrete jumps rather than continuously.

```
decayed_learning_rate = learning_rate *
                        decay_rate ^ (global_step / decay_steps)
```
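The formula can be reproduced in plain Python (a sketch of the math, not the TensorFlow implementation):

```python
import math

def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """decayed_lr = lr * decay_rate ^ (global_step / decay_steps)."""
    p = global_step / decay_steps
    if staircase:
        p = math.floor(p)  # decay in discrete jumps, once per decay_steps
    return learning_rate * decay_rate ** p

# Mirrors the TF example below: decay by 0.96 every 100000 steps.
lr_start = exponential_decay(0.1, 0, 100000, 0.96, staircase=True)       # 0.1
lr_after = exponential_decay(0.1, 100000, 100000, 0.96, staircase=True)  # 0.1 * 0.96
```

With staircase=True, the LR is constant within each period of 100000 steps, then drops in one jump.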

```python
...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)
# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
```

#### inverse_time_decay

```python
tf.train.inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
```

```
decayed_learning_rate = learning_rate / (1 + decay_rate * global_step / decay_steps)
```
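In plain Python (a sketch of the math, not the TF implementation):

```python
import math

def inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate,
                       staircase=False):
    """decayed_lr = lr / (1 + decay_rate * global_step / decay_steps)."""
    p = global_step / decay_steps
    if staircase:
        p = math.floor(p)
    return learning_rate / (1 + decay_rate * p)

# With decay_steps=1 and decay_rate=0.5, two steps halve the rate:
lr = inverse_time_decay(0.1, 2, 1, 0.5)  # 0.1 / (1 + 0.5 * 2) = 0.05
```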

```python
...
global_step = tf.Variable(0, trainable=False)
initial_learning_rate = 0.1
decay_steps = 1.0
k = 0.5  # decay rate
learning_rate = tf.train.inverse_time_decay(initial_learning_rate, global_step,
                                            decay_steps, k)

# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
```

#### natural_exp_decay

```python
tf.train.natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
```

```
decayed_learning_rate = learning_rate * exp(-decay_rate * global_step)
```
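In plain Python (again a sketch of the formula above, not the TF implementation):

```python
import math

def natural_exp_decay(learning_rate, global_step, decay_rate):
    """decayed_lr = lr * exp(-decay_rate * global_step)."""
    return learning_rate * math.exp(-decay_rate * global_step)

# With decay_rate=0.5, two steps multiply the LR by e^-1:
lr = natural_exp_decay(0.1, 2, 0.5)  # 0.1 * e^-1 ≈ 0.0368
```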

```python
...
global_step = tf.Variable(0, trainable=False)
initial_learning_rate = 0.1
decay_steps = 1.0
k = 0.5  # decay rate
learning_rate = tf.train.natural_exp_decay(initial_learning_rate, global_step,
                                           decay_steps, k)

# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
```

#### piecewise_constant

```python
tf.train.piecewise_constant(x, boundaries, values, name=None)
```

```python
global_step = tf.Variable(0, trainable=False)
boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)

# Later, whenever we perform an optimization step, we increment global_step.
```
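The lookup itself is simple; a plain-Python sketch of the schedule's semantics (not the TF implementation; values must be one element longer than boundaries):

```python
def piecewise_constant(step, boundaries, values):
    """Return values[i] for the first boundary with step <= boundaries[i],
    or values[-1] once step is past every boundary."""
    for boundary, value in zip(boundaries, values):
        if step <= boundary:
            return value
    return values[-1]

# LR 1.0 for the first 100k steps, 0.5 until step 110k, then 0.1:
boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
```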

#### polynomial_decay

```python
tf.train.polynomial_decay(learning_rate, global_step, decay_steps, end_learning_rate=0.0001, power=1.0, cycle=False, name=None)
```

polynomial_decay decays the learning rate polynomially.

The function returns the decayed learning rate. It is computed as:

```
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) +
                        end_learning_rate
```

If cycle is True, then a multiple of decay_steps is used: the first one that is bigger than global_step.

```
decay_steps = decay_steps * ceil(global_step / decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) +
                        end_learning_rate
```
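Both branches can be sketched in plain Python (a sketch of the math, not the TF implementation):

```python
import math

def polynomial_decay(learning_rate, global_step, decay_steps,
                     end_learning_rate=0.0001, power=1.0, cycle=False):
    if cycle:
        # Stretch decay_steps to the first multiple past global_step.
        decay_steps = decay_steps * max(1, math.ceil(global_step / decay_steps))
    else:
        # Without cycling, the LR stays at end_learning_rate past decay_steps.
        global_step = min(global_step, decay_steps)
    frac = 1 - global_step / decay_steps
    return (learning_rate - end_learning_rate) * frac ** power + end_learning_rate

# Halfway through a linear (power=1) decay from 0.1 to 0.01:
lr_mid = polynomial_decay(0.1, 5000, 10000, 0.01, power=1.0)  # 0.055
```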

```python
...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
end_learning_rate = 0.01
decay_steps = 10000
learning_rate = tf.train.polynomial_decay(starter_learning_rate, global_step,
                                          decay_steps, end_learning_rate,
                                          power=0.5)
# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
```

With cycle=False: power=1 gives linear decay, power=0.5 gives square-root decay, and power=2 gives quadratic decay (the red, blue, and green curves in the original plot, respectively).

### Experience

What is the lowest loss value that can be reached?

Q: Hi, I have trained a yolo-small model to step 4648, but most of the loss values are greater than 1.0, and the test results are not very good. I want to know how low the loss can get, and could you please share some key training parameters, e.g. learning rate, training time, final loss value, and so on?

A: What batch size are you using? Without the batch size, the step number says nothing about how far you've gone. According to the author of YOLO, he used a pretty powerful machine, and training has two stages, with the first stage (training the convolutional layers with average pooling) taking about a week. So be patient if you're not far from the beginning.

Training a deep net is more of an art than a science. My suggestion is to first train your model on a small dataset to see whether it is able to overfit the training set; if not, there is a problem to solve before proceeding. Note that due to the data augmentation built into the code, you can't really reach 0.0 for the loss. I've trained a few configs with my code and the loss can shrink down well from > 10.0 to around 0.5 or below (the parameters C, B, S are not relevant since the loss is averaged across the output tensor).

I usually start with the default learning rate 1e-5, and batch size 16 or even 8 to speed up the loss at first, until it stops decreasing and seems unstable. Then the learning rate is decreased to 1e-6 and the batch size increased to 32 and 64 whenever I feel the loss is stuck (and testing still does not give good results). You can switch to another adaptive learning-rate training algorithm (e.g. Adadelta, Adam, etc.) if you are familiar with them, by editing ./yolo/train.py/yolo_loss(). You can also look at the learning-rate policy the YOLO author used, inside the .cfg files. Best of luck.
