# [MachineLearning] Hyperparameters: Learning Rate

### Gradient Descent

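• := denotes assignment.
• $J(\theta_{0},\theta_{1})$ is the cost function.
• $\alpha$ is the learning rate; it controls how large a step we take when updating the parameter $\theta_{j}$. When the partial derivative is 0, a minimum (more generally, a stationary point) has been reached and the parameters stop changing. This also shows that gradient descent can converge to a local minimum even when $\alpha$ is held fixed.
• Gradient descent updates $\theta_{0}$ and $\theta_{1}$ simultaneously.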
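The bullets above describe the standard update rule $\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta_{0},\theta_{1})$. A minimal sketch of it in plain Python follows, assuming a squared-error cost for one-variable linear regression; the function names, toy data, and default values are illustrative and not from the original notes.

```python
def gradient_descent_step(theta0, theta1, xs, ys, alpha):
    """One simultaneous update for the hypothesis h(x) = theta0 + theta1 * x."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # Partial derivatives of J = (1 / (2m)) * sum(error ** 2).
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    # Both gradients were computed from the old parameters,
    # so returning the pair updates theta0 and theta1 simultaneously.
    return theta0 - alpha * grad0, theta1 - alpha * grad1


def fit(xs, ys, alpha=0.1, steps=2000):
    """Run gradient descent with a fixed learning rate alpha."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        theta0, theta1 = gradient_descent_step(theta0, theta1, xs, ys, alpha)
    return theta0, theta1


# Toy data generated from y = 1 + 2x (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit(xs, ys)
```

Note that $\alpha$ stays fixed for the whole run, yet the parameters still settle near the values that generated the data.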

SGD and minibatch-SGD:

• Stochastic Gradient Descent (SGD) computes each update from a single random sample.
• minibatch-SGD computes each gradient from batch-size samples at a time.
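The difference between the two can be sketched as follows. This is a toy illustration, assuming a one-parameter model y = theta * x with squared error; the data, function names, and hyperparameters are made up for the example. The only thing that changes between SGD and minibatch-SGD is how many samples feed each gradient.

```python
import random


def sample_batch(data, batch_size, rng):
    """Draw batch_size distinct samples at random; batch_size=1 is plain SGD."""
    return rng.sample(data, batch_size)


def batch_gradient(theta, batch):
    """Gradient of the mean of 0.5 * (theta * x - y) ** 2 over the batch."""
    return sum((theta * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, alpha=0.05, steps=500, seed=0):
    rng = random.Random(seed)  # seeded so the run is reproducible
    theta = 0.0
    for _ in range(steps):
        batch = sample_batch(data, batch_size, rng)
        theta -= alpha * batch_gradient(theta, batch)
    return theta


# Noise-free toy data from y = 3x.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
theta_sgd = train(data, batch_size=1)        # one random sample per step
theta_minibatch = train(data, batch_size=3)  # three random samples per step
```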

### Learning rate

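• When the LR is very small, training is reliable: each step moves steadily toward the (local or global) minimum, and the computed loss keeps shrinking. The cost is that descent is slow and training takes a long time.
• When the LR is very large, updates overshoot the (local or global) minimum, and the loss oscillates up and down. In the worst case, training may never reach the minimum, or may even jump out of the current basin into another descent region.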
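The two regimes above can be reproduced on a toy problem. The sketch below assumes the cost f(x) = x ** 2, whose minimum is at x = 0; the function name and the three LR values are chosen only for illustration.

```python
def run_gradient_descent(lr, x0=1.0, steps=20):
    """Minimize f(x) = x ** 2 (gradient 2x); return x after every step."""
    x = x0
    trajectory = []
    for _ in range(steps):
        x -= lr * 2 * x
        trajectory.append(x)
    return trajectory


small = run_gradient_descent(lr=0.01)     # creeps toward 0 without ever crossing it
large = run_gradient_descent(lr=0.9)      # jumps across the minimum on every step
too_large = run_gradient_descent(lr=1.1)  # each jump lands farther away: divergence
```

With `lr=0.01` the iterate is still far from the minimum after 20 steps; with `lr=0.9` it flips sign on every step; with `lr=1.1` its magnitude grows instead of shrinking.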

Andrew Ng suggests some practical LR values to try, such as 0.001, 0.003, 0.01, 0.03, 0.1, and so on.

1. A small LR gives more precise convergence.
2. A large LR makes the loss drop faster.

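Andrew: In gradient descent, as we approach a local minimum, the algorithm automatically takes smaller steps. This is because the derivative is zero at the local minimum, so its magnitude automatically gets smaller and smaller as we approach it. Gradient descent therefore takes smaller steps on its own, and there is no need to decrease α over time. You can use this algorithm to minimize any cost function, not only the cost function J from linear regression.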
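As a worked illustration of this argument, take the one-parameter cost $J(\theta) = \theta^{2}$ (an assumed toy example, not from the lecture). With a fixed $\alpha$, the update and the size of each step are:

$$
\theta_{t+1} = \theta_{t} - \alpha \cdot 2\theta_{t} = (1 - 2\alpha)\,\theta_{t},
\qquad
|\theta_{t+1} - \theta_{t}| = 2\alpha\,|\theta_{t}| = 2\alpha\,|1 - 2\alpha|^{t}\,|\theta_{0}|
$$

For $0 < \alpha < 1$ we have $|1 - 2\alpha| < 1$, so the step size shrinks geometrically even though $\alpha$ never changes. With $\alpha = 0.1$ and $\theta_{0} = 1$, the first three steps have sizes 0.2, 0.16, and 0.128.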

### Decay Learning Rate

#### exponential_decay

Exponential decay is the most commonly used LR decay method.

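```python
tf.train.exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
```

• learning_rate: the initial LR.
• global_step: the current training step, used to compute the decay.
• decay_steps: the decay period; the LR shrinks by a factor of decay_rate over every decay_steps steps.
• decay_rate: the decay factor that the LR is multiplied by once per decay period.
• staircase: if True, global_step / decay_steps is an integer division, so the LR decays in discrete, stair-like steps.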

```
decayed_learning_rate = learning_rate *
                        decay_rate ^ (global_step / decay_steps)
```

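```python
import tensorflow as tf  # TensorFlow 1.x API

...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
# Decay the LR by a factor of 0.96 every 100000 steps.
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)
# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
```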
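To see what the formula produces, here is a plain-Python restatement of it. This is a sketch of the documented formula, not TensorFlow's implementation; the function below is a standalone helper written for this note and needs no TensorFlow install.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """learning_rate * decay_rate ** (global_step / decay_steps)."""
    if staircase:
        # Integer division: the exponent only changes every decay_steps steps.
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return learning_rate * decay_rate ** exponent


# Same settings as the example above: start at 0.1, factor 0.96 per 100000 steps.
lr_before_boundary = exponential_decay(0.1, 99999, 100000, 0.96, staircase=True)
lr_after_boundary = exponential_decay(0.1, 100000, 100000, 0.96, staircase=True)
lr_smooth_halfway = exponential_decay(0.1, 50000, 100000, 0.96)
```

With `staircase=True` the LR stays at its initial value until step 100000 and then drops in one jump; with `staircase=False` it has already decayed partway at step 50000.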

#### inverse_time_decay

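```python
tf.train.inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
```

```
decayed_learning_rate = learning_rate / (1 + decay_rate * global_step / decay_steps)
```

With decay_steps = 1 this reduces to learning_rate / (1 + decay_rate * t), where t is global_step.

```python
import tensorflow as tf  # TensorFlow 1.x API

...
global_step = tf.Variable(0, trainable=False)
learning_rate = 0.1
decay_steps = 1.0
k = 0.5  # decay_rate
learning_rate = tf.train.inverse_time_decay(learning_rate, global_step,
                                            decay_steps, k)

# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
```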
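A plain-Python restatement of the inverse-time formula makes the shape of the schedule easy to inspect. This is a sketch based on the signature shown above, not TensorFlow's code; the helper name simply mirrors the API for readability.

```python
import math


def inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate,
                       staircase=False):
    """learning_rate / (1 + decay_rate * global_step / decay_steps)."""
    progress = global_step / decay_steps
    if staircase:
        # Only whole decay periods count toward the decay.
        progress = math.floor(progress)
    return learning_rate / (1 + decay_rate * progress)


# With decay_steps=1 and decay_rate=0.5 the LR is 0.1 / (1 + 0.5 * t).
schedule = [inverse_time_decay(0.1, t, 1, 0.5) for t in range(5)]
```

Unlike exponential decay, this schedule falls quickly at first and then flattens: the LR is halved after 2 steps but needs 6 steps to reach a quarter of its initial value.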

#### natural_exp_decay

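```python
tf.train.natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
```

```
decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)
```

With decay_steps = 1 this reduces to learning_rate * exp(-decay_rate * global_step).

```python
import tensorflow as tf  # TensorFlow 1.x API

...
global_step = tf.Variable(0, trainable=False)
learning_rate = 0.1
decay_steps = 1
k = 0.5  # decay_rate
learning_rate = tf.train.natural_exp_decay(learning_rate, global_step,
                                           decay_steps, k)

# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
```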
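The natural-exponential formula can likewise be restated in plain Python. The helper below is a sketch written for this note rather than TensorFlow's implementation. It also shows that the schedule is ordinary exponential decay with a base of exp(-decay_rate).

```python
import math


def natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """learning_rate * exp(-decay_rate * global_step / decay_steps)."""
    progress = global_step / decay_steps
    if staircase:
        # Only whole decay periods count toward the decay.
        progress = math.floor(progress)
    return learning_rate * math.exp(-decay_rate * progress)


lr_at_step_2 = natural_exp_decay(0.1, 2, 1, 0.5)
# Same value written as exponential decay with base exp(-0.5).
equivalent = 0.1 * math.exp(-0.5) ** 2
```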

#### piecewise_constant

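```python
tf.train.piecewise_constant(x, boundaries, values, name=None)
```

The example below uses an LR of 1.0 for the first 100000 steps, 0.5 for steps 100001 to 110000, and 0.1 for any additional steps. Note that values must have one more element than boundaries.

```python
import tensorflow as tf  # TensorFlow 1.x API

global_step = tf.Variable(0, trainable=False)
boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)

# Later, whenever we perform an optimization step, we increment global_step.
```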
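The lookup that a piecewise-constant schedule performs can be sketched with the standard library. This is an illustrative reimplementation, not TensorFlow's code, and it assumes that a step exactly equal to a boundary still belongs to the earlier interval.

```python
from bisect import bisect_left


def piecewise_constant(x, boundaries, values):
    """Return values[i], where i is the number of boundaries strictly below x."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have one more element than boundaries")
    return values[bisect_left(boundaries, x)]


boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
lrs = [piecewise_constant(step, boundaries, values)
       for step in (0, 100000, 100001, 110000, 110001)]
```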

#### polynomial_decay

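```python
tf.train.polynomial_decay(learning_rate, global_step, decay_steps, end_learning_rate=0.0001, power=1.0, cycle=False, name=None)
```

polynomial_decay decays the learning rate polynomially, from learning_rate down to end_learning_rate over decay_steps steps.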

The function returns the decayed learning rate. It is computed as:

```
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) +
                        end_learning_rate
```

If cycle is True, then a multiple of decay_steps is used: the first one that is bigger than global_step.

```
decay_steps = decay_steps * ceil(global_step / decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) +
                        end_learning_rate
```

```python
import tensorflow as tf  # TensorFlow 1.x API

...
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
end_learning_rate = 0.01
decay_steps = 10000
# Decay from 0.1 to 0.01 over 10000 steps with a square-root schedule.
learning_rate = tf.train.polynomial_decay(starter_learning_rate, global_step,
                                          decay_steps, end_learning_rate,
                                          power=0.5)
# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)
```

With cycle=False, the red line is power=1 (linear decay), the blue line is power=0.5 (square-root decay), and the green line is power=2 (quadratic decay).
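Both branches of the polynomial formula can be restated in plain Python to check individual values. This is a sketch of the documented formula, not TensorFlow's implementation; in particular, the `max(1, ...)` guard that avoids a zero horizon at step 0 when cycle is True is a choice made for this sketch.

```python
import math


def polynomial_decay(learning_rate, global_step, decay_steps,
                     end_learning_rate=0.0001, power=1.0, cycle=False):
    """Decay from learning_rate to end_learning_rate along a polynomial curve."""
    if cycle:
        # Stretch the horizon to the first multiple of decay_steps
        # that is not smaller than global_step.
        multiplier = max(1, math.ceil(global_step / decay_steps))
        decay_steps = decay_steps * multiplier
    else:
        # Past decay_steps the LR stays at end_learning_rate.
        global_step = min(global_step, decay_steps)
    remaining = 1 - global_step / decay_steps
    return ((learning_rate - end_learning_rate) * remaining ** power
            + end_learning_rate)


# Same settings as the example above: 0.1 -> 0.01 over 10000 steps, power=0.5.
lr_halfway = polynomial_decay(0.1, 5000, 10000, 0.01, power=0.5)
lr_after_end = polynomial_decay(0.1, 20000, 10000, 0.01, power=0.5)
# With cycle=True, step 15000 is measured against a horizon of 20000 steps.
lr_cycled = polynomial_decay(0.1, 15000, 10000, 0.01, power=1.0, cycle=True)
```

With cycle=False the LR is pinned at end_learning_rate once decay_steps is passed, whereas cycle=True restarts the decay from a higher value at each multiple of decay_steps.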

### Experience

what is the lowest loss value can reach?

Q: hi, I have trained a yolo-small model to step 4648, but most of loss values are greater than 1.0, and the result of test is not very well. I want to know how well can loss value be, and could you please show some key parameters when training, e.g learning rate, training time, the final loss value, and so on.

A: What batch size are you using? Because without the batch size, step number cannot say anything about how far you’ve gone. According to the author of YOLO, he used pretty powerful machine and the training have two stages with the first stage (training convolution layer with average pool) takes about a week. So you should be patient if you’re not that far from the beginning.

Training deep net is more of an art than science. So my suggestion is you first train your model on a small data size first to see if the model is able to overfit over training set, if not then there’s a problem to solve before proceeding. Notice due to data augmentation built in the code, you can’t really reach 0.0 for the loss.

I’ve trained a few configs on my code and the loss can shrink down well from > 10.0 to around 0.5 or below (parameters C, B, S are not relevant since the loss is averaged across the output tensor). I usually start with default learning rate 1e-5, and batch size 16 or even 8 to speed up the loss first until it stops decreasing and seem to be unstable. Then, learning rate will be decreased down to 1e-6 and batch size increase to 32 and 64 whenever I feel that the loss get stuck (and testing still does not give good result).

You can switch to other adaptive learning rate training algorithm (e.g. Adadelta, Adam, etc) if you feel like familiar with them by editing ./yolo/train.py/yolo_loss(). You can also look at the learning rate policy the YOLO author used, inside .cfg files. Best of luck

