# batch and minibatch

In practice, we can approximate the expectations that define the objective and its gradient by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.
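As a quick sketch of this idea (the toy linear model, seed, and batch size here are invented for illustration), the minibatch average of per-example gradients is an unbiased estimate of the full-dataset gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y ≈ 3x plus noise; we want the gradient of the mean squared loss.
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

def grad_mse(w, xs, ys):
    # Gradient of 0.5 * (x*w - y)^2, averaged over the given examples.
    return np.mean((xs * w - ys) * xs)

w = 0.0
full_grad = grad_mse(w, X, y)                      # expectation over the whole set
idx = rng.choice(len(X), size=32, replace=False)   # random minibatch
mini_grad = grad_mse(w, X[idx], y[idx])            # noisy estimate of full_grad

print(full_grad, mini_grad)  # the minibatch value fluctuates around the full one
```

The minibatch estimate scatters around the full-batch value, with variance shrinking as the batch size grows.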

Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all of the training examples simultaneously in a large batch.

Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods.

## minibatch SGD

SGD and batch GD are two extremes: one computes the gradient from a single training example at a time, the other from the entire training set. Batch GD is too expensive to compute, while SGD is too noisy, so a compromise between the two is needed.

Most algorithms fall somewhere in between, using more than one but fewer than all of the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to simply call them stochastic methods.

It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other.
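A minimal minibatch SGD loop in this spirit (the model, learning rate, and batch size are illustrative assumptions): reshuffling the data each epoch keeps subsequent minibatches approximately independent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = X @ np.array([3.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(1)
lr, batch_size = 0.1, 32

for epoch in range(5):
    # Reshuffle every epoch so successive minibatches are (approximately)
    # independent random samples of the training set.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # Minibatch gradient of the mean squared loss 0.5 * (x·w - y)^2.
        grad = (X[idx] * (X[idx] @ w - y[idx])[:, None]).mean(axis=0)
        w -= lr * grad

print(w)  # should be close to the true weight, 3.0
```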

# SGD

SGD and its variants are the most commonly used first-order optimization algorithms. In more detail:

## momentum

One problem with SGD is that learning can be slow, so momentum is used to accelerate the learning process.

The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients.

How momentum works:

The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.
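A sketch of that update rule (the variable names and toy quadratic objective are my own): the velocity `v` is the exponentially decaying average, and the parameters move along it.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    # v is an exponentially decaying moving average of past (scaled) gradients;
    # alpha controls how quickly old contributions decay.
    v = alpha * v - lr * grad
    return w + v, v

# Minimize f(w) = 0.5 * w^2, whose gradient is simply w.
w, v = np.array([5.0]), np.array([0.0])
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)
print(w)  # moves toward the minimum at 0
```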

A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients exponentially decay.

The size of each step depends on:

• how large a sequence of gradients is;
• how aligned a sequence of gradients is.

When successive gradients all equal the same g, the terminal step size with momentum is `step_size_with_momentum = lr · ‖g‖ / (1 − α)`; a typical value of α is 0.9.
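A quick numeric check of that terminal step size, with made-up values: under a constant gradient, the velocity converges geometrically to lr · g / (1 − α).

```python
alpha, lr = 0.9, 0.1   # illustrative values
g = 2.0                # constant gradient magnitude
v = 0.0
for _ in range(500):
    v = alpha * v - lr * g   # velocity converges to -lr * g / (1 - alpha)
print(abs(v), lr * g / (1 - alpha))  # both are 2.0 (up to rounding)
```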

CS231n also gives a more intuitive explanation of this.

## Nesterov momentum

Nesterov momentum evaluates the gradient directly at the look-ahead position, i.e. at the point the accumulated velocity is about to carry the parameters to, rather than at the current position.
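In update-rule form (a sketch; the names and toy objective are mine), the only change from standard momentum is where the gradient is evaluated:

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    g = grad_fn(w + alpha * v)   # gradient at the look-ahead point, not at w
    v = alpha * v - lr * g
    return w + v, v

# Same toy quadratic f(w) = 0.5 * w^2 as before.
w, v = np.array([5.0]), np.array([0.0])
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x)
print(w)  # approaches the minimum at 0
```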

# AdaGrad

The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values.

The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.

However, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate.
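The AdaGrad update described above can be sketched as follows (the toy problem and hyperparameters are assumed): the accumulator `r` only ever grows, which is exactly what causes the premature decay.

```python
import numpy as np

def adagrad_step(w, r, grad, lr=0.5, eps=1e-8):
    # r is the sum of all historical squared gradients; it never shrinks,
    # so the effective learning rate lr / sqrt(r) decays monotonically.
    r = r + grad ** 2
    w = w - lr * grad / (np.sqrt(r) + eps)
    return w, r

# Quadratic bowl with one steep and one shallow direction.
w = np.array([5.0, 5.0])
r = np.zeros(2)
for _ in range(500):
    grad = np.array([10.0, 0.1]) * w
    w, r = adagrad_step(w, r, grad)
print(w)  # both coordinates approach 0 despite very different curvatures
```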

# RMSprop

RMSProp is derived from AdaGrad and is likewise an adaptive algorithm: it replaces AdaGrad's gradient accumulation with an exponentially weighted moving average.

1. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.
2. RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl.
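A sketch of the RMSProp accumulator (the decay rate `rho` and toy objective are illustrative); the only change from AdaGrad is the exponentially weighted average:

```python
import numpy as np

def rmsprop_step(w, r, grad, lr=0.01, rho=0.9, eps=1e-8):
    # Exponentially weighted moving average of squared gradients,
    # replacing AdaGrad's ever-growing sum.
    r = rho * r + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(r) + eps)
    return w, r

# Toy quadratic f(w) = 0.5 * w^2.
w, r = np.array([5.0]), np.zeros(1)
for _ in range(1000):
    w, r = rmsprop_step(w, r, grad=w)
print(w)  # settles near the minimum at 0
```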

# Adam

1. First, in Adam, momentum is incorporated directly as an estimate of the first order moment (with exponential weighting) of the gradient.
2. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin.
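Putting the two points together, an Adam step can be sketched as follows (the defaults β1 = 0.9, β2 = 0.999 are the commonly used values, assumed here):

```python
import numpy as np

def adam_step(w, s, r, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    s = beta1 * s + (1 - beta1) * grad        # first-moment (momentum) estimate
    r = beta2 * r + (1 - beta2) * grad ** 2   # second-moment estimate
    s_hat = s / (1 - beta1 ** t)              # bias corrections for the
    r_hat = r / (1 - beta2 ** t)              # zero initialization of s and r
    w = w - lr * s_hat / (np.sqrt(r_hat) + eps)
    return w, s, r

# Toy quadratic f(w) = 0.5 * w^2; t starts at 1 so the corrections are defined.
w, s, r = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 10001):
    w, s, r = adam_step(w, s, r, t, grad=w)
print(w)  # approaches the minimum at 0
```

Without the bias corrections, the early `s` and `r` (both initialized to zero) would underestimate the true moments and distort the first steps.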

# Summary

1. Optimization algorithms come in first-order and second-order varieties.
2. Second-order algorithms such as Newton's method, BFGS, and L-BFGS are rarely used in practice because of their computational cost.
