# Basic Machine Learning Concepts - 2

## Capacity

The ability to perform well on previously unobserved inputs is called generalization.

Capacity is the ability of the learner (also called the model) to discover a function taken from a family of functions. Examples of such families, in increasing order of capacity:

1. Linear predictor: $y = w_1 x + b$
2. Quadratic predictor: $y = w_2 x^2 + w_1 x + b$
3. Degree-4 polynomial predictor: $y = w_4 x^4 + w_3 x^3 + w_2 x^2 + w_1 x + b$

Capacity can be measured by the number of training examples $\{(x_i, y_i)\}$ that the learner can always fit, no matter how the values of $x_i$ and $y_i$ are changed.
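As a rough illustration of this measure, the sketch below checks how many arbitrary points a polynomial predictor can reproduce exactly. It assumes NumPy, a least-squares fit via `np.polyfit`, and inputs with distinct `x` values (a polynomial cannot fit two different targets at the same input); the helper name `fits_exactly` and the tolerance are hypothetical choices made for this example.

```python
import numpy as np

def fits_exactly(degree, x, y, tol=1e-6):
    """Return True if a least-squares polynomial of the given degree
    reproduces every target in y to within tol."""
    coeffs = np.polyfit(x, y, degree)
    residual = np.max(np.abs(np.polyval(coeffs, x) - y))
    return bool(residual < tol)

rng = np.random.default_rng(0)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # five distinct inputs
y = rng.normal(size=5)                      # arbitrary targets

# A degree-d polynomial has d + 1 free parameters, so it can always
# pass through d + 1 points with distinct inputs, but generally not more.
for degree in (1, 2, 4):
    print(f"degree {degree}: fits all 5 points exactly? {fits_exactly(degree, x, y)}")
```

Under these assumptions the degree-4 predictor fits all five points, while the linear predictor is only guaranteed to fit two, which matches the ordering of capacities in the list above.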

## Underfitting and Overfitting

Training: estimate the parameters of $f$ from training examples $\{(x_i, y_i)\}$.

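During training we define a `cost function` (also called an `objective function`), and training amounts to optimizing this function; the error it leaves on the training set is the `training error`. To measure the model's generalization, we apply the model to a separate set, called the `test set`, which likewise yields a `generalization error` (also called the `test error`). How well a learner performs is determined by its ability to: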

1. Make the training error small.
2. Make the gap between training and test error small.
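The two quantities in the list above can be made concrete with a small sketch. It assumes a toy data set (noisy samples of $\sin(\pi x)$), a cubic polynomial as the model, and mean squared error as the cost function; none of these choices come from the notes themselves, and the helper `mse` is a hypothetical name.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, used here as the cost function."""
    return float(np.mean((y_true - y_pred) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=60)  # assumed toy data

# Hold out a separate test set that is never used for fitting.
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

# Estimate the parameters from the training examples only.
coeffs = np.polyfit(x_train, y_train, 3)

train_error = mse(y_train, np.polyval(coeffs, x_train))
test_error = mse(y_test, np.polyval(coeffs, x_test))
gap = test_error - train_error

print(f"training error: {train_error:.4f}")
print(f"test error:     {test_error:.4f}")
print(f"gap:            {gap:.4f}")
```

The first goal corresponds to keeping `train_error` small, the second to keeping `gap` small.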

Underfitting means the learner fails to capture the patterns in the training examples. Overfitting means the learner fits the training examples well but generalizes poorly.

Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.

A model whose capacity is too low may underfit; a model whose capacity is too high may overfit.

Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
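The effect of capacity on both errors can be explored with a small experiment. The sketch below is only an illustration under assumed conditions: synthetic noisy samples of $\sin(\pi x)$, polynomial degree as a stand-in for capacity, and mean squared error as the error measure. The helpers `mse` and `sample` are hypothetical names.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

def sample(n, rng):
    """Draw n noisy samples of sin(pi * x) on [-1, 1] (assumed toy data)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

rng = np.random.default_rng(0)
x_train, y_train = sample(15, rng)    # small training set
x_test, y_test = sample(200, rng)     # separate test set

results = {}
for degree in (1, 3, 9):              # low, medium, and high capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),
        mse(y_test, np.polyval(coeffs, x_test)),
    )

for degree, (train_error, test_error) in results.items():
    print(f"degree {degree}: train={train_error:.4f} test={test_error:.4f}")
```

Because each polynomial family contains the lower-degree ones, the training error can only stay the same or shrink as the degree grows; whether the test error follows depends on the data, which is exactly the gap the text describes.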

Typical causes of underfitting:

- The model is not rich enough.
- The global optimum of the objective function on the training set is hard to find, or the optimizer easily gets stuck in a local minimum.
- Computational resources are limited (for example, too few training iterations of an iterative optimization procedure).

Typical causes of overfitting:

- The family of functions is too large (compared with the size of the training data) and contains many functions that all fit the training data well.
- Without sufficient data, the learner cannot distinguish which function is most appropriate and makes an arbitrary choice among these apparently good solutions.
- In most cases, data is contaminated by noise. A learner with large capacity tends to describe random errors or noise instead of the underlying model of the data (classes).

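At the optimal capacity the test error reaches its minimum: capacity is high enough to keep the training error low, yet low enough that the gap between training and test error has not grown large.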

Common ways to reduce overfitting:

- Reduce the number of features.
- Reduce the number of independent parameters.
- Add regularization to the learner.
- Reduce the network size of deep models.
- Reduce the number of training iterations.
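As one illustration of the regularization item in the list above, the sketch below applies an L2 (ridge) penalty to a degree-9 polynomial fit. Ridge regression is just one possible regularizer, chosen here for its closed-form solution; the toy data, the penalty strengths, and the helper `ridge_fit` are assumptions made for this example. For simplicity the bias term is penalized too, which practical implementations often avoid.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y, the closed-form ridge solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=15)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=15)  # assumed toy data

# Polynomial features 1, x, x^2, ..., x^9 (a high-capacity model).
X = np.vander(x, 10, increasing=True)

w_weak = ridge_fit(X, y, 1e-6)   # almost no regularization
w_strong = ridge_fit(X, y, 1.0)  # stronger regularization

print("weight norm, weak penalty:  ", np.linalg.norm(w_weak))
print("weight norm, strong penalty:", np.linalg.norm(w_strong))
```

A larger penalty shrinks the weight vector, which restricts the functions the learner can effectively represent without changing the number of parameters.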

