To establish notation for future use, we’ll use x^(i) to denote the input variables and y^(i) to denote the output variable that we are trying to predict. A pair (x^(i), y^(i)) is called a training example, and the dataset that we’ll be using to learn, a list of m training examples (x^(i), y^(i)); i = 1, ..., m, is called a training set. Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation.
We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = ℝ.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:
When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem.
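As a minimal sketch, a hypothesis for a single input feature can take the linear form hθ(x) = θ0 + θ1·x; the parameter values and the square-footage example below are illustrative, not from the text.

```python
def h(theta0, theta1, x):
    """Predict y for input x under the linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Illustrative numbers: with theta0 = 50 and theta1 = 0.1, a house of
# 2000 sq. ft. is predicted to cost 50 + 0.1 * 2000 = 250 (in some price unit).
print(h(50, 0.1, 2000))  # 250.0
```

The learning problem, then, is choosing θ0 and θ1 so that these predictions are close to the y values in the training set.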
We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from the x’s and the actual outputs y’s:

J(θ0, θ1) = 1/(2m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

To break it apart, it is (1/2)·x̄, where x̄ is the mean of the squares of hθ(x^(i)) − y^(i), that is, of the differences between the predicted values and the actual values.
This function is otherwise called the “squared error function”, or “mean squared error”. The mean is halved (1/2) as a convenience for the computation of gradient descent, because the factor of 2 produced by differentiating the square term cancels the 1/2.
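The cost function above can be sketched directly in Python; the data points and parameter values below are illustrative, not from the text.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum over i of (h(x^(i)) - y^(i))^2."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs = [1, 2, 3]
ys = [1, 2, 3]

# A hypothesis that fits the data exactly (theta0 = 0, theta1 = 1) has zero cost:
print(cost(0, 1, xs, ys))    # 0.0
# A worse fit gives a larger cost:
print(cost(0, 0.5, xs, ys))
```

A smaller value of J means the hypothesis tracks the training data more closely, which is why learning is framed as minimizing J over θ0 and θ1.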
The following image summarizes what the cost function does:
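The cancellation mentioned above can be checked numerically: with the 1/(2m) factor, the partial derivative of J with respect to θ1 is (1/m) · Σ (hθ(x^(i)) − y^(i)) · x^(i), with no stray factor of 2. The data and parameter values below are illustrative, and the check compares this analytic derivative against a finite-difference estimate of the slope of J.

```python
def cost(t0, t1, xs, ys):
    """J(t0, t1) = 1/(2m) * sum of squared residuals."""
    m = len(xs)
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def dJ_dt1(t0, t1, xs, ys):
    """Analytic partial derivative dJ/dtheta1; the 1/2 has cancelled the 2."""
    m = len(xs)
    return sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m

# Illustrative data and parameters:
xs, ys = [1.0, 2.0, 3.0], [2.0, 2.5, 3.5]
t0, t1, eps = 0.5, 0.8, 1e-6

numeric = (cost(t0, t1 + eps, xs, ys) - cost(t0, t1 - eps, xs, ys)) / (2 * eps)
print(abs(numeric - dJ_dt1(t0, t1, xs, ys)) < 1e-6)  # True
```

Because J is quadratic in θ1, the central finite difference agrees with the analytic derivative up to rounding error, confirming the clean form of the gradient.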