
Regression model evaluation

Author: 到不了的都叫做远方
Last modified: 2020-05-07 14:12:57

We learned how to quantify error in classification; now we'll discuss quantifying error for continuous problems. For example, we're trying to predict an age rather than a gender.

Getting ready

Like classification, we'll fake some data and then plot the change. We'll start simple and build up the complexity. The data will come from a simulated linear model:

m = 2
b = 1
y = lambda x: m * x + b

Also, let's get our modules loaded:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

How to do it...

We will be performing the following actions:

1. Use 'y' to generate 'y_true'.

2. Use 'y_true' plus some error to generate 'y_pred'.

3. Plot the differences.

4. Walk through various metrics and plot some of them.

Let's take care of steps 1 and 2 at the same time by having a function do the work for us. This will be almost the same as what we just saw, but we'll add the ability to specify an error (or a bias, if it's constant):

def data(x, m=2, b=1, e=None, s=10):
    """
    Args:
        x: The x values
        m: Slope
        b: Intercept
        e: Error; optional. True will give random error
        s: Standard deviation of the random error
    """
    if e is None:
        e_i = 0
    elif e is True:
        e_i = np.random.normal(0, s, len(x))
    else:
        e_i = e
    return x * m + b + e_i

Now that we have the function, let's define y_pred and y_true. We'll do it in a convenient way:

from functools import partial
N = 100
xs = np.sort(np.random.rand(N)*100)
y_pred_gen = partial(data, x=xs, e=True)
y_true_gen = partial(data, x=xs)
y_pred = y_pred_gen()
y_true = y_true_gen()
f, ax = plt.subplots(figsize=(7, 5))
ax.set_title("Plotting the fit vs the underlying process.")
ax.scatter(xs, y_pred, label=r'$\hat{y}$')
ax.plot(xs, y_true, label=r'$y$')
ax.legend(loc='best')

The output for this code is as follows:

Just to confirm the output, let's look at the classical residuals:

e_hat = y_pred - y_true
f, ax = plt.subplots(figsize=(7, 5))
ax.set_title("Residuals")
ax.hist(e_hat, color='r', alpha=.5, histtype='stepfilled')
(array([ 5.,  6.,  9., 13., 18., 20., 13.,  7.,  7.,  2.]),
 array([-18.80786995, -14.50951001, -10.21115007,  -5.91279014,
         -1.6144302 ,   2.68392973,   6.98228967,  11.28064961,
         15.57900954,  19.87736948,  24.17572942]),
 <a list of 1 Patch objects>)

The output for the residuals is as follows:

So that looks good now.
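As a quick numerical sanity check (a sketch, not part of the original recipe; the seed is a hypothetical choice for reproducibility), the residuals should roughly match the noise we injected: sample mean near 0 and sample standard deviation near s=10:

```python
import numpy as np

np.random.seed(0)  # hypothetical seed, not in the original recipe
xs = np.sort(np.random.rand(100) * 100)
y_true = 2 * xs + 1                       # the underlying process, m=2, b=1
e_hat = np.random.normal(0, 10, len(xs))  # injected noise with s=10
y_pred = y_true + e_hat

# The residuals are draws from N(0, 10), so their sample mean should be
# near 0 and their sample standard deviation near 10.
print(np.mean(e_hat), np.std(e_hat))
```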

How it works...

Now let's move on to the metrics.

The first metric is the mean squared error:

MSE = (1/n) * Σ (y_i - ŷ_i)^2

You can use the following code to find the value of the mean squared error:

metrics.mean_squared_error(y_true, y_pred)
89.20901624683765

You'll notice that this metric penalizes large errors more than small ones. It's important to remember that all we're doing here is applying what was probably the model's cost function to the test data.
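Because the residuals are squared, one large error dominates several small ones. A minimal sketch on hypothetical numbers (not the recipe's simulated data) showing this, and showing that mean_squared_error is just the mean of the squared residuals:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 7.0])

# MSE is the mean of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)
assert np.isclose(mse, metrics.mean_squared_error(y_true, y_pred))

# The single error of 3.0 contributes 9.0 to the sum of squares,
# dwarfing the 0.25 contributed by the error of 0.5.
print(mse)  # (0.25 + 0 + 1 + 9) / 4 = 2.5625
```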

Another option is the mean absolute error. We need to take the absolute value of the difference; if we don't, our value will probably be fairly close to zero, the mean of the error distribution:

MAE = (1/n) * Σ |y_i - ŷ_i|
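The recipe doesn't show the call, but scikit-learn exposes this as mean_absolute_error. A minimal sketch on hypothetical numbers, also showing why dropping the absolute value is misleading:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

# Mean absolute error: the average of |y_true - y_pred|.
mae = np.mean(np.abs(y_true - y_pred))
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred))

# Without the absolute value, the errors of -0.5 and +0.5 cancel,
# giving a mean of exactly 0 even though the predictions are off.
raw_mean = np.mean(y_true - y_pred)
print(mae, raw_mean)  # 0.25 vs. 0.0
```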

The final option is R², which is 1 minus the ratio of the fitted model's squared error to the squared error about the overall mean. As that ratio tends to 0, R² tends to 1:

metrics.r2_score(y_true, y_pred)
0.9729558770974368

R² can be deceptive; on its own it cannot give the clearest sense of the accuracy of the model.
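To make the definition above concrete, here is a sketch (on hypothetical numbers) computing R² by hand as 1 - SS_res / SS_tot and checking it against r2_score:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9])

# SS_res: squared error of the fitted model.
ss_res = np.sum((y_true - y_pred) ** 2)
# SS_tot: squared error about the overall mean of y_true.
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)

r2 = 1 - ss_res / ss_tot
assert np.isclose(r2, metrics.r2_score(y_true, y_pred))
print(r2)  # 1 - 0.07 / 5.0 = 0.986
```

Note that SS_tot depends on how spread out y_true is, which is one reason R² can mislead: the same absolute errors yield a higher R² when the targets span a wider range.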

This article is a translation of a foreign-language original.

For copyright concerns, contact cloudcommunity@tencent.com for removal.
