Comparing Models with Hypothesis Tests

老齐

Published 2021-10-11 17:09:29

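This article is an excerpt from the manuscript of 《机器学习数学基础》 (Mathematical Foundations of Machine Learning, publication planned for November), offered as a preview. For more about the book, see http://math.itdiffer.com/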

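How widely and deeply hypothesis testing can be applied in machine learning, and how to apply it, are questions that remain open to discussion. Taking the "5x2cv paired t test" proposed by Dietterich as an example, this article briefly shows how to use a hypothesis test to compare two machine learning models (Dietterich TG (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput 10:1895–1923).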

The program below creates two classifiers, LogisticRegression() and DecisionTreeClassifier(), trains and tests them on the iris dataset, and then evaluates each model's prediction accuracy.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split


X, y = iris_data()
model1 = LogisticRegression(random_state=1)
model2 = DecisionTreeClassifier(random_state=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=123)

score1 = model1.fit(X_train, y_train).score(X_test, y_test)
score2 = model2.fit(X_train, y_train).score(X_test, y_test)

print(f'Logistic regression accuracy: {score1*100:.2f}%')
print(f'Decision tree accuracy: {score2*100:.2f}%')

# Output
Logistic regression accuracy: 97.37%
Decision tree accuracy: 94.74%

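The output shows that the LogisticRegression() model has the higher prediction accuracy. Does that mean the two models differ significantly? No! Prediction accuracy only reflects how a model happens to perform on the current test set.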
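To see why a single split is weak evidence, one can repeat the comparison over several random splits and watch how the accuracy gap moves. The following is a minimal sketch, not part of the original program: it assumes scikit-learn's load_iris as a stand-in for mlxtend's iris_data, raises max_iter only to avoid convergence warnings, and uses ten arbitrary seeds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris is used here in place of mlxtend's iris_data.
X, y = load_iris(return_X_y=True)

gaps = []
for seed in range(10):  # ten arbitrary splits, chosen only for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # max_iter is raised only to avoid convergence warnings.
    lr = LogisticRegression(max_iter=1000, random_state=1).fit(X_tr, y_tr)
    dt = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
    # Accuracy gap (logistic regression minus decision tree) on this split.
    gaps.append(lr.score(X_te, y_te) - dt.score(X_te, y_te))

print([f"{g:+.3f}" for g in gaps])
print(f"min gap: {min(gaps):+.3f}, max gap: {max(gaps):+.3f}")
```

If the gap changes noticeably from one seed to the next, a difference observed on one particular split says little about the models themselves.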

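Dietterich argues that if the models are trained on the training set from the split above, tested on the corresponding test set, and then compared with a paired t test, the probability of a Type I error is inflated. To address this he proposed the "5x2cv paired t test". A brief introduction follows; for the full treatment, see the paper cited above.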

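Split the dataset with $k$-fold cross-validation (see https://en.wikipedia.org/wiki/Cross-validation_(statistics)) and use the folds to train and test the models (call this one "run"); suppose there are $r$ runs. In run $j$ ($1 \le j \le r$) the dataset is split into $k$ equal-sized subsets, indexed by $i$, $1 \le i \le k$. In the 5x2cv procedure, $k = 2$ and $r = 5$. Suppose there are two machine learning models, A and B. Their prediction accuracies in a given run on this split are written $a^1_{ij}$ and $b^1_{ij}$, and their difference is $x^1_{ij} = a^1_{ij} - b^1_{ij}$.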

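Next, swap the roles of the subsets, so that the original training set becomes the new test set and the original test set becomes the new training set, and run again. In the same way this gives $x^2_{ij} = a^2_{ij} - b^2_{ij}$.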
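One run of this split-and-swap step can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the author's code: swapped_differences is a hypothetical helper name, load_iris stands in for mlxtend's iris_data, and the seed is arbitrary.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def swapped_differences(model_a, model_b, X, y, seed):
    """Return (x1, x2): accuracy differences A - B for one 2-fold run.

    Hypothetical helper: x1 trains on half 1 and tests on half 2,
    x2 swaps the roles of the two halves.
    """
    X_1, X_2, y_1, y_2 = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )

    def diff(X_tr, y_tr, X_te, y_te):
        # clone() gives fresh, unfitted copies so the two fits do not interact.
        a = clone(model_a).fit(X_tr, y_tr).score(X_te, y_te)
        b = clone(model_b).fit(X_tr, y_tr).score(X_te, y_te)
        return a - b

    x1 = diff(X_1, y_1, X_2, y_2)  # train on half 1, test on half 2
    x2 = diff(X_2, y_2, X_1, y_1)  # roles swapped
    return x1, x2


X, y = load_iris(return_X_y=True)
x1, x2 = swapped_differences(
    LogisticRegression(max_iter=1000),  # max_iter raised to avoid warnings
    DecisionTreeClassifier(random_state=1),
    X, y, seed=0,
)
print(f"x1 = {x1:+.3f}, x2 = {x2:+.3f}")
```

Repeating this with five different seeds would give the five pairs of differences that the 5x2cv procedure works with.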

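Let $\overline{x}_{ij}$ denote the mean of $x^1_{ij}$ and $x^2_{ij}$, that is, $\overline{x}_{ij} = \frac{x^1_{ij} + x^2_{ij}}{2}$, with variance estimate $s^2_{ij} = (x^1_{ij} - \overline{x}_{ij})^2 + (x^2_{ij} - \overline{x}_{ij})^2$. This yields the test statistic of the 5x2cv paired t test. Written in this notation, the statistic given in Dietterich's paper is the first difference of the first run divided by the square root of the mean of the five variance estimates:

$$t = \frac{x^1_{i1}}{\sqrt{\frac{1}{5}\sum_{j=1}^{5} s^2_{ij}}}$$

Under the null hypothesis, $t$ approximately follows a t distribution with 5 degrees of freedom.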
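As a rough illustration of how the statistic in Dietterich's paper is assembled, the sketch below computes it from five pairs of accuracy differences: the first difference of the first run, divided by the square root of the mean of the five variance estimates. The function name t_5x2cv and the numbers in pairs are hypothetical, made up only to exercise the arithmetic; 2.571 is the two-sided 5% critical value of Student's t with 5 degrees of freedom.

```python
import math


def t_5x2cv(pairs):
    """Compute the 5x2cv t statistic from five (x1, x2) difference pairs.

    Hypothetical helper: x1 and x2 are the accuracy differences before and
    after the two halves of a run swap roles.
    """
    if len(pairs) != 5:
        raise ValueError("5x2cv expects exactly five runs")
    variances = []
    for x1, x2 in pairs:
        mean = (x1 + x2) / 2
        variances.append((x1 - mean) ** 2 + (x2 - mean) ** 2)
    denom = math.sqrt(sum(variances) / 5)
    if denom == 0:
        raise ZeroDivisionError("all five variance estimates are zero")
    # Numerator: the first difference of the first run.
    return pairs[0][0] / denom


# Made-up differences, for illustration only (not results from any model).
pairs = [(0.04, 0.02), (0.03, 0.05), (0.02, 0.02), (0.06, 0.02), (0.01, 0.03)]
t = t_5x2cv(pairs)

T_CRIT = 2.571  # two-sided 5% critical value, Student's t, 5 degrees of freedom
print(f"t = {t:.3f}, reject H0 at the 5% level: {abs(t) > T_CRIT}")
```

For these made-up numbers the variance estimates sum to 0.0014, so t is about 2.39, which is below 2.571 and would not lead to rejecting the null hypothesis.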

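The program below follows this principle to test, at significance level $\alpha = 0.05$, the hypotheses $H_0$: the two models do not differ, versus $H_1$: the two models differ.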

from mlxtend.evaluate import paired_ttest_5x2cv
t, p = paired_ttest_5x2cv(estimator1=model1, 
                          estimator2=model2, 
                          X=X, y=y, 
                          scoring='accuracy', 
                          random_seed=1)

print(f'P-value: {p:.3f}, t-Statistic: {t:.3f}')

if p <= 0.05:
    print('Difference between mean performance is probably real')
else:
    print('Algorithms probably have the same performance')
    
# Output
P-value: 0.416, t-Statistic: -0.886
Algorithms probably have the same performance

The result shows $p > \alpha$, so the null hypothesis cannot be rejected.
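The link between a reported t statistic and its p-value can be checked by hand. The sketch below assumes a two-sided test against Student's t distribution with 5 degrees of freedom, the reference distribution in Dietterich's paper; two_sided_p is a hypothetical helper name, and the inputs are the rounded t statistics printed in this article's outputs.

```python
from scipy import stats


def two_sided_p(t_stat, df=5):
    """Two-sided p-value of t_stat under Student's t with df degrees of freedom."""
    # sf is the survival function 1 - cdf; doubling it covers both tails.
    return 2.0 * stats.t.sf(abs(t_stat), df)


ALPHA = 0.05
for t_stat in (-0.886, 7.269):  # rounded t statistics printed in this article
    p = two_sided_p(t_stat)
    verdict = "reject H0" if p <= ALPHA else "cannot reject H0"
    print(f"t = {t_stat:+.3f} -> p = {p:.3f} ({verdict})")
```

Under these assumptions the first value reproduces a p-value of roughly 0.416, in line with the printed output.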

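To demonstrate the effect, the parameters of DecisionTreeClassifier() are adjusted below (max_depth=1) so that the tree makes only the simplest possible decision.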

model3 = DecisionTreeClassifier(max_depth=1)
score3 = model3.fit(X_train, y_train).score(X_test, y_test)

print(f'Decision tree accuracy: {score3*100:.2f}%')

t, p = paired_ttest_5x2cv(estimator1=model1, 
                          estimator2=model3, 
                          X=X, y=y, 
                          scoring='accuracy', 
                          random_seed=1)

print(f'P-value: {p:.3f}, t-Statistic: {t:.3f}')

if p <= 0.05:
    print('Difference between mean performance is probably real')
else:
    print('Algorithms probably have the same performance')
    
# Output
Decision tree accuracy: 63.16%
P-value: 0.001, t-Statistic: 7.269
Difference between mean performance is probably real

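Clearly, model1 and model3 now differ significantly: the p-value of 0.001 is below 0.05, so the null hypothesis is rejected.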

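Machine learning models are chosen according to their average performance, but the true difference between models is unknown, and this is where hypothesis testing comes in. The above used the method proposed by Dietterich to illustrate one way of choosing between models. Research in this area is still ongoing; interested readers can consult https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/ for further reading.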

Shared from a WeChat official account; originally published 2021-09-14.