# How to Use Statistical Significance Tests to Interpret Machine Learning Results

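• How to apply normality tests to check whether your data is normally distributed.
• How to apply parametric statistical significance tests to normally distributed results.
• How to apply nonparametric statistical significance tests to results with more complex distributions.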

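## Tutorial Overview

1. Generate Sample Data
2. Summary Statistics
3. Normality Test
4. Compare Means for Gaussian Results
5. Compare Means for Gaussian Results with Different Variance
6. Compare Means for Non-Gaussian Results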

## Generate Sample Data

```python
from numpy.random import seed
from numpy.random import normal
from numpy import savetxt
# define underlying distribution of results
mean = 50
stdev = 10
# generate samples from ideal distribution
seed(1)
results = normal(mean, stdev, 1000)
# save to ASCII file
savetxt('results1.csv', results)
```

```
6.624345363663240960e+01
4.388243586349924641e+01
4.471828247736544171e+01
3.927031377843829318e+01
5.865407629324678851e+01
...
```

```python
from numpy.random import seed
from numpy.random import normal
from numpy import savetxt
# define underlying distribution of results
mean = 60
stdev = 10
# generate samples from ideal distribution
# (same seed as before, so these results are the first set shifted by 10)
seed(1)
results = normal(mean, stdev, 1000)
# save to ASCII file
savetxt('results2.csv', results)
```

```
7.624345363663240960e+01
5.388243586349924641e+01
5.471828247736544171e+01
4.927031377843829318e+01
6.865407629324678851e+01
...
```

## Summary Statistics

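```python
from pandas import DataFrame
from pandas import read_csv
from matplotlib import pyplot
# load both result files into one table, one column per set of results
results = DataFrame()
results['A'] = read_csv('results1.csv', header=None).values[:, 0]
results['B'] = read_csv('results2.csv', header=None).values[:, 0]
# descriptive stats
print(results.describe())
# box and whisker plot
results.boxplot()
pyplot.show()
# histogram
results.hist()
pyplot.show()
```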

```
                 A            B
count  1000.000000  1000.000000
mean     50.388125    60.388125
std       9.814950     9.814950
min      19.462356    29.462356
25%      43.998396    53.998396
50%      50.412926    60.412926
75%      57.039989    67.039989
max      89.586027    99.586027
```

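The results for A look better than those for B (this reading assumes a lower score is better, as with an error measure). The two sets have the same standard deviation (about 9.81), and their means differ by 10 (50.39 versus 60.39).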

## Normality Test

```python
from pandas import read_csv
from scipy.stats import normaltest
# load the first set of results
result1 = read_csv('results1.csv', header=None)
# H0: the sample was drawn from a normal distribution
value, p = normaltest(result1.values[:,0])
print(value, p)
if p >= 0.05:
    print('It is likely that result1 is normal')
else:
    print('It is unlikely that result1 is normal')
```

```
2.99013078116 0.224233941463
It is likely that result1 is normal
```

```python
from pandas import read_csv
from scipy.stats import normaltest
# load the second set of results
result2 = read_csv('results2.csv', header=None)
# H0: the sample was drawn from a normal distribution
value, p = normaltest(result2.values[:,0])
print(value, p)
if p >= 0.05:
    print('It is likely that result2 is normal')
else:
    print('It is unlikely that result2 is normal')
```

```
2.99013078116 0.224233941463
It is likely that result2 is normal
```
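Both result sets pass the test, so it can help to see what a failure looks like. The sketch below is an illustration rather than part of the original experiment: the uniform sample and its range (0 to 100) are assumptions chosen only to provide a clearly non-Gaussian contrast.

```python
from numpy.random import seed, normal, uniform
from scipy.stats import normaltest

seed(1)
samples = {
    'gaussian': normal(50, 10, 1000),   # same draw as results1.csv
    'uniform': uniform(0, 100, 1000),   # hypothetical non-Gaussian results
}

# run the same normality test on each sample and keep the p-values
p_values = {}
for name, sample in samples.items():
    stat, p = normaltest(sample)
    p_values[name] = p
    verdict = 'likely normal' if p >= 0.05 else 'unlikely normal'
    print(name, stat, p, verdict)
```

The flat shape of the uniform sample should push its p-value far below 0.05, while the Gaussian sample stays above it.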

## Compare Means for Gaussian Results

```python
from pandas import read_csv
from scipy.stats import ttest_ind
# load both sets of results
result1 = read_csv('results1.csv', header=None)
result2 = read_csv('results2.csv', header=None)
values1 = result1.values[:,0]
values2 = result2.values[:,0]
# calculate the significance (Student's t-test, equal variances assumed)
value, pvalue = ttest_ind(values1, values2, equal_var=True)
print(value, pvalue)
if pvalue > 0.05:
    print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
    print('Samples are likely drawn from different distributions (reject H0)')
```

```
-22.7822655028 2.5159901708e-102
Samples are likely drawn from different distributions (reject H0)
```
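To make the statistic less of a black box, the following sketch recomputes Student's t by hand from the pooled variance and compares it with `ttest_ind`. It regenerates the two samples in memory instead of reading the CSV files, and the names `pooled_var`, `t_manual`, and `p_manual` are introduced here purely for illustration.

```python
from math import sqrt
from numpy.random import seed, normal
from scipy.stats import t as t_dist, ttest_ind

# regenerate the two result sets exactly as in the data generation step
seed(1)
values1 = normal(50, 10, 1000)
seed(1)
values2 = normal(60, 10, 1000)

n1, n2 = len(values1), len(values2)
var1 = values1.var(ddof=1)  # unbiased sample variances
var2 = values2.var(ddof=1)

# pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_manual = (values1.mean() - values2.mean()) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
df = n1 + n2 - 2
p_manual = 2 * t_dist.sf(abs(t_manual), df)

t_scipy, p_scipy = ttest_ind(values1, values2, equal_var=True)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The two printed lines should agree up to floating-point error, which shows that the test is a scaled difference of means.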

## Compare Means for Gaussian Results with Different Variance

```python
from numpy.random import seed
from numpy.random import normal
from scipy.stats import ttest_ind
# generate results
seed(1)
n = 100
values1 = normal(50, 1, n)
values2 = normal(51, 10, n)
# calculate the significance (Welch's t-test, unequal variances)
value, pvalue = ttest_ind(values1, values2, equal_var=False)
print(value, pvalue)
if pvalue > 0.05:
    print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
    print('Samples are likely drawn from different distributions (reject H0)')
```

```
-2.62233137406 0.0100871483783
Samples are likely drawn from different distributions (reject H0)
```
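Setting `equal_var=False` switches `ttest_ind` to Welch's t-test, which keeps the two variances separate and adjusts the degrees of freedom. The sketch below recomputes the statistic by hand under that definition (using the Welch-Satterthwaite approximation for the degrees of freedom); the variable names `se1`, `se2`, `t_manual`, and `p_manual` are illustrative only.

```python
from math import sqrt
from numpy.random import seed, normal
from scipy.stats import t as t_dist, ttest_ind

# same samples as in the example above
seed(1)
n = 100
values1 = normal(50, 1, n)
values2 = normal(51, 10, n)

# squared standard error of each sample mean, kept separate (no pooling)
se1 = values1.var(ddof=1) / n
se2 = values2.var(ddof=1) / n
t_manual = (values1.mean() - values2.mean()) / sqrt(se1 + se2)

# Welch-Satterthwaite approximation of the degrees of freedom
df = (se1 + se2) ** 2 / (se1 ** 2 / (n - 1) + se2 ** 2 / (n - 1))
p_manual = 2 * t_dist.sf(abs(t_manual), df)

t_scipy, p_scipy = ttest_ind(values1, values2, equal_var=False)
print(t_manual, p_manual, df)
print(t_scipy, p_scipy)
```

Because the second sample is far noisier than the first, the approximate degrees of freedom land well below the pooled value of 2n - 2, which makes the test more conservative.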

```python
from numpy.random import seed
from numpy.random import normal
from scipy.stats import ttest_ind
from matplotlib import pyplot
# generate results
seed(1)
n = 100
values1 = normal(50, 1, n)
values2 = normal(51, 10, n)
# calculate p-values for different subsets of results
# (with a single result per sample the variance is undefined, so the first p-value is nan)
pvalues = list()
for i in range(1, n+1):
    value, p = ttest_ind(values1[0:i], values2[0:i], equal_var=False)
    pvalues.append(p)
# plot p-values vs number of results in sample
pyplot.plot(pvalues)
# draw line at the 0.05 significance level, below which we reject H0
pyplot.plot([0.05 for x in range(len(pvalues))], color='red')
pyplot.show()
```

## Compare Means for Non-Gaussian Results

```python
from numpy.random import seed
from numpy.random import randint
from scipy.stats import ks_2samp
# generate results (uniformly distributed integers, so not Gaussian)
seed(1)
n = 100
values1 = randint(50, 60, n)
values2 = randint(55, 65, n)
# calculate the significance (two-sample Kolmogorov-Smirnov test)
value, pvalue = ks_2samp(values1, values2)
print(value, pvalue)
if pvalue > 0.05:
    print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
    print('Samples are likely drawn from different distributions (reject H0)')
```

The p-value is very small, which strongly suggests that the difference between the two populations is significant.

```
0.47 2.16825856737e-10
Samples are likely drawn from different distributions (reject H0)
```
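The first number printed by `ks_2samp` is the Kolmogorov-Smirnov statistic: the largest vertical gap between the empirical cumulative distribution functions (ECDFs) of the two samples. A minimal sketch of that calculation follows; the helper `ecdf` and the names `grid` and `d_manual` are hypothetical and exist only for this illustration.

```python
import numpy as np
from numpy.random import seed, randint
from scipy.stats import ks_2samp

# same samples as in the example above
seed(1)
n = 100
values1 = randint(50, 60, n)
values2 = randint(55, 65, n)

def ecdf(sample, points):
    """Fraction of the sample that is <= each point."""
    ordered = np.sort(sample)
    return np.searchsorted(ordered, points, side='right') / len(sample)

# evaluate both ECDFs at every observed value and take the largest gap
grid = np.sort(np.concatenate([values1, values2]))
d_manual = np.max(np.abs(ecdf(values1, grid) - ecdf(values2, grid)))

d_scipy = ks_2samp(values1, values2).statistic
print(d_manual, d_scipy)
```

Because the statistic depends only on the ECDFs, the test makes no assumption about the shape of the underlying distributions.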

## Further Reading

• Normality test on Wikipedia
• https://en.wikipedia.org/wiki/Normality_test
• Student's t-test on Wikipedia
• https://en.wikipedia.org/wiki/Student's_t-test
• Welch's t-test on Wikipedia
• https://en.wikipedia.org/wiki/Welch%27s_t-test
• Kolmogorov-Smirnov test on Wikipedia
• https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

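## Summary

• How to use normality tests to check whether your experimental results are Gaussian.
• How to use statistical tests to check whether the difference between mean results is significant for Gaussian data with the same and with different variance.
• How to use statistical tests to check whether the difference between mean results is significant for non-Gaussian data.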
