Evaluating Clustering Models (Python Implementation)

三猫

Published 2019-08-23 14:56:59

Included in the column: 机器学习养成记

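When working with clustering methods, two questions come up again and again: how to choose a suitable number of clusters, and how to judge the quality of the result. This article introduces several evaluation metrics for clustering models and shows how to compute each of them in Python.

Overview

Evaluation metrics fall into two groups. External metrics compare the clustering result against the true labels of the data, so they need ground truth. Internal metrics can be computed from the data and the cluster assignments alone. The table below lists several commonly used metrics:

[Table not preserved: overview of commonly used clustering evaluation metrics]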

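Python Implementation

Silhouette Coefficient

The silhouette coefficient can be used to choose a suitable number of clusters. It lies in [-1, 1], and higher values indicate tighter, better separated clusters. Plotting the score against the number of clusters makes it easy to spot where the score changes most sharply; here the point with the largest change is taken as the best number of clusters.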

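from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# data1 is assumed to be a pandas DataFrame loaded earlier; columns 1 to 3 hold the features
data2 = data1.sample(n=2000, random_state=123, axis=0)
silhouettescore = []
for i in range(2, 8):
    # fit KMeans with i clusters and score the resulting labels
    kmeans = KMeans(n_clusters=i, random_state=123).fit(data2.iloc[:, 1:4])
    score = silhouette_score(data2.iloc[:, 1:4], kmeans.labels_)
    silhouettescore.append(score)
# plot the silhouette score against the number of clusters
plt.figure(figsize=(10, 6))
plt.plot(range(2, 8), silhouettescore, linewidth=1.5, linestyle='-')
plt.show()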

The score changes most sharply between 2 and 3 clusters, so 2 clusters is the better choice.
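To make the quantity behind silhouette_score concrete, here is a minimal from-scratch sketch for 1-D points. The function name mean_silhouette and the toy data are made up for illustration and are not the article's dataset; scoring singleton clusters as 0 is an assumption of this sketch, and at least two clusters are required.

```python
def mean_silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points (illustrative sketch)."""
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lj) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lj, []).append(abs(p - q))
        own = by_cluster.pop(li, None)
        if not own:
            # Assumption of this sketch: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance inside the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well separated groups on a line.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
score_good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])  # labels follow the groups
score_bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])   # labels ignore the groups
print(score_good, score_bad)
```

The labeling that follows the two groups scores close to 1, while the interleaved labeling scores close to 0, which is the behavior the metric is meant to capture.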

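Calinski-Harabasz Index

The Calinski-Harabasz index can also be used to choose the best number of clusters, and it is much faster to compute than the silhouette coefficient, so I personally prefer it. The smaller the dispersion within clusters and the larger the dispersion between clusters, the higher the Calinski-Harabasz score.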

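from sklearn.cluster import KMeans
# calinski_harabaz_score was renamed calinski_harabasz_score in newer scikit-learn releases
from sklearn.metrics import calinski_harabasz_score
for i in range(2, 7):
    kmeans = KMeans(n_clusters=i, random_state=123).fit(data2.iloc[:, 1:4])
    score = calinski_harabasz_score(data2.iloc[:, 1:4], kmeans.labels_)
    print('calinski_harabasz score for %d clusters: %f' % (i, score))
# calinski_harabasz score for 2 clusters: 3535.009345
# calinski_harabasz score for 3 clusters: 3153.860287
# calinski_harabasz score for 4 clusters: 3356.551740
# calinski_harabasz score for 5 clusters: 3145.500663
# calinski_harabasz score for 6 clusters: 3186.529313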

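As the output shows, two clusters gives the highest score, which agrees with the conclusion from the silhouette coefficient above.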
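The ratio of between-cluster to within-cluster dispersion can be written out by hand. The sketch below does this for 1-D points using the usual definition, (B / (k - 1)) / (W / (n - k)), where B and W are the between-cluster and within-cluster sums of squares. The function name calinski_harabasz_1d and the toy data are hypothetical, and the sketch assumes W is not zero.

```python
def calinski_harabasz_1d(points, labels):
    """Calinski-Harabasz index for 1-D points (illustrative sketch)."""
    n = len(points)
    clusters = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)
    k = len(clusters)
    overall_mean = sum(points) / n
    between = 0.0  # B: dispersion of cluster centres around the overall mean
    within = 0.0   # W: dispersion of points around their own cluster centre
    for members in clusters.values():
        centre = sum(members) / len(members)
        between += len(members) * (centre - overall_mean) ** 2
        within += sum((x - centre) ** 2 for x in members)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two well separated groups on a line.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
ch_good = calinski_harabasz_1d(points, [0, 0, 0, 1, 1, 1])
ch_bad = calinski_harabasz_1d(points, [0, 1, 0, 1, 0, 1])
print(ch_good, ch_bad)
```

Only sums of squares are involved, with no pairwise distances, which is consistent with the index being cheap to compute.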

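Adjusted Rand Index (ARI)

Starting with the adjusted Rand index, the remaining metrics are external. ARI measures how well two label assignments agree. Its maximum is 1, and the closer to 1 the better; it is close to 0 when the clustering is produced at random and can be negative when agreement is worse than chance. To keep the demonstration short, the clustering step is skipped and sample labels are used directly.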

from sklearn.metrics import adjusted_rand_score
labels_true = [0, 0, 1, 1, 0, 1]
labels_pred = [0, 0, 1, 1, 1, 2]
ari = adjusted_rand_score(labels_true, labels_pred)
print('Adjusted Rand index: %f' % (ari))
# Adjusted Rand index: 0.117647
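For readers who want to see where 0.117647 comes from, the following sketch recomputes ARI by pair counting with only the standard library. The helper name adjusted_rand_index is made up for illustration, math.comb needs Python 3.8 or later, and the sketch does not handle the degenerate case where the denominator is zero.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts (illustrative sketch)."""
    n = len(labels_true)
    # Pairs placed together in both labelings, in the true one, in the predicted one.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)  # agreement expected by chance
    maximum = (same_true + same_pred) / 2          # largest possible agreement
    return (same_both - expected) / (maximum - expected)


labels_true = [0, 0, 1, 1, 0, 1]
labels_pred = [0, 0, 1, 1, 1, 2]
ari_manual = adjusted_rand_index(labels_true, labels_pred)
print('%f' % ari_manual)  # 0.117647
```

Subtracting the chance term is what makes the index sit near 0 for random labelings.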

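Adjusted Mutual Information (AMI)

AMI likewise measures how well two label assignments agree. Its maximum is 1, and it is close to 0 for random labelings (it can dip slightly below 0). The larger the value, the better the clustering matches the ground truth.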

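from sklearn.metrics import adjusted_mutual_info_score
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
# average_method='max' reproduces the value below; it was the default in older
# scikit-learn releases, while newer ones default to 'arithmetic'
ami = adjusted_mutual_info_score(labels_true, labels_pred, average_method='max')
print('Adjusted mutual information: %f' % (ami))
# Adjusted mutual information: 0.225042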
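The adjustment in AMI subtracts the mutual information expected by chance and then normalizes by an entropy term. The sketch below spells that out with the standard library only. All function names are hypothetical; the labels [0, 0, 0, 1, 1, 1] and [0, 0, 1, 1, 2, 2] are the ones for which the value 0.225042 is obtained when normalizing by the larger of the two entropies. Normalizing by the arithmetic mean of the entropies instead gives a different value, so the choice of normalizer matters when comparing numbers.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_info(labels_true, labels_pred):
    n = len(labels_true)
    a, b = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    return sum(nij / n * log(n * nij / (a[t] * b[p])) for (t, p), nij in joint.items())


def expected_mutual_info(labels_true, labels_pred):
    """Mutual information expected under random labelings with the same cluster sizes."""
    n = len(labels_true)
    emi = 0.0
    for ai in Counter(labels_true).values():
        for bj in Counter(labels_pred).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Hypergeometric probability that the two clusters share nij members.
                prob = comb(bj, nij) * comb(n - bj, ai - nij) / comb(n, ai)
                emi += prob * nij / n * log(n * nij / (ai * bj))
    return emi


def adjusted_mutual_info(labels_true, labels_pred, normalizer=max):
    mi = mutual_info(labels_true, labels_pred)
    emi = expected_mutual_info(labels_true, labels_pred)
    norm = normalizer(entropy(labels_true), entropy(labels_pred))
    return (mi - emi) / (norm - emi)


labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
ami_max = adjusted_mutual_info(labels_true, labels_pred)  # normalize by the larger entropy
ami_arith = adjusted_mutual_info(labels_true, labels_pred, lambda x, y: (x + y) / 2)
print('%f' % ami_max)  # 0.225042
print('%f' % ami_arith)
```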

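V-measure

Before describing V-measure, two other metrics need to be introduced:

Homogeneity: each cluster contains only members of a single class.
Completeness: all members of a given class are assigned to the same cluster.

V-measure is the harmonic mean of the two. It takes values in [0, 1], and larger is better. However, because it is not adjusted for chance, AMI and ARI are recommended instead when the sample size is small or the number of clusters is large.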

from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
h_score = metrics.homogeneity_score(labels_true, labels_pred)
c_score = metrics.completeness_score(labels_true, labels_pred)
V_measure = metrics.v_measure_score(labels_true, labels_pred)
print('h_score: %f \nc_score: %f \nV_measure: %f' % (h_score, c_score, V_measure))
# h_score: 0.666667
# c_score: 0.420620
# V_measure: 0.515804
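The three numbers above can be reproduced from entropies alone, which also makes the harmonic-mean relationship visible. This is a minimal sketch with hypothetical helper names; it writes homogeneity as MI / H(true labels) and completeness as MI / H(predicted labels), which is equivalent to the conditional-entropy definitions as long as both entropies are non-zero.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_info(labels_true, labels_pred):
    n = len(labels_true)
    a, b = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    return sum(nij / n * log(n * nij / (a[t] * b[p])) for (t, p), nij in joint.items())


labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

mi = mutual_info(labels_true, labels_pred)
h_manual = mi / entropy(labels_true)   # homogeneity
c_manual = mi / entropy(labels_pred)   # completeness
v_manual = 2 * h_manual * c_manual / (h_manual + c_manual)  # harmonic mean of the two
print('%f %f %f' % (h_manual, c_manual, v_manual))  # 0.666667 0.420620 0.515804
```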

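Fowlkes-Mallows Index (FMI)

FMI is the geometric mean of the pairwise precision and recall computed between the clustering result and the ground truth. It takes values in [0, 1], and the closer to 1 the better.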

from sklearn.metrics import fowlkes_mallows_score
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
fmi = fowlkes_mallows_score(labels_true, labels_pred)
print('FMI: %f' % (fmi))
# FMI: 0.471405
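"Pairwise" precision and recall here refer to pairs of samples: a pair is a true positive when both labelings put its two samples in the same group. The sketch below counts pairs directly to arrive at the same 0.471405. The function name fowlkes_mallows is made up for illustration, and the sketch assumes each labeling has at least one pair grouped together.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """FMI by explicit pair counting (illustrative sketch)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # together only in the clustering result
        elif same_true:
            fn += 1  # together only in the ground truth
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)


labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
fmi_manual = fowlkes_mallows(labels_true, labels_pred)
print('%f' % fmi_manual)  # 0.471405
```

For these labels there are 2 true-positive pairs, 1 false-positive pair and 4 false-negative pairs, so precision is 2/3 and recall is 2/6.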

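In most cases, clustering is applied to data that has no y labels. Using an external metric for evaluation then means obtaining labels through manual annotation or similar means, which is costly, so internal metrics tend to be more useful in practice.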

Originally published on 2019-08-21; shared from the WeChat official account 机器学习养成记.
