Q: Calculating clustering accuracy

Stack Overflow user

Asked 2019-02-27 22:47:04

1 answer, 1.5K views, 0 followers, 2 votes

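I want to write Python code that computes the clustering accuracy r, defined as:

r = (A1 + ... + Ai + ... + Ak) / (the number of data objects)

where Ai is the number of data objects that occur in both the i-th cluster and its corresponding true cluster.

I need to implement it so that I can compare clustering performance with research papers that use this accuracy criterion.

I looked for an existing method in sklearn but could not find one, so I tried writing it myself.

Here is the code I wrote: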

import numpy as np

# NOTE: the original post did not include the enclosing function header;
# the name below is a placeholder. `true` and `pred` are 1-D NumPy arrays.
def clustering_accuracy(true, pred):
    # For each label in prediction, extract true labels of the same
    # index as 'labels'. Then count the number of instances of respective
    # true labels in 'labels', and assume the one with the maximum
    # number of instances is the corresponding true label.
    pred_to_true_conversion = {}
    for p in np.unique(pred):
        labels = true[pred == p]
        unique, counts = np.unique(labels, return_counts=True)
        label_count = dict(zip(unique, counts))
        pred_to_true_conversion[p] = max(label_count, key=label_count.get)

    # Count the number of instances whose true label is the same
    # as the converted predicted label.
    count = 0
    for t, p in zip(true, pred):
        if t == pred_to_true_conversion[p]:
            count += 1

    return count / len(true)

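However, I don't think my "label remapping" approach is a clever one, and there should be a better way to compute r. My approach has some problems, for example: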

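It relies on the assumption that the corresponding true label is the one that occurs most frequently in a predicted cluster, which is not always the case.
Different predicted cluster labels can be mapped to the same true cluster label, especially when the numbers of classes in the true and predicted labels differ.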
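
The second problem can be made concrete with a toy example. The sketch below is a compact restatement of the majority-vote remapping (the function name `majority_vote_accuracy` and the label arrays are made up for illustration). Because several predicted clusters may claim the same true label, a clustering that puts every point into its own cluster still scores 1.0.

```python
import numpy as np

def majority_vote_accuracy(true, pred):
    """Score each predicted cluster by the count of its most frequent true label."""
    true, pred = np.asarray(true), np.asarray(pred)
    correct = 0
    for p in np.unique(pred):
        # Number of points in cluster p that carry the cluster's majority label
        _, counts = np.unique(true[pred == p], return_counts=True)
        correct += counts.max()
    return correct / len(true)

true = np.array([0, 0, 0, 1, 1, 1])
singletons = np.arange(len(true))        # every point is its own cluster
merged = np.zeros(len(true), dtype=int)  # all points in a single cluster

print(majority_vote_accuracy(true, singletons))
print(majority_vote_accuracy(true, merged))
```

With singleton clusters every cluster is trivially "pure", so the score is perfect even though the clustering carries no information; merging everything into one cluster keeps only the majority class.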

How can I implement the accuracy r? Or is there such a method in an existing clustering library?

python

cluster-analysis

1 answer

Stack Overflow user

Accepted answer

Answered 2019-02-28 08:35:11

I believe what you describe is something I also wanted to do a while ago. This is how I solved it:

from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import normalize

normalize(contingency_matrix(labels_pred=pred, labels_true=true), norm='l1', axis=1)

This matrix gives the recall for each cluster/label combination.
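
As a small illustration, here is the same call on made-up labels (the `true` and `pred` arrays are assumptions chosen for this example). In `contingency_matrix`, rows correspond to true labels and columns to predicted clusters, so normalizing each row with `axis=1` gives, for every true label, the fraction of its points that landed in each cluster.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import normalize

# Toy labels, invented for illustration
true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([1, 1, 0, 0, 0, 0])

# Rows: true labels, columns: predicted clusters
cont = contingency_matrix(labels_true=true, labels_pred=pred)

# Row-wise L1 normalization: each row sums to 1 (recall per true label)
recall = normalize(cont.astype(float), norm='l1', axis=1)

print(cont)
print(recall)
```

Here true label 0 is split between the two clusters, while all points of true label 1 end up in predicted cluster 0.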

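Edit:

I believe the problems you mention are inherent to this approach. For some reason, some papers prefer to report accuracy or F-measure for clustering results, even though these measures are not well suited to it. This paper uses a different way of calculating the F-measure for clustering results, which at least solves the problem of multiple clusters being mapped to a single truth label. They use an assignment algorithm to solve this specific problem.

Here is my code for the Hungarian F1 score: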

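import numpy as np
from munkres import Munkres
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import normalize

def f_matrix(labels_pred, labels_true):
    # Calculate the F1 score for every (true label, predicted cluster) pair
    cont_mat = contingency_matrix(labels_pred=labels_pred, labels_true=labels_true)
    precision = normalize(cont_mat, norm='l1', axis=0)  # column-normalized
    recall = normalize(cont_mat, norm='l1', axis=1)     # row-normalized
    som = precision + recall
    # Leave F1 at 0 wherever precision + recall is 0 to avoid dividing by zero
    f1 = np.round(np.divide(2 * recall * precision, som, out=np.zeros_like(som), where=som != 0), 3)
    return f1

def f1_hungarian(f1):
    m = Munkres()
    inverse = 1 - f1  # Munkres minimizes cost, so convert scores to costs
    indices = m.compute(inverse.tolist())  # list of (row, column) pairs
    fscore = sum(f1[i] for i in indices) / len(indices)
    return fscore

f1_hungarian(f_matrix(pred, true))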
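
The same assignment idea can be applied to the accuracy r from the question. The following is only a sketch under an assumption: that "corresponding true cluster" means a one-to-one matching between predicted clusters and true labels chosen to maximize the total overlap, which the question does not state explicitly. It uses `scipy.optimize.linear_sum_assignment` instead of `munkres`, and the name `matched_accuracy` is made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

def matched_accuracy(labels_true, labels_pred):
    """Accuracy under the best one-to-one cluster/label matching (assumed definition)."""
    # Rows: true labels, columns: predicted clusters
    cont = contingency_matrix(labels_true=labels_true, labels_pred=labels_pred)
    # linear_sum_assignment minimizes cost, so negate the counts to maximize overlap
    rows, cols = linear_sum_assignment(-cont)
    # Sum of the matched overlaps (the Ai terms) divided by the number of objects
    return cont[rows, cols].sum() / len(labels_true)

true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([1, 1, 0, 0, 0, 0])

print(matched_accuracy(true, pred))
print(matched_accuracy(true, np.arange(len(true))))
```

Because each true label can be matched to at most one cluster, extra clusters contribute nothing, so placing every point in its own cluster is no longer rewarded. When the numbers of clusters and labels differ, the contingency matrix is rectangular and some rows or columns simply stay unmatched.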

2 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/54915736
