I am computing the Davies-Bouldin index in Python.
Below are the steps the code underneath tries to reproduce.

5 Steps:
1. For each cluster, compute the Euclidean distance between each point and the centroid
2. For each cluster, compute the mean of these distances
3. For each pair of clusters, compute the Euclidean distance between their centroids

Then,

4. For each pair of clusters, sum the mean intra-cluster distances (computed at step 2) and divide by the distance separating the two centroids (computed at step 3)

Finally,

5. Compute the mean of all these divisions (= all the DB indexes) to get the Davies-Bouldin index of the whole clustering

Code:
import numpy as np
from scipy.spatial.distance import pdist, euclidean

def daviesbouldin(X, labels, centroids):
    nbre_of_clusters = len(centroids)  # get the number of clusters
    distances = [[] for e in range(nbre_of_clusters)]  # store intra-cluster distances by cluster
    distances_means = []  # store the mean of these distances
    DB_indexes = []  # store the Davies-Bouldin index of each pair of clusters
    second_cluster_idx = []  # store the index of the second cluster of each pair
    first_cluster_idx = 0  # set the index of the first cluster of each pair to 0
    # Step 1: compute the Euclidean distance from each point of a cluster to its centroid
    for cluster in range(nbre_of_clusters):
        for point in range(X[labels == cluster].shape[0]):
            distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
    # Step 2: compute the mean of these distances
    for e in distances:
        distances_means.append(np.mean(e))
    # Step 3: compute the Euclidean distance between each pair of centroids
    ctrds_distance = pdist(centroids)
    # Tricky step 4: compute the Davies-Bouldin index of each pair of clusters
    for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
        second_cluster_idx.append(e)
        if second_cluster_idx[i - 1] == nbre_of_clusters - 1:
            first_cluster_idx += 1
        DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
    # Step 5: compute the mean of all DB indexes
    print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))
Regarding the arguments: X is the data; labels are the labels computed by the clustering algorithm (e.g. KMeans); centroids are the coordinates of each cluster's centroid (i.e. cluster_centers_). Also, note that I am using Python 3.
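For context, here is a minimal sketch of how the function would be called, assuming scikit-learn's KMeans; the data X below is purely illustrative:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)  # illustrative data: 500 points in 2-D

kmeans = KMeans(n_clusters=10, random_state=0).fit(X)  # fit 10 clusters
daviesbouldin(X, kmeans.labels_, kmeans.cluster_centers_)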
QUESTION 1: Is the computation of the Euclidean distance between each pair of centroids correct (step 3)?
QUESTION 2: Is my implementation of step 4 correct?
QUESTION 3: Do I need to normalize the intra-cluster and inter-cluster distances?
Further explanation of step 4

Let's say we have 10 clusters. The loop should compute the DB index of each pair of clusters.
At the first iteration: it sums the mean intra-cluster distance of cluster 0 (index 0 of distances_means) and the mean intra-cluster distance of cluster 1 (index 1 of distances_means), then divides the sum by the distance between their centroids (index 0 of ctrds_distance).

At the second iteration: it sums the mean intra-cluster distance of cluster 0 (index 0 of distances_means) and the mean intra-cluster distance of cluster 2 (index 2 of distances_means), then divides the sum by the distance between their centroids (index 1 of ctrds_distance).

And so on...
With the example of 10 clusters, the whole iteration should look like this:
intra-cluster distance    intra-cluster distance    distance between their
of cluster:               of cluster:               centroids (storage index):
0 + 1 / 0
0 + 2 / 1
0 + 3 / 2
0 + 4 / 3
0 + 5 / 4
0 + 6 / 5
0 + 7 / 6
0 + 8 / 7
0 + 9 / 8
1 + 2 / 9
1 + 3 / 10
1 + 4 / 11
1 + 5 / 12
1 + 6 / 13
1 + 7 / 14
1 + 8 / 15
1 + 9 / 16
2 + 3 / 17
2 + 4 / 18
2 + 5 / 19
2 + 6 / 20
2 + 7 / 21
2 + 8 / 22
2 + 9 / 23
3 + 4 / 24
3 + 5 / 25
3 + 6 / 26
3 + 7 / 27
3 + 8 / 28
3 + 9 / 29
4 + 5 / 30
4 + 6 / 31
4 + 7 / 32
4 + 8 / 33
4 + 9 / 34
5 + 6 / 35
5 + 7 / 36
5 + 8 / 37
5 + 9 / 38
6 + 7 / 39
6 + 8 / 40
6 + 9 / 41
7 + 8 / 42
7 + 9 / 43
8 + 9 / 44
The problem here is that I am not sure the indices of distances_means match the indices of ctrds_distance.

In other words, I am not sure that the first computed inter-cluster distance corresponds to the distance between cluster 0 and cluster 1, that the second corresponds to the distance between cluster 0 and cluster 2, and so on, following the pattern above.

In short: I am afraid I am dividing pairs of intra-cluster distances by inter-cluster distances that do not correspond to them.
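One way to check this without guessing: scipy documents that the condensed vector returned by pdist enumerates pairs in the same order as itertools.combinations(range(n), 2), i.e. (0,1), (0,2), ..., (0,9), (1,2), ... A small sketch over hypothetical toy centroids confirms the mapping:

import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, euclidean

centroids = np.random.rand(10, 2)  # toy centroids, for illustration only
condensed = pdist(centroids)
for idx, (i, j) in enumerate(combinations(range(len(centroids)), 2)):
    # each condensed entry equals the direct distance between that pair
    assert np.isclose(condensed[idx], euclidean(centroids[i], centroids[j]))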
Posted on 2018-01-10 14:07:05
Answering my own question: here is a shorter and faster corrected version of the naive Davies-Bouldin implementation above.
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []
    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
    return np.max(db) / n_cluster
Note that you can find innovative approaches that try to improve this index, notably the "New Version of the Davies-Bouldin Index" that replaces the Euclidean distance with a cylindrical distance.
Posted on 2018-01-25 16:47:09
Thanks for your implementation. I just have one question: isn't there a division missing in the last line? In the last step, the value of max(db) should be divided by the number of clusters, as implemented below.
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(Daten, DatenLabels):
    n_cluster = len(np.bincount(DatenLabels))
    cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    # mean distance to the respective cluster centroid
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []
    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
    return np.max(db)
Maybe I am overlooking that division because I am new to Python. But in my plots (I am iterating over a range of cluster counts), the value of DB.max is very low at the beginning and increases afterwards. After scaling by the number of clusters, the plot looks better (a high DB.max value at the beginning that steadily drops as the number of clusters increases).
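For reference, a hypothetical sweep like the one described might look as follows (X is any numeric data array, DaviesBouldin is the function above):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)  # illustrative data
for k in range(2, 15):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, DaviesBouldin(X, labels))  # index for each cluster count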
Best regards
Posted on 2018-06-10 10:37:25
Thanks for the code and the revisions - they really helped me get started. The shorter, faster version is not entirely correct, though. I amended it so that the dispersion score of each cluster's most similar cluster is averaged correctly.
For the original algorithm and an explanation, see the definition of the measure:

The DBI is the average of the similarity measures of each cluster with its most similar cluster.
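Written out, with S_i the mean distance of the points in cluster i to their centroid c_i, and d(c_i, c_j) the distance between two centroids, that definition reads:

DBI = (1/k) * sum_{i=1..k} max_{j != i} (S_i + S_j) / d(c_i, c_j)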
import numpy as np
from scipy.spatial.distance import euclidean

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis=0) for k in cluster_k]
    # calculate cluster dispersion
    S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    Ri = []
    for i in range(n_cluster):
        Rij = []
        # establish similarity between each cluster and all other clusters
        for j in range(n_cluster):
            if j != i:
                r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
                Rij.append(r)
        # select the Ri value of the most similar cluster
        Ri.append(max(Rij))
    # get the mean of all Ri values
    dbi = np.mean(Ri)
    return dbi
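As a sanity check, recent versions of scikit-learn (0.20 and later) ship this metric as sklearn.metrics.davies_bouldin_score, so the function above can be compared against it on illustrative data; the two should agree up to floating-point error:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.rand(500, 2)  # illustrative data
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)
print(DaviesBouldin(X, labels))         # function above
print(davies_bouldin_score(X, labels))  # scikit-learn reference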
https://stackoverflow.com/questions/48036593