我有分类数据,我正在尝试使用可用的GitHub包实现k模式。我正试图在我的(大型)数据集中创建集群,比如5-7个记录,每一个都是最相似的记录。
然而,到目前为止,我没有办法选择最优的'k‘,这将导致最大的轮廓得分,理想情况下。这将是理想的k模工作在不同/相似度量作为一个距离。因此,我假设剪影距离将根据这个不同定义的距离度量集群的距离,从而建立剪影评分。我找不到这方面的实现。
我可以用肘法吗?但是,如果不看图,我就无法理解如何以编程的方式确定这个过程,因为我必须多次重复这个过程。目前,一个想法是-寻找成本大幅下降的k。看看接下来的几个值是否降低了成本。如果是,选择这个作为k,如果不是..。然后呢?我在这一点上有点困惑。
我在网上查看,还发现了这,我无法用k模式来解释它。我正在寻找任何代码/建议,让我走上正确的道路。
发布于 2019-03-16 00:42:16
与其尝试找到一个下载源代码的地方,不如自己实现,例如,剪影。
你在博客和回复中找到的大量代码都被破坏了。
我见过很多github存储库都有错误的代码,像您这样的人都想知道为什么它不能工作。依靠匿名的别人不犯错误是个坏主意。在某种程度上,您最好自己编写代码!
当然,依赖大型开源项目(如sklearn、R、ELKI、Weka )是可以的。它们有代码评审,讨论拉请求,几十个人查看代码,使用它,试图查找和修复错误(但即使在代码中也有错误)。
发布于 2019-03-18 12:15:36
def matching_disimilarity(a, b):
return np.sum(a != b, axis=1)
silhouette_dict = dict()
cluster_labels = [...]
distinct_cluster_label_predictions = unique cluster_labels
for i in m_array:
other_records_in_cluster = m_array_(with cluster_prediction == cluster_prediction of i) - i
other_records_outside_cluster = m_array_(with cluster_prediction != cluster_prediction of i)
other_records_outside_cluster_labels = cluster labels of record in other_records_outside_cluster
sum_a = 0
sum_b = 0
sum_cluster_dist = dict()
avg_cluster_dist = dict()
for c in distinct_cluster_label_predictions:
sum_cluster_dist[c] = 0
# finding a(i) - for each observation i, calculate the average dissimilarity ai between i and all other
# points of the cluster to which i belongs.
for j in other_records_in_cluster:
sum_a += matching_disimilarity(i, j)
a = sum_a/len(other_records_in_cluster)
dict_b = dict()
# find average of inter-cluster distance with nearest neighbour
for j in other_records_outside_cluster:
dist_i_to_j = matching_disimilarity(i,j)
dict_b[j] = dist_i_to_j
sum_till_now = sum_cluster_dist[other_records_outside_cluster_labels[j]]
sum_cluster_dist[other_records_outside_cluster_labels[j]] = sum_till_now+dist_i_to_j
for c in distinct_cluster_label_predictions:
avg_cluster_dist[c] = sum_cluster_dist[c]/(length of elements_belonging_to_c)
# nearest_neighbour is the with smallest average distance
# for more than one nearest neighbour? Break randomly?
nearest_cluster_label = key of minimum avg_cluster_dist value
neighbouring_cluster_records = records with cluster_prediction == nearest_cluster_label
for k in neighbouring_cluster_records:
sum_b += dict_b[k]
b = sum_b/len(neighbouring_cluster_records)
if (a
发布于 2020-03-03 15:33:45
通常,您将选择与最高轮廓值相关联的集群数量,但这可能很棘手,因为X和Y集群之间的剪影值之间的差异可以非常微不足道。你试过制作剪影情节吗?剪影图将允许您在a -1到1的比例上,将集群数据与它们指定的集群相近程度可视化,并在垂直轴上显示集群数。
https://datascience.stackexchange.com/questions/47373
复制相似问题