
Optimizing the number of centroids


Centroids are difficult to interpret, and it can also be very difficult to determine whether we have the correct number of centroids. It's important to understand whether your data is unlabeled or not as this will directly influence the evaluation measures we can use.


Getting ready

Evaluating the model performance of unsupervised techniques is a challenge. Consequently, sklearn has several methods to evaluate clustering when a ground truth is known, and very few for when it isn't. We'll start with a single cluster model and evaluate its similarity. This is more for the purpose of mechanics, as measuring the similarity of a single cluster count is clearly not useful for finding the ground truth number of clusters.

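To make that distinction concrete (a minimal sketch of my own, not from the original recipe), sklearn's metrics module offers scores such as adjusted_rand_score that require the true labels, and scores such as silhouette_score that only need the data and the predicted cluster assignments:

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(500, centers=3)
y_pred = KMeans(n_clusters=3).fit_predict(X)

# usable only when the ground truth labels are known
metrics.adjusted_rand_score(y_true, y_pred)

# usable without any ground truth
metrics.silhouette_score(X, y_pred)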

How to do it…

To get started, we'll create several blobs that can be used to simulate clusters of data:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np

# 500 samples drawn around 3 true centers
blobs, classes = make_blobs(500, centers=3)

kmean = KMeans(n_clusters=3)
kmean.fit(blobs)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
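As a quick optional check (not part of the original recipe), a scatter plot of the blobs colored by the k-means labels, with the fitted centroids overlaid, makes the three clusters easy to eyeball:

import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(7, 5))
ax.scatter(blobs[:, 0], blobs[:, 1], c=kmean.labels_, alpha=0.5)
ax.scatter(kmean.cluster_centers_[:, 0], kmean.cluster_centers_[:, 1],
           marker='*', s=250, color='black', label='centroids')
ax.set_title("Blobs and fitted centroids")
ax.legend()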

First, we'll look at the silhouette distance. The silhouette coefficient of a sample is the difference between its mean dissimilarity to the other points in its own cluster and its mean dissimilarity to the points of the nearest other cluster, divided by the larger of those two values. It can be thought of as a measure of how well separated the clusters are.
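To make that definition concrete, here is a minimal hand-rolled sketch (my own, not from the original recipe) of the silhouette coefficient for a single sample; it should agree with metrics.silhouette_samples used below:

import numpy as np

labels = kmean.labels_
i = 0                                             # pick one sample to inspect
dists = np.linalg.norm(blobs - blobs[i], axis=1)  # distances from sample i to every point
own = labels[i]

# a: mean distance to the other points in sample i's own cluster
a = dists[labels == own].sum() / ((labels == own).sum() - 1)

# b: lowest mean distance to the points of any other cluster
b = min(dists[labels == k].mean() for k in np.unique(labels) if k != own)

s = (b - a) / max(a, b)                           # silhouette coefficient for sample i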

Let's look at the distribution of the per-sample silhouette distances; it's useful for understanding how well the points sit within their clusters:

from sklearn import metrics
import matplotlib.pyplot as plt

# per-sample silhouette coefficients for the fitted labels
silhouette_samples = metrics.silhouette_samples(blobs, kmean.labels_)
np.column_stack((classes[:5], silhouette_samples[:5]))
array([[0.        , 0.64137447],
       [1.        , 0.82054529],
       [2.        , 0.5215416 ],
       [0.        , 0.6496082 ],
       [1.        , 0.75946336]])
f, ax = plt.subplots(figsize=(10, 5))
ax.hist(silhouette_samples)
ax.set_title("Hist of Silhouette Samples")

The following is the output, a histogram of the silhouette samples:

Notice that, in general, the more coefficients there are close to 1 (which is good), the better the score.
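One way to put a rough number on that observation (a small sketch of my own, using an arbitrary 0.75 cutoff) is to check what fraction of the samples have a high coefficient:

# fraction of samples whose silhouette coefficient exceeds the cutoff
(silhouette_samples > 0.75).mean()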

How it works…

The average of the silhouette coefficients is often used to describe the entire model's fit:


silhouette_samples.mean()
0.6040968760162471

It's very common; in fact, the metrics module exposes a function to arrive at the value we just got:


metrics.silhouette_score(blobs, kmean.labels_)
0.6040968760162471

Now, let's fit models with several different cluster counts and see what the average silhouette score looks like:

# first, a new ground truth

blobs, classes = make_blobs(500, centers=10)
silhouette_avgs = []

# this could take a while

for k in range(2, 60):
    kmean = KMeans(n_clusters=k).fit(blobs)
    silhouette_avgs.append(metrics.silhouette_score(blobs, kmean.labels_))
f, ax = plt.subplots(figsize=(7, 5))
ax.plot(silhouette_avgs)

The following is the output, a plot of the average silhouette score against the number of clusters:

This plot shows how the silhouette average changes as the number of centroids increases. We can see that the optimum number, according to the data-generating process, is 10, but here it looks like it's around 6 or 7. This is the reality of clustering; quite often, we won't get the correct number of clusters, and we can only really hope to estimate the number of clusters to some approximation.
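If we do want a single number out of this plot, one simple rule of thumb (a sketch of my own, not from the original recipe) is to take the cluster count with the highest average silhouette from the loop above:

import numpy as np

# the loop started at k=2, so offset the argmax accordingly
best_k = np.argmax(silhouette_avgs) + 2
best_k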
