
Optimizing the number of centroids

Centroids are difficult to interpret, and it can also be very difficult to determine whether we have the correct number of centroids. It's important to understand whether your data is unlabeled or not, as this will directly influence the evaluation measures we can use.


Getting ready

Evaluating the model performance of unsupervised techniques is a challenge. Consequently, sklearn has several methods to evaluate clustering when a ground truth is known, and very few for when it isn't. We'll start with a single cluster model and evaluate its similarity. This is more for the purpose of mechanics, as measuring the similarity of one cluster count is clearly not useful in finding the ground truth number of clusters.

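To make the distinction concrete, here is a minimal sketch contrasting the two situations: a ground-truth score (adjusted Rand index) is available when the true labels are known, while a label-free score (silhouette) is the main option when they are not. The `random_state` is an added assumption to make the sketch reproducible; it is not part of the original recipe.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn import metrics

blobs, classes = make_blobs(500, centers=3, random_state=0)
labels = KMeans(n_clusters=3).fit_predict(blobs)

# Supervised evaluation: compares predicted labels to the known ground truth.
ari = metrics.adjusted_rand_score(classes, labels)

# Unsupervised evaluation: judges the clustering from the data geometry alone.
sil = metrics.silhouette_score(blobs, labels)
```

The adjusted Rand index needs `classes`, which a real unsupervised problem would not have; the silhouette score needs only the data and the predicted labels.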

How to do it…

To get started, we'll create several blobs that can be used to simulate clusters of data:

from sklearn.datasets import make_blobs
import numpy as np
blobs, classes = make_blobs(500, centers=3)
from sklearn.cluster import KMeans
kmean = KMeans(n_clusters=3)
kmean.fit(blobs)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

First, we'll look at silhouette distance. Silhouette distance is the difference between the closest out-of-cluster dissimilarity and the in-cluster dissimilarity, divided by the maximum of these two values. It can be thought of as a measure of how separate the clusters are.

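As a sketch of that definition, the per-sample silhouette s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance to the other points in i's own cluster and b_i is the lowest mean distance to any other cluster, can be computed by hand and checked against sklearn. The `random_state` is an added assumption for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn import metrics

blobs, _ = make_blobs(500, centers=3, random_state=0)
labels = KMeans(n_clusters=3).fit_predict(blobs)

i = 0
own = labels[i]
dists = np.linalg.norm(blobs - blobs[i], axis=1)

# a_i: mean distance to the other members of i's own cluster
# (the self-distance is 0, so divide by count - 1).
a = dists[labels == own].sum() / ((labels == own).sum() - 1)

# b_i: smallest mean distance from i to the points of any other cluster.
b = min(dists[labels == k].mean() for k in set(labels) if k != own)

s = (b - a) / max(a, b)
```

The value of `s` should match `metrics.silhouette_samples(blobs, labels)[0]`, which uses the same Euclidean distance by default.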

Let's look at the distribution of distances from the points to the cluster centers; it's useful for understanding silhouette distances:

import matplotlib.pyplot as plt
from sklearn import metrics
silhouette_samples = metrics.silhouette_samples(blobs, kmean.labels_)
np.column_stack((classes[:5], silhouette_samples[:5]))
array([[0.        , 0.64137447],
       [1.        , 0.82054529],
       [2.        , 0.5215416 ],
       [0.        , 0.6496082 ],
       [1.        , 0.75946336]])
f, ax = plt.subplots(figsize=(10, 5))
ax.hist(silhouette_samples)
ax.set_title("Hist of Silhouette Samples")

The following is the output:

Notice that, generally, the more coefficients that are close to 1 (which is good), the better the score.


How it works…

The average of the silhouette coefficients is often used to describe the entire model's fit:


silhouette_samples.mean()
0.6040968760162471

It's very common; in fact, the metrics module exposes a function to arrive at the value we just got:


metrics.silhouette_score(blobs, kmean.labels_)
0.6040968760162471

Now, let's fit the models of several cluster counts and see what the average silhouette score looks like:


# first, a new ground truth
blobs, classes = make_blobs(500, centers=10)
silhouette_avgs = []

# this could take a while
for k in range(2, 60):
    kmean = KMeans(n_clusters=k).fit(blobs)
    silhouette_avgs.append(metrics.silhouette_score(blobs, kmean.labels_))
f, ax = plt.subplots(figsize=(7, 5))
ax.plot(silhouette_avgs)

The following is the output:

This plot shows how the silhouette average changes as the number of centroids increases. We can see that the optimum number, according to the data-generating process, is 10, but here it looks like it's around 6 or 7. This is the reality of clustering; quite often, we won't get the correct number of clusters; we can only really hope to estimate the number of clusters to some approximation.

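Turning that plot into an estimate is mechanical: take the cluster count whose average silhouette score is highest, remembering that the loop started at k=2. A minimal sketch, with a shortened range and an added `random_state` (both assumptions, to keep it quick and reproducible):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn import metrics

blobs, classes = make_blobs(500, centers=10, random_state=0)

ks = range(2, 20)  # shortened search range
avgs = [metrics.silhouette_score(blobs, KMeans(n_clusters=k).fit(blobs).labels_)
        for k in ks]

# The list index is offset from k by the start of the range.
best_k = ks[int(np.argmax(avgs))]
```

As the plot shows, `best_k` is an estimate rather than a guarantee: overlapping blobs can make a smaller count score better than the true one.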

Original link: http://www.packtpub.com

Original author: Trent Hauck
