Optimizing the number of centroids

Centroids are difficult to interpret, and it can also be very difficult to determine whether we have the correct number of centroids. It's important to know whether your data is labeled, as this directly determines which evaluation measures we can use.


Getting ready

Evaluating model performance for unsupervised techniques is a challenge. Consequently, sklearn has several methods to evaluate clustering when a ground truth is known, and very few for when it isn't. We'll start with a single cluster model and evaluate its similarity. This is more for the purpose of mechanics, as measuring the similarity of one cluster count is clearly not useful in finding the ground truth number of clusters.
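To make the distinction concrete, here is a minimal sketch that scores one clustering with a metric that needs ground-truth labels (`adjusted_rand_score`) and with one that does not (`silhouette_score`). The choice of these two metrics, the sample size, and the `random_state` values are assumptions made for illustration and are not part of the original recipe.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with known labels; random_state is an assumption for reproducibility
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Requires the ground truth: compares predicted labels against y_true
ari = adjusted_rand_score(y_true, labels)

# Requires only the data and the predicted labels
sil = silhouette_score(X, labels)

print(f"adjusted Rand index: {ari:.3f}")
print(f"silhouette score:    {sil:.3f}")
```

When no labels are available, only the second kind of metric can be computed, which is why the rest of this recipe relies on the silhouette.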


How to do it…

To get started, we'll create several blobs that can be used to simulate clusters of data:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np

# 500 points drawn from 3 blobs
blobs, classes = make_blobs(500, centers=3)
kmean = KMeans(n_clusters=3)
# fit the model so that kmean.labels_ is available later
kmean.fit(blobs)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

First, we'll look at silhouette distance. For each point, let a be its mean in-cluster dissimilarity and b its mean dissimilarity to the closest other cluster. The silhouette distance is the difference b - a divided by the maximum of these two values, (b - a) / max(a, b). It ranges from -1 to 1 and can be thought of as a measure of how separate the clusters are.
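The definition can be checked by hand. The following is a sketch, assuming Euclidean distance and a hypothetical helper named `silhouette_for_point` (not a scikit-learn function), that computes the coefficient point by point and compares the result with `metrics.silhouette_samples`. The data set and `random_state` values are also assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

def silhouette_for_point(i, X, labels):
    """Hypothetical helper: (b - a) / max(a, b) for sample i."""
    dists = pairwise_distances(X[i:i + 1], X)[0]  # distances from point i to every point
    same = labels == labels[i]
    same[i] = False  # the point itself does not count toward its own cluster
    if not same.any():
        return 0.0  # a cluster of one point gets a coefficient of 0
    a = dists[same].mean()  # mean in-cluster dissimilarity
    # mean dissimilarity to each other cluster; keep the closest one
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_for_point(i, X, labels) for i in range(len(X))])
library = silhouette_samples(X, labels)
print(np.allclose(manual, library, atol=1e-6))
```

The loop is quadratic in the number of samples, so it is only meant to illustrate the formula, not to replace the library function.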


Let's look at the distribution of the per-point silhouette distances; this is useful for understanding how well each point sits in its cluster:

from sklearn import metrics
import matplotlib.pyplot as plt

# one silhouette coefficient per sample
silhouette_samples = metrics.silhouette_samples(blobs, kmean.labels_)
np.column_stack((classes[:5], silhouette_samples[:5]))
array([[0.        , 0.64137447],
       [1.        , 0.82054529],
       [2.        , 0.5215416 ],
       [0.        , 0.6496082 ],
       [1.        , 0.75946336]])
f, ax = plt.subplots(figsize=(10, 5))
ax.set_title("Hist of Silhouette Samples")
# draw the histogram of the coefficients
ax.hist(silhouette_samples)

The following is the output:

Notice that, generally, the more coefficients that are close to 1 (which is good), the better the score.


How it works…

The average of the silhouette coefficients is often used to describe the entire model's fit:

silhouette_samples.mean()

It's very common; in fact, the metrics module exposes a function to arrive at the value we just got:


metrics.silhouette_score(blobs, kmean.labels_)
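As a quick sanity check, the sketch below confirms that `metrics.silhouette_score` is the mean of the per-sample values returned by `metrics.silhouette_samples`. The fixed `random_state` values and the variable names `per_sample` and `overall` are assumptions added for reproducibility and readability.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

blobs, _ = make_blobs(500, centers=3, random_state=0)
kmean = KMeans(n_clusters=3, n_init=10, random_state=0).fit(blobs)

# One coefficient per sample, and the single summary score
per_sample = metrics.silhouette_samples(blobs, kmean.labels_)
overall = metrics.silhouette_score(blobs, kmean.labels_)

print(per_sample.mean(), overall)
print(np.isclose(per_sample.mean(), overall))
```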

Now, let's fit models with several cluster counts and see what the average silhouette score looks like:


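# first, a new ground truth
blobs, classes = make_blobs(500, centers=10)
silhouette_avgs = []

# this could take a while
for k in range(2, 60):
    kmean = KMeans(n_clusters=k).fit(blobs)
    # record the average silhouette for this cluster count
    silhouette_avgs.append(metrics.silhouette_score(blobs, kmean.labels_))

f, ax = plt.subplots(figsize=(7, 5))
# x-axis: number of clusters, y-axis: average silhouette
ax.plot(range(2, 60), silhouette_avgs)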

The following is the output:

This plot shows how the silhouette average changes as the number of centroids increases. The optimum number, according to the data generating process, is 10 (we called make_blobs with centers=10), but here it looks like it's around 6 or 7. This is the reality of clustering; quite often, we won't get the correct number of clusters, and we can only really hope to estimate it to some approximation.
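Instead of reading the estimate off the plot, the cluster count with the highest average silhouette can be picked programmatically. This is a minimal sketch, not part of the original recipe; the narrower range of candidate counts, the `random_state` values, and the names `candidate_ks`, `scores`, and `best_k` are assumptions chosen to keep the run short and reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

blobs, _ = make_blobs(500, centers=10, random_state=0)

# A narrower range than in the recipe, to keep the run short (assumption)
candidate_ks = list(range(2, 21))
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(blobs)
    scores.append(silhouette_score(blobs, labels))

# The estimate is the candidate with the highest average silhouette
best_k = candidate_ks[int(np.argmax(scores))]
print(f"k with the highest average silhouette: {best_k}")
```

The selected value need not equal the number of centers used to generate the data, for example when some blobs overlap.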



Original author: Trent Hauck

