# Optimizing the number of centroids

Centroids are difficult to interpret, and it can also be very difficult to determine whether we have the correct number of centroids. It's important to understand whether your data is unlabeled or not as this will directly influence the evaluation measures we can use.

Evaluating model performance for unsupervised techniques is a challenge. Consequently, sklearn has several methods to evaluate clustering when a ground truth is known, and very few for when it isn't. We'll start with a single cluster model and evaluate its similarity. This is more for the purpose of mechanics, as measuring the similarity of one cluster count is clearly not useful in finding the ground truth number of clusters.
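To make the distinction concrete, here is a minimal sketch contrasting one ground-truth measure with one that needs no labels. The choice of `adjusted_rand_score` as the ground-truth example, the sample size, and the `random_state` values are assumptions made for illustration; they are not part of the recipe.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with known labels; random_state is an assumption for reproducibility.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# With a ground truth: compare the predicted labels against the true ones.
ari = metrics.adjusted_rand_score(y_true, labels)

# Without a ground truth: only the data and the predicted labels are needed.
sil = metrics.silhouette_score(X, labels)

print(f"adjusted Rand index: {ari:.3f}")
print(f"silhouette score:    {sil:.3f}")
```

The first measure is only available when true labels exist, which is why the rest of this recipe relies on the silhouette.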

How to do it…

To get started, we'll create several blobs that can be used to simulate clusters of data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np

# Generate 500 points around 3 centers; classes holds the true labels
blobs, classes = make_blobs(500, centers=3)

kmean = KMeans(n_clusters=3)
kmean.fit(blobs)
# Output (the exact repr depends on the scikit-learn version):
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
#        n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
#        random_state=None, tol=0.0001, verbose=0)
```

First, we'll look at silhouette distance. For each sample, the silhouette distance is the difference between the closest out-of-cluster dissimilarity and the in-cluster dissimilarity, divided by the maximum of these two values. It can be thought of as a measure of how separate the clusters are.
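The definition can be checked by hand for a single sample. The sketch below assumes Euclidean distance (the default metric of `metrics.silhouette_samples`) and uses a hypothetical helper, `silhouette_for_sample`, that is not part of sklearn; the data and `random_state` values are also assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

def silhouette_for_sample(i, X, labels):
    """Hypothetical helper: (b - a) / max(a, b) for sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance from sample i to every point
    own = labels == labels[i]
    # a: mean distance to the *other* members of the sample's own cluster
    # (assumes that cluster has more than one member)
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = silhouette_for_sample(0, X, labels)
reference = silhouette_samples(X, labels)[0]
print(manual, reference)  # the two values should agree up to floating-point error
```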

Let's look at the distribution of the silhouette coefficients of the individual points; it's useful for understanding silhouette distances:

```python
import matplotlib.pyplot as plt
from sklearn import metrics

# One silhouette coefficient per sample
silhouette_samples = metrics.silhouette_samples(blobs, kmean.labels_)
np.column_stack((classes[:5], silhouette_samples[:5]))
# array([[0.        , 0.64137447],
#        [1.        , 0.82054529],
#        [2.        , 0.5215416 ],
#        [0.        , 0.6496082 ],
#        [1.        , 0.75946336]])

f, ax = plt.subplots(figsize=(10, 5))
ax.hist(silhouette_samples)
ax.set_title("Hist of Silhouette Samples")
```

The following is the output:

Notice that, generally, the more coefficients lie close to 1 (which is good), the better the score.
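The same reading of the histogram can be expressed numerically. This is only a sketch: the 0.5 cutoff is an arbitrary assumption, as are the data and the `random_state` values, and the per-cluster summary is an extra illustration rather than something the recipe computes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
coeffs = silhouette_samples(X, labels)

# Share of samples whose coefficient exceeds an arbitrary cutoff of 0.5 (assumption).
share_high = float(np.mean(coeffs > 0.5))
# Negative coefficients flag samples that sit closer to a neighboring cluster than to their own.
n_negative = int(np.sum(coeffs < 0))
# Mean coefficient per cluster shows whether one cluster drags the average down.
per_cluster = {int(c): float(coeffs[labels == c].mean()) for c in np.unique(labels)}

print(share_high, n_negative, per_cluster)
```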

How it works…

The average of the silhouette coefficients is often used to describe the entire model's fit:

```python
silhouette_samples.mean()
# 0.6040968760162471
```

This average is very common; in fact, the metrics module exposes a function to arrive at the value we just got:

```python
metrics.silhouette_score(blobs, kmean.labels_)
# 0.6040968760162471
```

Now, let's fit the models of several cluster counts and see what the average silhouette score looks like:

```python
# First, new ground truth
blobs, classes = make_blobs(500, centers=10)
silhouette_avgs = []

# This could take a while
for k in range(2, 60):
    kmean = KMeans(n_clusters=k).fit(blobs)
    silhouette_avgs.append(metrics.silhouette_score(blobs, kmean.labels_))

f, ax = plt.subplots(figsize=(7, 5))
# The x-axis is the list index, so position i corresponds to k = i + 2
ax.plot(silhouette_avgs)
```

The following is the output:

This plot shows the silhouette average as the number of centroids increases. According to the data-generating process, the true number of clusters is 10, but here the optimum looks like it's around 6 or 7. This is the reality of clustering: quite often we won't recover the correct number of clusters, and can only really hope to estimate it to some approximation.
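One simple way to turn the curve into an estimate is to pick the cluster count with the highest average silhouette. The sketch below assumes a smaller candidate range than the recipe (2 to 10) to keep it quick, and fixes `random_state` for reproducibility; both are illustrative choices, and the estimate it prints is not guaranteed to match the number of generating centers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

candidate_ks = list(range(2, 11))  # illustrative range, not the recipe's 2..59
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# The estimate is the candidate with the highest average silhouette.
best_k = candidate_ks[int(np.argmax(scores))]
print(f"estimated number of clusters: {best_k}")
```

Because overlapping blobs can merge into one apparent cluster, the result is best treated as an approximation rather than the true count.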
