
Using KMeans to cluster data

By 到不了的都叫做远方 · Last modified 2020-04-23 11:21:05

Clustering is a very useful technique. Often, we need to divide and conquer when taking action. Consider a list of potential customers for a business: the business might need to group the customers into cohorts and then departmentalize responsibilities for those cohorts. Clustering can help facilitate this process. KMeans is probably one of the most well-known clustering algorithms and, in a larger sense, one of the most well-known unsupervised learning techniques.

Getting ready

First, let's walk through some simple clustering; then we'll talk about how KMeans works:

from sklearn.datasets import make_blobs
blobs, classes = make_blobs(500, centers=3)
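
As a quick sanity check (an addition to the original recipe), these two arrays have the shapes you would expect from make_blobs with the arguments above: 500 two-dimensional points plus one integer class label per point.

blobs.shape    # (500, 2) -- make_blobs defaults to two features
classes.shape  # (500,)   -- one label in {0, 1, 2} per point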

Also, since we'll be doing some plotting, import matplotlib as shown:

import matplotlib.pyplot as plt

How to do it…

We are going to walk through a simple example that clusters blobs of fake data. Then we'll talk a little bit about how KMeans works to find the optimal number of blobs. Looking at our blobs, we can see that there are three distinct clusters:

import numpy as np

rgb = np.array(['r', 'g', 'b'])  # map each class label (0, 1, 2) to a color
f, ax = plt.subplots(figsize=(7.5, 7.5))
ax.scatter(blobs[:, 0], blobs[:, 1], color=rgb[classes])
ax.set_title("Blobs")

The output is as follows:

Now we can use KMeans to find the centers of these clusters. In the first example, we'll pretend we know that there are three centers:

from sklearn.cluster import KMeans

kmean = KMeans(n_clusters=3)
kmean.fit(blobs)
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
#        n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
#        random_state=None, tol=0.0001, verbose=0)

kmean.cluster_centers_
# array([[-6.80788414, -4.46656975],
#        [ 2.42680019, -8.48849593],
#        [-3.31953493, -3.87018295]])

f, ax = plt.subplots(figsize=(7.5, 7.5))
ax.scatter(blobs[:, 0], blobs[:, 1], color=rgb[classes])
ax.scatter(kmean.cluster_centers_[:, 0], kmean.cluster_centers_[:, 1],
           marker='*', s=250, color='black', label='Centers')
ax.set_title("Blobs")
ax.legend(loc='best')

The following screenshot shows the output:

Other attributes are useful too. For instance, the labels_ attribute will produce the expected label for each point:

kmean.labels_[:5]
# array([1, 1, 2, 2, 1], dtype=int32)
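
Tallying these labels is a quick way to see how many points landed in each cluster; this small check is an addition to the original recipe, and the exact counts will vary from run to run since make_blobs splits the 500 points roughly evenly:

import numpy as np

# Count how many points were assigned to each cluster label (0, 1, 2).
np.bincount(kmean.labels_)
# e.g. array([166, 167, 167])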

We can check whether kmean.labels_ is the same as classes, but because KMeans has no knowledge of the classes going in, it cannot assign the same index values to the classes; the cluster labels and the original class labels may be permuted relative to each other:

classes[:5]
# array([0, 0, 2, 2, 0])

Feel free to swap 1 and 0 in classes to see if it matches up with labels_.
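
Rather than swapping labels by hand, a permutation-invariant measure such as the adjusted Rand index confirms whether the two labelings describe the same grouping regardless of how the cluster indices are numbered. This check is an addition to the original recipe:

from sklearn.metrics import adjusted_rand_score

# A score near 1.0 means the clustering recovers essentially the same
# grouping as the true classes, even though the label numbers differ.
adjusted_rand_score(classes, kmean.labels_)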

The transform function is quite useful in the sense that it will output the distance between each point and each centroid:

kmean.transform(blobs)[:5]
# array([[ 0.84207543, 10.78904527,  3.9550393 ],
#        [10.07712825,  0.6313236 ,  7.5485466 ],
#        [ 3.34770011,  8.57737829,  1.28876016],
#        [ 0.48377065,  9.88267005,  3.6449556 ],
#        [ 9.84986949,  1.20386272,  6.86385856]])
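
Since each row of this output holds the distances from one point to each of the three centroids, the fitted label should simply be the index of the smallest entry in that row. The following check is an addition to the original recipe:

import numpy as np

# The nearest centroid for every point should reproduce kmean.labels_.
nearest = np.argmin(kmean.transform(blobs), axis=1)
(nearest == kmean.labels_).all()
# True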

How it works...

KMeans is actually a very simple algorithm that works to minimize the within-cluster sum of squared distances from the mean. We'll be minimizing the sum of squares yet again!
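
scikit-learn exposes this within-cluster sum of squares on the fitted estimator as inertia_. As an illustrative aside (not part of the original text), it can be recomputed by hand from the fitted centers and labels:

import numpy as np

# Squared distance from each point to the center of its assigned cluster,
# summed over all 500 points.
wcss = ((blobs - kmean.cluster_centers_[kmean.labels_]) ** 2).sum()
np.allclose(wcss, kmean.inertia_)
# True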

It does this by first setting a pre-specified number of clusters, K, and then alternating between the following:

1. Assigning each observation to the nearest cluster

2. Updating each centroid by calculating the mean of all observations assigned to that cluster

This happens until some specified criterion is met.
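
The two alternating steps can be written out directly in NumPy. The sketch below is a simplified, illustrative version of that loop (random initialization, a fixed iteration cap, no handling of empty clusters), not scikit-learn's actual implementation, which adds k-means++ seeding, multiple restarts, and a tolerance-based stopping rule:

import numpy as np

def simple_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centers with k randomly chosen observations.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each observation to its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned observations.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centers no longer move (the "specified criterion").
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = simple_kmeans(blobs, k=3)

The centers this returns should land close to kmean.cluster_centers_, although their order (and therefore the label numbering) will generally differ.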
