
Using KMeans for outlier detection


In this chapter, we'll look at both the debate and mechanics of KMeans for outlier detection. It can be useful to isolate some types of errors, but care should be taken when using it.


Getting ready

In this recipe, we'll use KMeans to do outlier detection on a cluster of points. It's important to note that there are many "camps" when it comes to outliers and outlier detection. On one hand, by removing outliers we're potentially removing points that were generated by the data-generating process. On the other hand, outliers can be due to a measurement error or some other outside factor.


This is the most credence we'll give to the debate; the rest of this recipe is about finding outliers, and we'll work under the assumption that our choice to remove outliers is justified. The act of outlier detection is a matter of finding the centroids of the clusters, and then identifying points that are potential outliers by their distances from the centroid.

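In essence, the recipe that follows boils down to this short sketch (the random point cloud and the names points and centroid are illustrative placeholders, not part of the original recipe):

import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))    # hypothetical 2-D point cloud
centroid = points.mean(axis=0)        # single-cluster centroid

# Rank points by Euclidean distance to the centroid; the farthest are outlier candidates.
dists = np.linalg.norm(points - centroid, axis=1)
outlier_idx = np.argsort(dists)[-5:]  # indices of the five farthest points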

How to do it...

First, we'll generate a single blob of 100 points, and then we'll identify the 5 points that are furthest from the centroid. These are the potential outliers:


import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
X, labels = make_blobs(100, centers=1)

It's important that the KMeans cluster has a single center. This idea is similar to a one-class SVM that is used for outlier detection:


from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=1)
kmeans.fit(X)
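For comparison, here is a minimal sketch of the one-class SVM analogy mentioned above (OneClassSVM lives in sklearn.svm; the nu value, roughly the expected fraction of outliers, is an illustrative choice, not part of the original recipe):

from sklearn.svm import OneClassSVM

# nu ~ expected fraction of outliers; 0.05 mirrors the 5-of-100 points flagged below.
ocsvm = OneClassSVM(nu=0.05)
ocsvm.fit(X)
svm_outliers = ocsvm.predict(X) == -1  # boolean mask; True marks predicted outliers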

Now, let's look at the plot. For those playing along at home, try to guess which points will be identified as one of the five outliers:


f, ax = plt.subplots(figsize=(7, 5))
ax.set_title("Blob")
ax.scatter(X[:, 0], X[:, 1], label='Points')
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           marker='*', label='Centroid', color='r')
ax.legend()

The following is the output:

Now, let's identify the five points that are farthest from the centroid:

distances = kmeans.transform(X)

# argsort returns an array of indexes that will sort the array in ascending order,
# so we reverse it via [::-1] and take the top five with [:5]
sorted_idx = np.argsort(distances.ravel())[::-1][:5]
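As an aside, when only the k largest distances are needed, np.argpartition avoids the full sort; this is an optional variant, not part of the original recipe:

# Same five indices (in arbitrary order) without fully sorting the distances.
sorted_idx_alt = np.argpartition(distances.ravel(), -5)[-5:]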

Now, let's see which points are the farthest away:

f, ax = plt.subplots(figsize=(7, 5))
ax.set_title("Single Cluster")
ax.scatter(X[:, 0], X[:, 1], label='Points')
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           label='Centroid', color='r')
ax.scatter(X[sorted_idx][:, 0], X[sorted_idx][:, 1],
           label='Extreme Value', edgecolors='g', facecolors='none', s=100)
ax.legend(loc='best')

The following is the output:

It's easy to remove these points if we like:

new_X = np.delete(X, sorted_idx, axis=0)

Also, the centroid clearly changes with the removal of these points:

new_kmeans = KMeans(n_clusters=1)
new_kmeans.fit(new_X)

Let's visualize the difference between the old and new centroids:

f, ax = plt.subplots(figsize=(7, 5))
ax.set_title("Extreme Values Removed")
ax.scatter(new_X[:, 0], new_X[:, 1], label='Pruned Points')
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           label='Old Centroid', color='r', s=80, alpha=.5)
ax.scatter(new_kmeans.cluster_centers_[:, 0], new_kmeans.cluster_centers_[:, 1],
           label='New Centroid', color='m', s=80, alpha=.5)
ax.legend(loc='best')

The following is the output (in this run the centroid barely moves, so the old and new centroid markers are only slightly offset):

Clearly, the centroid hasn't moved much, which is to be expected when only removing the five most extreme values. This process can be repeated until we're satisfied that the data is representative of the process.

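A minimal sketch of that repeated trimming might look as follows (the three rounds and five points per round are illustrative choices; a real stopping rule would use a distance threshold or a stability check instead):

X_trim = X.copy()
for _ in range(3):                    # hypothetical: three trimming rounds
    km = KMeans(n_clusters=1).fit(X_trim)
    d = km.transform(X_trim).ravel()
    worst = np.argsort(d)[::-1][:5]   # five farthest points this round
    X_trim = np.delete(X_trim, worst, axis=0)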

How it works...

As we've already seen, there is a fundamental connection between the Gaussian distribution and KMeans clustering. Let's create an empirical Gaussian based off the centroid and sample covariance matrix, and look at the probability of each point, which in theory should be lowest for the five points we removed. This just shows that we have in fact removed the values with the least likelihood. This relationship between distances and likelihoods is very important, and will come up quite often in your machine learning training.


Use the following command to create an empirical Gaussian:

from scipy import stats

# With no covariance argument, multivariate_normal defaults to the identity
# matrix, so the density ranking mirrors the Euclidean distance ranking here.
emp_dist = stats.multivariate_normal(kmeans.cluster_centers_.ravel())
lowest_prob_idx = np.argsort(emp_dist.pdf(X))[:5]
np.all(X[sorted_idx] == X[lowest_prob_idx])

The output is as follows:

True
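Since the text mentions the sample covariance matrix, here is a variant that actually passes it in; this is an assumption for illustration, not the original recipe's code, and with a non-identity covariance the density ranking becomes Mahalanobis-like and need not match the Euclidean ranking exactly:

# Variant (illustrative): supply the sample covariance of the data.
emp_dist_cov = stats.multivariate_normal(kmeans.cluster_centers_.ravel(),
                                         cov=np.cov(X.T))
lowest_prob_idx_cov = np.argsort(emp_dist_cov.pdf(X))[:5]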
