Q: DBSCAN and min_samples

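Stack Overflow user

Asked 2020-03-03 02:44:28

2 answers · 3.6K views · 0 followers · 3 votes

I have been trying to use DBSCAN to detect outliers. As I understand it, DBSCAN outputs -1 for outliers and 1 for inliers, but after running my code I get numbers other than -1 or 1. Can someone explain why? Also, is it normal to find the best value of eps by trial and error? I can't figure out a way to find the best eps value.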

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import DBSCAN

df = pd.read_csv('Final After Simple Filtering.csv', index_col=None, low_memory=True)

# Dropping columns with low feature importance
del df['AmbTemp_DegC']
del df['NacelleOrientation_Deg']
del df['MeasuredYawError']

# Applying DBSCAN
# (the estimator is named `dbscan` so it does not shadow the imported DBSCAN class)
dbscan = DBSCAN(eps=1.8, min_samples=10, n_jobs=-1)
df['anomaly'] = dbscan.fit_predict(df)

np.unique(df['anomaly'], return_counts=True)

(array([  -1,    0,    1, ..., 8462, 8463, 8464]),
 array([1737565, 3539278, 4455734, ...,      13,       8,       8]))

Thanks.

python

machine-learning

cluster-analysis

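2 Answers

Stack Overflow user

Accepted answer

Answered 2020-03-03 03:23:44

Actually, you have not quite grasped the idea behind DBSCAN yet.

Here is the definition, quoted from Wikipedia:

A point p is a core point if at least minPts points are within distance ε of it (including p). A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points. A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. Note that this implies that all points on the path must be core points, with the possible exception of q. All points not reachable from any other point are outliers or noise points.

So, in simple words, the idea is: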

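Any sample that has at least min_samples neighbors within a distance of eps (counting itself) is a core sample.
Any sample that is not a core sample but has at least one core neighbor (at a distance of at most eps) is a directly reachable sample and can be added to that cluster.
Core samples that lie within eps of each other chain together, so a sample is reachable, and gets added to the cluster, if it can be linked to the cluster through such a chain of core samples; a sample whose only neighbors are non-core samples is not reachable.
Any other sample is considered noise, an outlier, or whatever you want to call it (these are labeled -1).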
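
To make these definitions concrete, here is a minimal NumPy sketch that sorts a handful of points into core, border (directly reachable but not core), and noise. The helper name `classify_points` and the toy coordinates are made up for illustration; this is not how scikit-learn implements DBSCAN internally, and the brute-force distance matrix is only reasonable for small inputs.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each row of X as 'core', 'border', or 'noise' (DBSCAN definitions)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (n x n); fine for small n only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within_eps = dists <= eps
    # Core: at least min_samples points within eps, counting the point itself.
    is_core = within_eps.sum(axis=1) >= min_samples
    # Border: not core, but within eps of at least one core point.
    near_core = (within_eps & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Toy data: a tight group of four points, one point on its edge, one far away.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.45, 0.1], [5.0, 5.0]]
kinds = classify_points(X, eps=0.4, min_samples=4)
print(kinds.tolist())
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The fifth point has only three points within eps (itself and two core points), so it is not core, but it still joins the cluster because it sits within eps of a core point. The last point has no core neighbor and ends up as noise.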

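Depending on your clustering parameters (eps and min_samples), you will very likely end up with more than two clusters, and each cluster gets its own integer label (0, 1, 2, ...). That is why you see values other than 0 and -1 in your clustering result.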
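
If all you need is an outlier / inlier flag, one option is to collapse the cluster labels into a boolean mask. The sketch below uses made-up synthetic data (three tight blobs plus two isolated points) and arbitrary parameter values, so treat it as an illustration rather than a recommendation for your dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic stand-in data: three tight blobs plus two isolated points.
blobs = [rng.normal(loc=c, scale=0.05, size=(30, 2)) for c in ([0, 0], [3, 3], [0, 3])]
isolated = np.array([[10.0, 10.0], [-10.0, 10.0]])
X = np.vstack(blobs + [isolated])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Labels are cluster IDs (0, 1, 2, ...); only -1 means noise.
is_outlier = labels == -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("cluster ids:", sorted(set(labels.tolist())))
print("clusters:", n_clusters, "outliers:", int(is_outlier.sum()))
```

Here `is_outlier` plays the role of the -1 / 1 flag the question expected, while `labels` keeps the information about which cluster each inlier belongs to.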

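To answer your second question:

Also, is it normal to find the best value of eps by trial and error,

If you mean doing cross-validation (on a set where you know the cluster labels, or where you can approximate the correct clustering), then yes, I think that is the normal way to do it.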
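
As a rough sketch of that kind of trial and error, the loop below tries several eps values on data with known labels and scores each clustering with the adjusted Rand index. It is a plain grid search rather than full cross-validation; the synthetic data, the candidate grid, and the choice of metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Labeled reference data (synthetic here; in practice, a subset you have labels for).
X, y_true = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                       cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

eps_grid = np.round(np.arange(0.1, 1.01, 0.1), 2)  # hypothetical candidate values
scores = {}
for eps in eps_grid:
    labels = DBSCAN(eps=float(eps), min_samples=10).fit_predict(X)
    # Noise (-1) is treated by the metric as just another group.
    scores[float(eps)] = adjusted_rand_score(y_true, labels)

best_eps = max(scores, key=scores.get)
for eps, score in scores.items():
    print(f"eps={eps:.1f}  ARI={score:.3f}")
print("best eps on this data:", best_eps)
```

The best value found this way only holds for data on a comparable scale, which is one reason to standardize the features before clustering.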

PS: The paper is very good and comprehensive. I strongly suggest you take a look. Good luck.

Votes: 5

Stack Overflow user

Answered 2020-03-13 01:36:28

I found this to be a good example for understanding how DBSCAN works.

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

a = np.array(labels)
a

Result:

array([ 0,  1,  0,  2,  0,  1,  1,  2,  0,  0,  1,  1,  1,  2,  1,  0, -1,
        1,  1,  2,  2,  2,  2,  2,  1,  1,  2,  0,  0,  2,  0,  1,  1,  0,
        1,  0,  2,  0,  0,  2,  2,  1,  1,  1,  1,  1,  0,  2,  0,  1,  2,
        2,  1,  1,  2,  2,  1,  0,  2,  1,  2,  2,  2,  2,  2,  0,  2,  2,
        0,  0,  0,  2,  0,  0,  2,  1, -1,  1,  0,  2,  1,  1,  0,  0,  0,
        0,  1,  2,  1,  2,  2,  0,  1,  0,  1, -1,  1,  1,  0,  0,  2,  1,
        2,  0,  2,  2,  2,  2, -1,  0, -1,  1,  1,  1,  1,  0,  0,  1,  0,
        1,  2,  1,  0,  0,  1,  2,  1,  0,  0,  2,  0,  2,  2,  2,  0, -1,
        2,  2,  0,  1,  0,  2,  0,  0,  2,  2, -1,  2,  1, -1,  2,  1,  1,
        2,  2,  2,  0,  1,  0,  1,  0,  1,  0,  2,  2, -1,  1,  2,  2,  1,
        0,  1,  2,  2,  2,  1,  1,  2,  2,  0,  1,  2,  0,  0,  2,  0,  0,
        1,  0,  1,  0,  1,  1,  2,  2,  0,  0,  1,  1,  2,  1,  2,  2,  2,
        2,  0,  2,  0,  2,  2,  0,  2,  2,  2,  0,  0,  1,  1,  1,  2,  2,
        2,  2,  1,  2,  2,  0,  0,  2,  0,  0,  0,  1,  0,  1,  1,  1,  2,
        1,  1,  0,  1,  2,  2,  1,  2,  2,  1,  0,  0,  1,  1,  1,  0,  1,
        0,  2,  0,  2,  2,  2,  2,  2,  1,  1,  0,  0,  1,  1,  0,  0,  2,
        1, -1,  2,  1,  1,  2,  1,  2,  0,  2,  2,  0,  1,  2,  2,  0,  2,
        2,  0,  0,  2,  0,  2,  0,  2,  1,  0,  0,  0,  1,  2,  1,  2,  2,
        0,  2,  2,  0,  0,  2,  1,  1,  1,  1,  1,  0,  1,  1,  1,  1,  0,
        0,  1,  1,  1,  0,  2,  0,  1,  2,  2,  0,  0,  2,  0,  2,  1,  0,
        2,  0,  2,  0,  2,  2,  0,  1,  0,  1,  0,  2,  2,  1,  1,  1,  2,
        0,  2,  0,  2,  1,  2,  2,  0,  1,  0,  1,  0,  0,  0,  0,  2,  0,
        2,  0,  1,  0,  1,  2,  1,  1,  1,  0,  1,  1,  0,  2,  1,  0,  2,
        2,  1,  1,  2,  2,  2,  1,  2,  1,  2,  0,  2,  1,  2,  1,  0,  1,
        0,  1,  1,  0,  1,  2, -1,  1,  0,  0,  2,  1,  2,  2,  2,  2,  1,
        0,  0,  0,  0,  1,  0,  2,  1,  0,  1,  2,  0,  0,  1,  0,  1,  1,
        0, -1,  0,  2,  2,  2,  1,  1,  2,  0,  1,  0,  0,  1,  0,  1,  1,
        2,  2, -1,  0,  1,  2,  2,  1,  1,  1,  1,  0,  0,  0,  2,  2,  1,
        2,  1,  0,  0,  1,  2,  1,  0,  0,  2,  0,  1,  0,  2,  1,  0,  2,
        2,  1,  0,  0,  0,  2,  1,  1,  0,  2,  0,  0,  1,  1,  1,  1,  0,
        1,  0,  1,  0,  0,  2,  0,  1,  1,  2,  1,  1,  0,  1,  0,  2,  1,
        0,  0,  1,  0,  1,  1,  2,  2,  1,  2,  2,  1,  2,  1,  1,  1,  1,
        2,  0,  0,  0,  1,  2,  2,  0,  2,  0,  2,  1,  0,  1,  1,  0,  0,
        1,  2,  1,  2,  2,  0,  2,  1,  1,  1,  2,  0,  0,  2,  0,  2,  2,
        0,  2,  0,  1,  1,  1,  1,  0,  0,  0,  2,  1,  1,  1,  1,  2,  2,
        2,  0,  2,  1,  1,  0,  0,  1,  0,  2,  1,  2,  1,  0,  2,  2,  0,
        0,  1,  0,  0,  2,  0,  0,  0,  2,  0,  2,  0,  0,  1,  1,  0,  0,
        1,  2,  2,  0,  0,  0,  0,  2, -1,  1,  1,  2,  1,  0,  0,  2,  2,
        0,  1,  2,  0,  1,  2,  2,  1,  0,  0, -1, -1,  2,  0,  0,  0,  2,
       -1,  2,  0,  1,  1,  1,  1,  1,  0,  0,  2,  1,  2,  0,  1,  1,  1,
        0,  2,  1,  1, -1,  2,  1,  2,  0,  2,  2,  1,  0,  0,  0,  1,  1,
        2,  0,  0,  2,  2,  1,  2,  2,  2,  0,  2,  1,  2,  1,  1,  1,  2,
        0,  2,  0,  2,  2,  0,  0,  2,  1,  2,  0,  2,  0,  0,  0,  1,  0,
        2,  1,  2,  0,  1,  0,  0,  2,  0,  2,  1,  1,  2,  1,  0,  1,  2,
        1,  2], dtype=int64)

The data points labeled -1 are outliers. Let's count the number of outliers and see whether it matches what we see in the plot above.

# Count the points labeled -1 (noise)
b = a.tolist()
count = b.count(-1)
count

Result:

We got 18! Perfect!!
