
Anomaly Detection with the IsolationForest and Mean Shift Algorithms

What Is Anomaly Detection

Defining Anomalies

Reference: https://blog.csdn.net/App_12062011/article/details/84797641

Most existing data mining research focuses on discovering regular patterns that hold for the majority of the data. In many application domains, anomalous data are simply dismissed as noise, and many data mining algorithms try to reduce or eliminate their influence. In other domains, however, identifying anomalous data is the foundation and prerequisite for much of the work, and anomalies can offer us new perspectives.

For example, in fraud detection an anomaly may indicate that fraud has occurred, and in intrusion detection an anomaly may indicate that an intrusion has taken place.

Anomaly Mining

Anomaly mining can be described as follows: given N data objects and the expected number of anomalies, find the top k objects that are markedly different from, unexpected relative to, or inconsistent with the remaining data. The anomaly mining problem consists of two sub-problems: (1) how to measure anomalousness, and (2) how to discover anomalies efficiently.

General Steps of Anomaly Detection

  • Build a profile of "normal" behavior

The profile can consist of patterns or summary statistics for the data as a whole.

  • Use the "normal" profile to detect anomalous behavior

Anomalous behavior consists of observations whose characteristics differ significantly from the "normal" profile.

Types of Anomaly Detection Methods

  • Classification- and clustering-based methods
  • Statistics-based methods (illustrated in the sketch below)
  • Distance- and density-based methods
  • Graph-based methods
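
As a concrete illustration of the statistics-based approach, and of the two-step profile/detect workflow above, here is a minimal sketch; the data, threshold, and function name are hypothetical illustrations, not from the original article:

import numpy as np

def zscore_anomalies(x, threshold=2.0):
    # Profile "normal" behavior with summary statistics (mean and std),
    # then flag observations whose z-score exceeds the threshold.
    mu, sigma = x.mean(), x.std()
    return np.abs((x - mu) / sigma) > threshold

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])
print(zscore_anomalies(x))  # only the last value (25.0) is flagged
# A threshold of 2 is used here because with so few points the outlier
# itself inflates sigma; on larger samples 3 is the common choice.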

Data Loading
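
As a stand-in dataset for the examples below, here is a minimal sketch that generates a small synthetic 2-D dataset with a few injected outliers; this dataset is a hypothetical assumption, not the article's original data:

import numpy as np

rng = np.random.RandomState(42)
# 200 "normal" points tightly clustered near the origin...
X_normal = 0.5 * rng.randn(200, 2)
# ...plus 10 scattered points that should stand out as anomalies
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([X_normal, X_outliers])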


Mean Shift Clustering

As can be seen from the source code: https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/cluster/mean_shift_.py#L298

In <class 'sklearn.cluster.mean_shift_.MeanShift'>, data points that cannot be assigned to any cluster are given the label -1 (this happens when cluster_all=False, as the listing below and the usage sketch after it show):

def mean_shift(X, bandwidth=None, seeds=None, bin_seeding=False,
               min_bin_freq=1, cluster_all=True, max_iter=300,
               n_jobs=None):
    """Perform mean shift clustering of data using a flat kernel.
    Read more in the :ref:`User Guide <mean_shift>`.
    Parameters
    ----------
    X : array-like, shape=[n_samples, n_features]
        Input data.
    bandwidth : float, optional
        Kernel bandwidth.
        If bandwidth is not given, it is determined using a heuristic based on
        the median of all pairwise distances. This will take quadratic time in
        the number of samples. The sklearn.cluster.estimate_bandwidth function
        can be used to do this more efficiently.
    seeds : array-like, shape=[n_seeds, n_features] or None
        Point used as initial kernel locations. If None and bin_seeding=False,
        each data point is used as a seed. If None and bin_seeding=True,
        see bin_seeding.
    bin_seeding : boolean, default=False
        If true, initial kernel locations are not locations of all
        points, but rather the location of the discretized version of
        points, where points are binned onto a grid whose coarseness
        corresponds to the bandwidth. Setting this option to True will speed
        up the algorithm because fewer seeds will be initialized.
        Ignored if seeds argument is not None.
    min_bin_freq : int, default=1
       To speed up the algorithm, accept only those bins with at least
       min_bin_freq points as seeds.
    cluster_all : boolean, default True
        If true, then all points are clustered, even those orphans that are
        not within any kernel. Orphans are assigned to the nearest kernel.
        If false, then orphans are given cluster label -1.
    max_iter : int, default 300
        Maximum number of iterations, per seed point before the clustering
        operation terminates (for that seed point), if has not converged yet.
    n_jobs : int or None, optional (default=None)
        The number of jobs to use for the computation. This works by computing
        each of the n_init runs in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
        .. versionadded:: 0.17
           Parallel Execution using *n_jobs*.
    Returns
    -------
    cluster_centers : array, shape=[n_clusters, n_features]
        Coordinates of cluster centers.
    labels : array, shape=[n_samples]
        Cluster labels for each point.
    Notes
    -----
    For an example, see :ref:`examples/cluster/plot_mean_shift.py
    <sphx_glr_auto_examples_cluster_plot_mean_shift.py>`.
    """

    if bandwidth is None:
        bandwidth = estimate_bandwidth(X, n_jobs=n_jobs)
    elif bandwidth <= 0:
        raise ValueError("bandwidth needs to be greater than zero or None,\
            got %f" % bandwidth)
    if seeds is None:
        if bin_seeding:
            seeds = get_bin_seeds(X, bandwidth, min_bin_freq)
        else:
            seeds = X
    n_samples, n_features = X.shape
    center_intensity_dict = {}

    # We use n_jobs=1 because this will be used in nested calls under
    # parallel calls to _mean_shift_single_seed so there is no need for
    # for further parallelism.
    nbrs = NearestNeighbors(radius=bandwidth, n_jobs=1).fit(X)

    # execute iterations on all seeds in parallel
    all_res = Parallel(n_jobs=n_jobs)(
        delayed(_mean_shift_single_seed)
        (seed, X, nbrs, max_iter) for seed in seeds)
    # copy results in a dictionary
    for i in range(len(seeds)):
        if all_res[i] is not None:
            center_intensity_dict[all_res[i][0]] = all_res[i][1]

    if not center_intensity_dict:
        # nothing near seeds
        raise ValueError("No point was within bandwidth=%f of any seed."
                         " Try a different seeding strategy \
                         or increase the bandwidth."
                         % bandwidth)

    # POST PROCESSING: remove near duplicate points
    # If the distance between two kernels is less than the bandwidth,
    # then we have to remove one because it is a duplicate. Remove the
    # one with fewer points.

    sorted_by_intensity = sorted(center_intensity_dict.items(),
                                 key=lambda tup: (tup[1], tup[0]),
                                 reverse=True)
    sorted_centers = np.array([tup[0] for tup in sorted_by_intensity])
    unique = np.ones(len(sorted_centers), dtype=np.bool)
    nbrs = NearestNeighbors(radius=bandwidth,
                            n_jobs=n_jobs).fit(sorted_centers)
    for i, center in enumerate(sorted_centers):
        if unique[i]:
            neighbor_idxs = nbrs.radius_neighbors([center],
                                                  return_distance=False)[0]
            unique[neighbor_idxs] = 0
            unique[i] = 1  # leave the current point as unique
    cluster_centers = sorted_centers[unique]

    # ASSIGN LABELS: a point belongs to the cluster that it is closest to
    nbrs = NearestNeighbors(n_neighbors=1, n_jobs=n_jobs).fit(cluster_centers)
    labels = np.zeros(n_samples, dtype=np.int)
    distances, idxs = nbrs.kneighbors(X)
    if cluster_all:
        labels = idxs.flatten()
    else:
        labels.fill(-1)
        bool_selector = distances.flatten() <= bandwidth
        labels[bool_selector] = idxs.flatten()[bool_selector]
    return cluster_centers, labels

Annotated version of the code: https://blog.csdn.net/jiaqiangbandongg/article/details/53557500
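
A minimal usage sketch of the behavior described above, assuming the synthetic X from the data-loading section: with cluster_all=False, any point farther than bandwidth from every cluster center is left as an orphan with label -1, which is what makes Mean Shift usable for anomaly detection.

from sklearn.cluster import MeanShift, estimate_bandwidth

# quantile=0.2 is an illustrative choice; tune it for your data
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, cluster_all=False)
labels = ms.fit_predict(X)
anomalies = X[labels == -1]  # orphan points outside every kernel
print("number of anomalies:", len(anomalies))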

IsolationForest Anomaly Detection
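
A minimal sketch of scikit-learn's IsolationForest on the same hypothetical X; contamination=0.05 is an assumed outlier fraction, not a value from the original article:

from sklearn.ensemble import IsolationForest

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
pred = iso.fit_predict(X)          # -1 = anomaly, 1 = normal
scores = iso.decision_function(X)  # lower scores are more anomalous
anomalies = X[pred == -1]
print("number of anomalies:", len(anomalies))

Unlike Mean Shift, IsolationForest isolates points by random recursive splits: points that are easier to isolate (shorter average path length across the trees) receive lower scores and are flagged as anomalies.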


References

  • Data Mining and Business Intelligence (《数据挖掘与商务智能》), Chapter 8: Anomaly Detection, School of Software, Xidian University (西安电子科技大学); lecturer: 黄健斌
