DBSCAN聚类教程:DBSCAN算法原理以及Python实现

聚类算法是无监督学习中的重要部分,聚类算法包括K-means、k-mediods以及DBSCAN等。DBSCAN是基于距离测量(通常为欧几里德距离)和最小点数将彼此接近的点组合在一起。DBSCAN算法可以用来查找难以手动查找的数据中的关联和结构,通常用于生物学,医学,人物识别,管理系统等多个领域。

算法原理

DBSCAN聚类的过程像树生长一样,它从种子点开始,该种子点在eps的距离内至少具有MinPoints个点。我们沿着这些附近的点进行广度优先搜索。对于给定的点,我们检查它在半径内有多少个点。如果它的数量少于MinPoints,则此点变为叶子,我们不会继续从中增长群集。我们将其所有邻居添加到我们广度优先搜索的FIFO队列中。一旦广度优先搜索完成,我们就完成了该集群,我们永远不会重新计算其中的任何一点。我们选择一个新的任意种子点,并增长下一个集群。一直持续到所有点都已分配。DBSCAN还有一个新颖的地方,如果一个点的邻居数少于MinPoints,并且它不是另一个集群的叶节点,则它被标记为不属于任何集群的“噪声”点。噪声点被识别为选择新种子的过程的一部分 - 如果特定种子点没有足够的邻居,则将其标记为噪声点。

两个参数:eps和minpoints

DBSCAN算法主要有2个参数:

eps:两点之间的最小距离。这意味着如果两点之间的距离低于或等于该值(eps),则这些点被认为是相邻。如果选择的eps值太小,则很大一部分数据不会聚集。它将被视为异常值,因为不满足创建密集区域的点数。如果选择的值太大,则群集会被合并,这样会造成大多数对象处于同一群集中。因此应该根据数据集的距离来选择eps,一般来说eps值尽量取小一点。

minPoints:表示形成密集区域的最小点数。例如,如果我们将minPoints参数设置为5,那么我们需要至少5个点来形成密集区域。作为一般规则,minPoints可以从数据集中的多个维度(D)导出,因为minPoints≥D+ 1.对于具有噪声的数据集,较大的minPoints值通常更好,并且将形成更完美的簇。minPoints的最小值必须为3,数据集越大,对应选择的minPoints值越大。

区别于K-means

DBSCAN与K-means不同的是

  • 在k-means聚类中,每个聚类由质心表示,并且点被分配给最接近的质心。在DBSCAN中,没有质心,通过将附近的点彼此链接来形成簇。
  • k-means需要指定簇的数量k。DBSCAN中不需要,DBSCAN需要指定两个参数来决定两个附近点是否应该链接到同一个集群。这两个参数是距离阈值eps和MinPoints。
  • k-means运行多次迭代以汇聚到一组良好的集群上,并且集群分配可以在每次迭代时发生变化。DBSCAN只对数据进行一次传递,一旦将某个点分配给特定的群集,它就不会发生变化。

Python实现

下面通过Python代码实现来帮助大家更好地理解DBSCAN的算法原理,实现的重点在于说明算法,例如距离的优化计算。详细代码可以参见Github。

Github

https://github.com/chrisjmccormick/dbscan

DBSCAN代码实现如下:

import numpy

def MyDBSCAN(D, eps, MinPts):
    """
    Cluster the dataset `D` using the DBSCAN algorithm.

    MyDBSCAN takes a dataset `D` (a list of vectors), a threshold distance
    `eps`, and a required number of points `MinPts`.

    It will return a list of cluster labels. The label -1 means noise, and then
    the clusters are numbered starting from 1.
    """

    # This list will hold the final cluster assignment for each point in D.
    # There are two reserved values:
    #    -1 - Indicates a noise point
    #     0 - Means the point hasn't been considered yet.
    # Initially all labels are 0.    
    labels = [0]*len(D)

    # C is the ID of the current cluster.    
    C = 0

    # This outer loop is just responsible for picking new seed points--a point
    # from which to grow a new cluster.
    # Once a valid seed point is found, a new cluster is created, and the 
    # cluster growth is all handled by the 'expandCluster' routine.
    # For each point P in the Dataset D...
    # ('P' is the index of the datapoint, rather than the datapoint itself.)
    for P in range(0, len(D)):

        # Only points that have not already been claimed can be picked as new 
        # seed points.    
        # If the point's label is not 0, continue to the next point.
        if not (labels[P] == 0):
           continue

        # Find all of P's neighboring points.
        NeighborPts = regionQuery(D, P, eps)

        # If the number is below MinPts, this point is noise. 
        # This is the only condition under which a point is labeled 
        # NOISE--when it's not a valid seed point. A NOISE point may later 
        # be picked up by another cluster as a boundary point (this is the only
        # condition under which a cluster label can change--from NOISE to 
        # something else).
        if len(NeighborPts) < MinPts:
            labels[P] = -1
        # Otherwise, if there are at least MinPts nearby, use this point as the 
        # seed for a new cluster.    
        else: 
           # Get the next cluster label.
           C += 1

           # Assing the label to our seed point.
           labels[P] = C

           # Grow the cluster from the seed point.
           growCluster(D, labels, P, C, eps, MinPts)

    # All data has been clustered!
    return labels


def growCluster(D, labels, P, C, eps, MinPts):
    """
    Grow a new cluster with label `C` from the seed point `P`.

    This function searches through the dataset to find all points that belong
    to this new cluster. When this function returns, cluster `C` is complete.

    Parameters:
      `D`      - The dataset (a list of vectors)
      `labels` - List storing the cluster labels for all dataset points
      `P`      - Index of the seed point for this new cluster
      `C`      - The label for this new cluster.  
      `eps`    - Threshold distance
      `MinPts` - Minimum required number of neighbors
    """
    SearchQueue = [P]

    # For each point in the queue:
    #   1. Determine whether it is a branch or a leaf
    #   2. For branch points, add their unclaimed neighbors to the search queue
    i = 0
    while i < len(SearchQueue):    

        # Get the next point from the queue.        
        P = SearchQueue[i]
        # Find all the neighbors of P
        NeighborPts = regionQuery(D, P, eps)

        # If the number of neighbors is below the minimum, then this is a leaf
        # point and we move to the next point in the queue.
        if len(NeighborPts) < MinPts:
            i += 1
            continue

        # Otherwise, we have the minimum number of neighbors, and this is a 
        # branch point.

        # For each of the neighbors...
        for Pn in NeighborPts:

            # If Pn was labelled NOISE during the seed search, then we
            # know it's not a branch point (it doesn't have enough 
            # neighbors), so make it a leaf point of cluster C and move on.
            if labels[Pn] == -1:
               labels[Pn] = C
            # Otherwise, if Pn isn't already claimed, claim it as part of
            # C and add it to the search queue.   
            elif labels[Pn] == 0:
                # Add Pn to cluster C.
                labels[Pn] = C

                # Add Pn to the SearchQueue.
                SearchQueue.append(Pn)

        # Advance to the next point in the FIFO queue.
        i += 1        

    # We've finished growing cluster C!

def regionQuery(D, P, eps):
    """
    Find all points in dataset `D` within distance `eps` of point `P`.

    This function calculates the distance between a point P and every other 
    point in the dataset, and then returns only those points which are within a
    threshold distance `eps`.
    """
    neighbors = []

    # For each point in the dataset...
    for Pn in range(0, len(D)):

        # If the distance is below the threshold, add it to the neighbors list.
        if numpy.linalg.norm(D[P] - D[Pn]) < eps:
           neighbors.append(Pn)

    return neighbors

本文分享自微信公众号 - 深度学习与python(PythonDC)

原文出处及转载信息见文内详细说明,如有侵权,请联系 yunjia_community@tencent.com 删除。

原始发表时间:2019-05-29

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏机器学习与python集中营

神经网络可视化(一)——Netron

tensorflow,pytorch,mxnet每一个主流的深度学习框架都提供了相对应的可视化模板,那有没有一种方法更加具有通用性呢?我们会在论文中,相关文献中...

79130
来自专栏算法channel

Python读写csv文件专题教程(2)

如果我想修改age列的数据类型为float,read_csv时可以使用dtype调整,如下:

15220
来自专栏算法channel

Python读写csv文件专题教程(3)

如果导入的某些列为时间类型,但是导入时没有为此参数赋值,导入后就不是时间类型,如下:

17030
来自专栏Python3爬虫100例教程

面试前赶紧看了5道Python Web面试题,Python面试题No17

django在中间件中预设了6个方法,这6个方法区别在于不同的阶段执行,对输入或输出进行干预,方法如下:

20530
来自专栏算法channel

4 个Python数据读取的常见错误

read_csv()是python数据分析包pandas里面使用频次较高的函数之一。它包括的参数差不多20个,可能一开始未必需要完整知道每个参数作用。不过,随着...

22630
来自专栏算法channel

趣学Python数据分析:轴和索引

上一篇总结了Python数据处理包Pandas的DataFrame,介绍了Axes相关的属性和方法。文章的图形展示效果不是很友好,再换一种形式。

13440
来自专栏Y大宽

批量对多个测序文件进行fastqc

现在一共是728*2=1456个测序文件,需要全部进行质控。 fastqc的命令很简单,直接跟文件即可,参数里面主要用-o(输出路径)和-t(线程,一般用2或...

33150
来自专栏机器学习与python集中营

python绘图骚操作之plotly(一)——plotly的基本绘图方式

Plotly是一个非常著名且强大的开源数据可视化框架,它通过构建基于浏览器显示的web形式的可交互图表来展示信息,可创建多达数十种精美的图表和地图,本文就将以j...

2.2K20
来自专栏算法channel

Python编程这两处陷阱,很容易忽视

承接上文:Python解惑之对象可变与不可变,文中提到,对象的可变(mutable object) 与不可变(immutable object)值得重视。不可变...

8640
来自专栏机器学习与python集中营

【私人笔记】深度学习框架keras踩坑记

Keras 是一个用 Python 编写的高级神经网络 API,它能够以 TensorFlow, CNTK, 或者 Theano 作为后端运行。Keras 的开...

33930

扫码关注云+社区

领取腾讯云代金券

年度创作总结 领取年终奖励