聚类算法是无监督学习中的重要部分,聚类算法包括K-means、k-mediods以及DBSCAN等。DBSCAN是基于距离测量(通常为欧几里德距离)和最小点数将彼此接近的点组合在一起。DBSCAN算法可以用来查找难以手动查找的数据中的关联和结构,通常用于生物学,医学,人物识别,管理系统等多个领域。
算法原理
DBSCAN聚类的过程像树生长一样,它从种子点开始,该种子点在eps的距离内至少具有MinPoints个点。我们沿着这些附近的点进行广度优先搜索。对于给定的点,我们检查它在半径内有多少个点。如果它的数量少于MinPoints,则此点变为叶子,我们不会继续从中增长群集。我们将其所有邻居添加到我们广度优先搜索的FIFO队列中。一旦广度优先搜索完成,我们就完成了该集群,我们永远不会重新计算其中的任何一点。我们选择一个新的任意种子点,并增长下一个集群。一直持续到所有点都已分配。DBSCAN还有一个新颖的地方,如果一个点的邻居数少于MinPoints,并且它不是另一个集群的叶节点,则它被标记为不属于任何集群的“噪声”点。噪声点被识别为选择新种子的过程的一部分 - 如果特定种子点没有足够的邻居,则将其标记为噪声点。
两个参数:eps和minpoints
DBSCAN算法主要有2个参数:
eps:两点之间的最小距离。这意味着如果两点之间的距离低于或等于该值(eps),则这些点被认为是相邻。如果选择的eps值太小,则很大一部分数据不会聚集。它将被视为异常值,因为不满足创建密集区域的点数。如果选择的值太大,则群集会被合并,这样会造成大多数对象处于同一群集中。因此应该根据数据集的距离来选择eps,一般来说eps值尽量取小一点。
minPoints:表示形成密集区域的最小点数。例如,如果我们将minPoints参数设置为5,那么我们需要至少5个点来形成密集区域。作为一般规则,minPoints可以从数据集中的多个维度(D)导出,因为minPoints≥D+ 1.对于具有噪声的数据集,较大的minPoints值通常更好,并且将形成更完美的簇。minPoints的最小值必须为3,数据集越大,对应选择的minPoints值越大。
区别于K-means
DBSCAN与K-means不同的是
Python实现
下面通过Python代码实现来帮助大家更好地理解DBSCAN的算法原理,实现的重点在于说明算法,例如距离的优化计算。详细代码可以参见Github。
Github
https://github.com/chrisjmccormick/dbscan
DBSCAN代码实现如下:
import numpy
def MyDBSCAN(D, eps, MinPts):
"""
Cluster the dataset `D` using the DBSCAN algorithm.
MyDBSCAN takes a dataset `D` (a list of vectors), a threshold distance
`eps`, and a required number of points `MinPts`.
It will return a list of cluster labels. The label -1 means noise, and then
the clusters are numbered starting from 1.
"""
# This list will hold the final cluster assignment for each point in D.
# There are two reserved values:
# -1 - Indicates a noise point
# 0 - Means the point hasn't been considered yet.
# Initially all labels are 0.
labels = [0]*len(D)
# C is the ID of the current cluster.
C = 0
# This outer loop is just responsible for picking new seed points--a point
# from which to grow a new cluster.
# Once a valid seed point is found, a new cluster is created, and the
# cluster growth is all handled by the 'expandCluster' routine.
# For each point P in the Dataset D...
# ('P' is the index of the datapoint, rather than the datapoint itself.)
for P in range(0, len(D)):
# Only points that have not already been claimed can be picked as new
# seed points.
# If the point's label is not 0, continue to the next point.
if not (labels[P] == 0):
continue
# Find all of P's neighboring points.
NeighborPts = regionQuery(D, P, eps)
# If the number is below MinPts, this point is noise.
# This is the only condition under which a point is labeled
# NOISE--when it's not a valid seed point. A NOISE point may later
# be picked up by another cluster as a boundary point (this is the only
# condition under which a cluster label can change--from NOISE to
# something else).
if len(NeighborPts) < MinPts:
labels[P] = -1
# Otherwise, if there are at least MinPts nearby, use this point as the
# seed for a new cluster.
else:
# Get the next cluster label.
C += 1
# Assing the label to our seed point.
labels[P] = C
# Grow the cluster from the seed point.
growCluster(D, labels, P, C, eps, MinPts)
# All data has been clustered!
return labels
def growCluster(D, labels, P, C, eps, MinPts):
"""
Grow a new cluster with label `C` from the seed point `P`.
This function searches through the dataset to find all points that belong
to this new cluster. When this function returns, cluster `C` is complete.
Parameters:
`D` - The dataset (a list of vectors)
`labels` - List storing the cluster labels for all dataset points
`P` - Index of the seed point for this new cluster
`C` - The label for this new cluster.
`eps` - Threshold distance
`MinPts` - Minimum required number of neighbors
"""
SearchQueue = [P]
# For each point in the queue:
# 1. Determine whether it is a branch or a leaf
# 2. For branch points, add their unclaimed neighbors to the search queue
i = 0
while i < len(SearchQueue):
# Get the next point from the queue.
P = SearchQueue[i]
# Find all the neighbors of P
NeighborPts = regionQuery(D, P, eps)
# If the number of neighbors is below the minimum, then this is a leaf
# point and we move to the next point in the queue.
if len(NeighborPts) < MinPts:
i += 1
continue
# Otherwise, we have the minimum number of neighbors, and this is a
# branch point.
# For each of the neighbors...
for Pn in NeighborPts:
# If Pn was labelled NOISE during the seed search, then we
# know it's not a branch point (it doesn't have enough
# neighbors), so make it a leaf point of cluster C and move on.
if labels[Pn] == -1:
labels[Pn] = C
# Otherwise, if Pn isn't already claimed, claim it as part of
# C and add it to the search queue.
elif labels[Pn] == 0:
# Add Pn to cluster C.
labels[Pn] = C
# Add Pn to the SearchQueue.
SearchQueue.append(Pn)
# Advance to the next point in the FIFO queue.
i += 1
# We've finished growing cluster C!
def regionQuery(D, P, eps):
"""
Find all points in dataset `D` within distance `eps` of point `P`.
This function calculates the distance between a point P and every other
point in the dataset, and then returns only those points which are within a
threshold distance `eps`.
"""
neighbors = []
# For each point in the dataset...
for Pn in range(0, len(D)):
# If the distance is below the threshold, add it to the neighbors list.
if numpy.linalg.norm(D[P] - D[Pn]) < eps:
neighbors.append(Pn)
return neighbors