# The DBSCAN Algorithm

## DBSCAN

### What is DBSCAN?

DBSCAN is a clustering algorithm that partitions data samples into groups without any prior knowledge of the samples' labels; that is, it is an unsupervised clustering algorithm. Wikipedia defines it as follows:

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

## The Idea Behind DBSCAN

### Birds of a feather flock together

$$P(n \mid \theta = 0.8) = C_{80}^{n}\,0.8^{n}\,0.2^{80-n}$$

This is the binomial distribution for $n$ successes in 80 Bernoulli trials with success probability $\theta = 0.8$. For a large number of trials it is well approximated by the normal distribution:

$$N(\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
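As a quick numerical check (my addition, not part of the original post): with $n = 80$ trials and $\theta = 0.8$, the approximating normal has $\mu = n\theta = 64$ and $\sigma^2 = n\theta(1-\theta) = 12.8$, and the two densities agree closely near the mean:

```python
import math

n, theta = 80, 0.8

def binom_pmf(k):
    # P(k | theta) = C(n, k) * theta^k * (1 - theta)^(n - k)
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2)
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

mu = n * theta                               # 64.0
sigma = math.sqrt(n * theta * (1 - theta))   # ~3.58

for k in (58, 64, 70):
    print(k, round(binom_pmf(k), 4), round(normal_pdf(k, mu, sigma), 4))
```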

The parameters minPts and ε can be set by a domain expert, if the data is well understood.
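When no domain expert is available, a common heuristic (my addition, not mentioned in the original post) is the k-distance plot: compute each point's distance to its minPts-th nearest neighbor, sort the values in ascending order, and pick ε near the "elbow" where the curve rises sharply. A minimal sketch with plain NumPy:

```python
import numpy as np

def k_distances(X, k):
    """Sorted distances from each point to its k-th nearest neighbor (excluding itself)."""
    # Pairwise Euclidean distances via broadcasting: shape (n, n)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort each row; column 0 is the point itself (distance 0),
    # so column k is the k-th nearest neighbor
    dist.sort(axis=1)
    return np.sort(dist[:, k])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
kd = k_distances(X, k=10)
# A sharp rise near the tail of kd suggests a value for eps
print(kd[:5], kd[-5:])
```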

### The Relationship Between Feature Dimensions and Classes

The y-axis represents how much time a student spends studying: the higher the value, the more study time. With this second feature, the data vaguely falls into four groups in feature space, and indeed the points can be separated into four classes:

1. Red dots: study hard but only get mediocre grades, so: not bright.
2. Green dots: do not study much and get mediocre grades, so: averagely bright.
3. Blue crosses: study hard and get excellent grades, so: averagely bright.
4. Yellow crosses: do not study much yet get excellent grades, so: very bright.

## Code Time

```python
import numpy as np
import matplotlib.pyplot as plt

# Students from school A: exam scores
mu1, sigma1 = 70, 4.2
x1 = mu1 + sigma1 * np.random.randn(400)

iq_mu1, iq_sigma1 = 120, 15  # long study time
iq_mu2, iq_sigma2 = 60, 10   # short study time

tmp1 = iq_mu1 + iq_sigma1 * np.random.randn(350)
tmp2 = iq_mu2 + iq_sigma2 * np.random.randn(50)

y1 = np.append(tmp1, tmp2)

# Students from school B: exam scores
mu2, sigma2 = 90, 2.1
x2 = mu2 + sigma2 * np.random.randn(300)

tmp3 = iq_mu1 + iq_sigma1 * np.random.randn(50)
tmp4 = iq_mu2 + iq_sigma2 * np.random.randn(250)

y2 = np.append(tmp3, tmp4)

# Stack the features horizontally: each row is (score, study time)
dataSet1 = np.column_stack((x1, y1))
dataSet2 = np.column_stack((x2, y2))

dataSet = np.vstack((dataSet1, dataSet2))
```

```python
# The DBSCAN algorithm
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=5, min_samples=10).fit(dataSet)
labels = db.labels_

# Number of clusters in labels, ignoring noise (-1) if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
```
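DBSCAN marks noise with the label `-1`; every other value is a 0-based cluster id, which is why the count above subtracts one when `-1` is present. A minimal illustration with a hand-made label array (hypothetical values, not the output of the run above):

```python
import numpy as np

labels = np.array([0, 0, 1, -1, 1, 0, -1])  # hypothetical DBSCAN output

# Two distinct cluster ids (0 and 1) plus the noise label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

print(n_clusters, n_noise)  # 2 2
```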

```python
# Black removed and is used for noise instead.
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # Core samples: large markers
    xy = dataSet[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)

    # Non-core samples (border and noise): small markers
    xy = dataSet[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
```

```python
# Function to calculate the Euclidean distance between two 2-D points
def calDistance(p1, p2):
    return ((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2) ** 0.5
```
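As a side note (my addition, not in the original post): calling `calDistance` inside a double Python loop is O(n²) interpreter overhead; the same all-pairs distance matrix can be computed in one vectorized NumPy expression:

```python
import numpy as np

def pairwise_distances(X):
    """All pairwise Euclidean distances for an (n, 2) array, as an (n, n) matrix."""
    diff = X[:, None, :] - X[None, :, :]   # broadcast to (n, n, 2)
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_distances(X)
print(D[0, 1], D[0, 2])  # 5.0 10.0
```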

```python
from collections import defaultdict

# epsilon, MinPts
epsi = 10
minPts = 8

# Find the core points
other_points = []
core_points = []
plotted_points = []
all_points = []

# Convert the array rows to lists so a label slot can be appended
for point in dataSet:
    all_points.append(list(point))

# Find core points
for point in all_points:
    point.append(0)  # third element: cluster label, 0 = unassigned
    total = 0

    # Count neighbors within epsi (the point itself is included)
    for otherPoint in all_points:
        distance = calDistance(otherPoint, point)
        if distance <= epsi:
            total += 1

    if total > minPts:
        core_points.append(point)
        plotted_points.append(point)
    else:
        other_points.append(point)

# Find border points: non-core points within epsi of some core point
border_points = []

for core in core_points:
    for other in other_points:
        # The "not in" check avoids adding the same border point twice
        if calDistance(core, other) <= epsi and other not in border_points:
            border_points.append(other)
            plotted_points.append(other)

# Implement the labeling step
cluster_label = 0

# Core points are only marked here; clusters form as the labels spread
for point in core_points:
    if point[2] == 0:  # not visited yet
        cluster_label += 1
        point[2] = cluster_label  # start a new cluster

    # plotted_points contains both core_points and border_points
    for point2 in plotted_points:
        distance = calDistance(point2, point)
        # Unassigned points within epsi of this core point join its cluster
        if point2[2] == 0 and distance <= epsi:
            point2[2] = point[2]

# After the points are assigned their corresponding labels, group them
# cluster_list maps label -> ([x coords], [y coords])
cluster_list = defaultdict(lambda: [[], []])
for point in plotted_points:
    cluster_list[point[2]][0].append(point[0])
    cluster_list[point[2]][1].append(point[1])

markers = ['+', '*', '.', 'd', '^', 'v', '>', '<', 'p']

# Plot each cluster with its own marker
i = 0
for value in cluster_list:
    cluster = cluster_list[value]
    plt.plot(cluster[0], cluster[1], markers[i])
    i = i % 8 + 1

# Plot the noise points as well
noise_points = []

for point in all_points:
    if point not in core_points and point not in border_points:
        noise_points.append(point)

noisex = []
noisey = []
for point in noise_points:
    noisex.append(point[0])
    noisey.append(point[1])
plt.plot(noisex, noisey, "x")

plt.axis((50, 100, 0, 160))
plt.show()
```
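Note that the labeling step above makes a single pass over the core points, whereas textbook DBSCAN grows each cluster transitively with a queue: any neighbor that is itself a core point keeps expanding the cluster. A self-contained sketch of that expansion (my simplified version, not code from the post), run on a tiny hand-made dataset:

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (-1 = noise)."""
    def neighbors(i):
        # Indices within eps of point i (including i itself)
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core point -> border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(nbrs)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
print(dbscan(pts, eps=2, min_pts=3))  # -> [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The queue guarantees that density-reachable core points end up in the same cluster even if they are more than eps apart, which the single-pass version cannot always ensure.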
