# [Machine Learning] Supervised Learning: KNN

The k-Nearest Neighbor (KNN) classifier is arguably the simplest machine learning algorithm. It classifies by measuring the distance between feature vectors. The idea is straightforward: if the majority of the k samples most similar to a given sample in feature space (i.e., its nearest neighbors) belong to a certain class, then the sample belongs to that class as well.

In KNN, the selected neighbors are all objects that have already been correctly labelled. The method decides the class of an unlabelled sample based only on the class(es) of its nearest neighbor or neighbors. Because KNN relies on a limited set of nearby samples rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.

The algorithm proceeds as follows:

1) Compute the distance between the query point and every point in the labelled dataset;

2) Sort the points by increasing distance;

3) Select the k points closest to the query point;

4) Count the frequency of each class among those k points;

5) Return the most frequent class among the k points as the predicted class of the query point.
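Before turning to the NumPy implementation below, the five steps can be sketched in plain Python using only the standard library; the function and variable names here are illustrative, not from the original code:

```python
import math
from collections import Counter

def knn_predict(new_point, points, labels, k):
    # Step 1: Euclidean distance from new_point to every known point
    distances = [math.dist(new_point, p) for p in points]
    # Steps 2-3: indices of the k smallest distances (ascending sort)
    nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]
    # Steps 4-5: majority vote among the k nearest labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

points = [(1.0, 0.9), (1.0, 1.0), (0.1, 0.2), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']
print(knn_predict((1.2, 1.0), points, labels, k=3))  # A
```

This is the same toy dataset used in section 2.1 below, so the results can be compared directly.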

## 2.1 Basic kNN Practice

```python
#########################################
# kNN: k Nearest Neighbors
# Input:  newInput: vector to compare to existing dataset (1xN)
#         dataSet:  size m data set of known vectors (NxM)
#         labels:   data set labels (1xM vector)
#         k:        number of neighbors to use for comparison
# Output: the most popular class label
#########################################
import numpy as np

# create a dataset which contains 4 samples with 2 classes
def createDataSet():
    # create a matrix: each row is a sample
    group = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']  # four samples and two classes
    return group, labels

# classify using kNN
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows

    ## step 1: calculate Euclidean distance
    # tile(A, reps): construct an array by repeating A reps times;
    # here newInput is copied numSamples times to match dataSet
    diff = np.tile(newInput, (numSamples, 1)) - dataSet  # element-wise subtraction
    squaredDiff = diff ** 2                 # square the differences
    squaredDist = squaredDiff.sum(axis=1)   # sum within each row
    distance = squaredDist ** 0.5

    ## step 2: sort the distances
    # argsort() returns the indices that would sort the array in ascending order
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # dictionary mapping label -> vote count
    for i in range(k):
        ## step 3: pick the k smallest distances
        voteLabel = labels[sortedDistIndices[i]]
        ## step 4: count how often each label occurs
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    ## step 5: return the label with the most votes
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex
```

```python
import kNN
import numpy as np

dataSet, labels = kNN.createDataSet()

testX = np.array([1.2, 1.0])
k = 3
outputLabel = kNN.kNNClassify(testX, dataSet, labels, k)
print("Your input is:", testX, "and classified to class:", outputLabel)

testX = np.array([0.1, 0.3])
outputLabel = kNN.kNNClassify(testX, dataSet, labels, k)
print("Your input is:", testX, "and classified to class:", outputLabel)
```

```
Your input is: [ 1.2 1.0] and classified to class: A
Your input is: [ 0.1 0.3] and classified to class: B
```
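As an aside, the `tile`-based distance computation in step 1 can also be written with NumPy broadcasting, which avoids materializing the repeated rows explicitly; this is a behavior-equivalent alternative, not the code used above:

```python
import numpy as np

dataSet = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
newInput = np.array([1.2, 1.0])

# tile-based version from the text: repeat newInput once per sample
diff_tile = np.tile(newInput, (dataSet.shape[0], 1)) - dataSet
dist_tile = np.sqrt((diff_tile ** 2).sum(axis=1))

# broadcasting version: the (2,) row is stretched across the (4, 2) matrix
dist_bcast = np.sqrt(((newInput - dataSet) ** 2).sum(axis=1))

print(np.allclose(dist_tile, dist_bcast))  # True
```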

## 2.2 Advanced kNN

```python
#########################################
# kNN: k Nearest Neighbors
# Input:  newInput: vector to compare to existing dataset (1xN)
#         dataSet:  size m data set of known vectors (NxM)
#         labels:   data set labels (1xM vector)
#         k:        number of neighbors to use for comparison
# Output: the most popular class label
#########################################
import os
import numpy as np

# classify using kNN
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows

    ## step 1: calculate Euclidean distance
    # tile(A, reps): construct an array by repeating A reps times;
    # here newInput is copied numSamples times to match dataSet
    diff = np.tile(newInput, (numSamples, 1)) - dataSet  # element-wise subtraction
    squaredDiff = diff ** 2                 # square the differences
    squaredDist = squaredDiff.sum(axis=1)   # sum within each row
    distance = squaredDist ** 0.5

    ## step 2: sort the distances
    # argsort() returns the indices that would sort the array in ascending order
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # dictionary mapping label -> vote count
    for i in range(k):
        ## step 3: pick the k smallest distances
        voteLabel = labels[sortedDistIndices[i]]
        ## step 4: count how often each label occurs
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    ## step 5: return the label with the most votes
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex

# convert a 32x32 text image to a 1x1024 vector
def img2vector(filename):
    rows = 32
    cols = 32
    imgVector = np.zeros((1, rows * cols))
    with open(filename) as fileIn:
        for row in range(rows):
            lineStr = fileIn.readline()
            for col in range(cols):
                imgVector[0, row * 32 + col] = int(lineStr[col])
    return imgVector

# load the handwritten-digit dataset
def loadDataSet():
    ## step 1: getting training set
    print("---Getting training set...")
    dataSetDir = 'E:/Python/Machine Learning in Action/'
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits')  # load the training set
    numSamples = len(trainingFileList)
    train_x = np.zeros((numSamples, 1024))
    train_y = []
    for i in range(numSamples):
        filename = trainingFileList[i]
        # get train_x
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)
        # get the label from a file name such as "1_18.txt"
        label = int(filename.split('_')[0])  # returns 1
        train_y.append(label)

    ## step 2: getting testing set
    print("---Getting testing set...")
    testingFileList = os.listdir(dataSetDir + 'testDigits')  # load the testing set
    numSamples = len(testingFileList)
    test_x = np.zeros((numSamples, 1024))
    test_y = []
    for i in range(numSamples):
        filename = testingFileList[i]
        # get test_x
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)
        # get the label from a file name such as "1_18.txt"
        label = int(filename.split('_')[0])  # returns 1
        test_y.append(label)

    return train_x, train_y, test_x, test_y

# test handwriting classification
def testHandWritingClass():
    ## step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()

    ## step 2: training (kNN has no explicit training phase)
    print("step 2: training...")
    pass

    ## step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    for i in range(numTestSamples):
        predict = kNNClassify(test_x[i], train_x, train_y, 3)
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    ## step 4: show the result
    print("step 4: show the result...")
    print('The classify accuracy is: %.2f%%' % (accuracy * 100))
```
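The `img2vector` helper and the filename-based labelling can be exercised without the book's dataset by writing a synthetic 32x32 digit file; the file name `7_05.txt` below is made up to match the dataset's `<label>_<index>.txt` convention:

```python
import os
import tempfile
import numpy as np

def img2vector(filename):
    # flatten a 32x32 text image of '0'/'1' characters into a 1x1024 vector
    imgVector = np.zeros((1, 1024))
    with open(filename) as fileIn:
        for row in range(32):
            lineStr = fileIn.readline()
            for col in range(32):
                imgVector[0, row * 32 + col] = int(lineStr[col])
    return imgVector

# build a synthetic digit file: 32 lines of 32 '1' characters
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, '7_05.txt')
with open(path, 'w') as f:
    for _ in range(32):
        f.write('1' * 32 + '\n')

vec = img2vector(path)
label = int(os.path.basename(path).split('_')[0])  # label parsed from the file name
print(vec.shape, int(vec.sum()), label)  # (1, 1024) 1024 7
os.remove(path)
```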

```python
import kNN
kNN.testHandWritingClass()
```

```
step 1: load data...
---Getting training set...
---Getting testing set...
step 2: training...
step 3: testing...
step 4: show the result...
The classify accuracy is: 98.84%
```
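Finally, the per-sample loop in `testHandWritingClass` can in principle be replaced by a fully vectorized prediction that computes the whole test-versus-train distance matrix at once via broadcasting. This sketch assumes datasets small enough for the matrix to fit in memory and integer labels; it is an optimization idea, not part of the original code:

```python
import numpy as np

def knn_predict_batch(test_x, train_x, train_y, k=3):
    # (numTest, 1, d) - (1, numTrain, d) broadcasts to (numTest, numTrain, d)
    diff = test_x[:, np.newaxis, :] - train_x[np.newaxis, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))        # (numTest, numTrain)
    nearest = np.argsort(dist, axis=1)[:, :k]      # k nearest indices per test row
    votes = np.asarray(train_y)[nearest]           # (numTest, k) label matrix
    # majority vote per row (labels must be non-negative integers for bincount)
    return np.array([np.bincount(row).argmax() for row in votes])

# toy check: two obvious clusters labelled 0 and 1
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
train_y = [0, 0, 1, 1]
test_x = np.array([[0.05, 0.05], [0.95, 0.95]])
print(knn_predict_batch(test_x, train_x, train_y, k=3))  # [0 1]
```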
