K-Nearest Neighbors (kNN), Part 2

用户6021899

Published 2019-08-14 16:49:37



Included in the column: Python编程 pyqt matplotlib

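This article uses the kNN algorithm to recognize images of handwritten digits. The dataset is MNIST, a handwritten digit dataset that is often used as an introductory example for deep learning. It can be downloaded from http://yann.lecun.com/exdb/mnist/

The training set contains 60,000 samples (images with labels) and the test set contains 10,000, which is plenty for this experiment.

The dataset consists of four files: the test-set labels, the training-set labels, the test-set images, and the training-set images. All four are raw binary files. To make the walkthrough easier to follow, I converted the image data to JPG files with the code below (it has little to do with kNN and can be skipped). Every image is a 28x28-pixel grayscale image.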

import numpy as np
# This tutorial module ships with TensorFlow 1.x; it was removed in TensorFlow 2.
import tensorflow.examples.tutorials.mnist.input_data as input_data
from PIL import Image

mnist = input_data.read_data_sets(r"E:\Python36\my tensorflow\MNIST_data", one_hot=True)
# print(mnist.test.images.shape)  # shape of the test images: (10000, 784)
# print(mnist.test.labels.shape)  # shape of the test labels: (10000, 10)
N = mnist.test.images.shape[0]

for i in range(N):
    # Reshape the flat 784-element vector to 28x28 and rescale from [0, 1] to [0, 255].
    # The dtype must be uint8: int8 only holds -128..127, so bright pixels would wrap around.
    im_data = np.array(np.reshape(mnist.test.images[i], (28, 28)) * 255, dtype=np.uint8)
    # Turn the array back into a grayscale ("L" mode) image
    img = Image.fromarray(im_data, 'L')
    img.save(r'E:\Python36\MNIST picture\test\%d.jpg' % i)

The code that builds the dataset from the image folder and the binary label file is as follows:

def get_dataSet(self, imgFolder, labelFile):
    with open(labelFile, "rb") as f:
        magic = f.read(4)  # the first 4 bytes are the magic number
        n = int.from_bytes(f.read(4), byteorder='big')  # the next 4 bytes give the number of labels, i.e. samples
        labels = np.fromfile(f, dtype="u1", count=-1, sep='')  # the rest: one unsigned byte per label
    N = labels.shape[0]  # N equals n, the number of samples (60000 for the training set)
    # each image is 28x28 pixels
    dataSet = np.zeros((N, self.rows, self.columns), dtype=np.int8)  # N x 28 x 28
    # N = 3  # for debug
    for i in range(N):
        picture_path = os.path.join(imgFolder, "%d.jpg" % i)
        picture_data = matplotlib.image.imread(picture_path, "jpg")
        picture_data = self.convert()(picture_data)  # grayscale image -> binary (black/white) image
        # print(picture_data)
        dataSet[i] = picture_data
    return dataSet, labels
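
The image files use the same IDX layout as the label file, so the pixels can also be read directly, without the JPG detour. The following is a minimal sketch, not the method used in this article: `read_idx_images` is a hypothetical helper name, and it assumes the standard MNIST image header of four big-endian 32-bit integers (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel.

```python
import struct

import numpy as np


def read_idx_images(path):
    """Read an IDX image file into an (N, rows, cols) uint8 array."""
    with open(path, "rb") as f:
        # Header: magic number, image count, rows, columns (big-endian uint32 each)
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:  # 0x00000803: unsigned-byte data with 3 dimensions
            raise ValueError("unexpected magic number: %d" % magic)
        # The remainder of the file is n * rows * cols gray levels, 0..255
        pixels = np.frombuffer(f.read(n * rows * cols), dtype=np.uint8)
    return pixels.reshape(n, rows, cols)
```

Reading the raw bytes this way also avoids the small pixel changes that lossy JPG compression introduces.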

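To improve accuracy and reduce computation, the code converts each grayscale image (pixel values 0 to 255) into a binary image (pure black 0, pure white 1) using a threshold of 50. Because each of the 28x28 features then has the same range (0 to 1), this example needs no normalization.

The complete code is as follows: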
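
The thresholding step can be sketched with a plain NumPy comparison. This is an illustrative alternative to the article's `np.frompyfunc` approach, not the code used here; `binarize` is a hypothetical name, and the tiny array stands in for a real 28x28 image.

```python
import numpy as np


def binarize(gray, threshold=50):
    """Map gray levels to 0/1: pixels strictly above the threshold become 1."""
    return (np.asarray(gray) > threshold).astype(np.uint8)


gray = np.array([[0, 50, 51], [128, 200, 255]], dtype=np.uint8)
print(binarize(gray))
# [[0 0 1]
#  [1 1 1]]
```

The comparison runs in compiled code and returns a numeric array, whereas `np.frompyfunc` calls a Python lambda once per pixel and returns an array of dtype `object`.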

# kNN on MNIST data
# python version: 3.6
import os
import numpy as np
import matplotlib.image

class KNN():
    def __init__(self, rows=28, columns=28):
        # number of pixel rows and columns per image
        self.rows = rows
        self.columns = columns

    def convert(self, threshold=50):
        # threshold: gray level above which a pixel becomes 1 in the binary image
        return np.frompyfunc(lambda x: 1 if x > threshold else 0, 1, 1)
    
    def get_dataSet(self, imgFolder, labelFile):
        with open(labelFile, "rb") as f:
            magic = f.read(4)  # the first 4 bytes are the magic number
            n = int.from_bytes(f.read(4), byteorder='big')  # the next 4 bytes give the number of labels, i.e. samples
            labels = np.fromfile(f, dtype="u1", count=-1, sep='')  # the rest: one unsigned byte per label
        N = labels.shape[0]  # N equals n, the number of samples (60000 for the training set)
        # each image is 28x28 pixels
        dataSet = np.zeros((N, self.rows, self.columns), dtype=np.int8)
        # N = 3  # for debug
        for i in range(N):
            picture_path = os.path.join(imgFolder, "%d.jpg" % i)
            picture_data = matplotlib.image.imread(picture_path, "jpg")
            picture_data = self.convert()(picture_data)  # grayscale image -> binary (black/white) image
            # print(picture_data)
            dataSet[i] = picture_data
        return dataSet, labels
    
    def autoNorm(self, dataSet):
        '''Every pixel of every sample has the same range here, so no normalization is needed.'''
        pass
    
    def classify(self, X, dataSet, labels, k=3):
        # n = dataSet.shape[0]  # number of training samples
        diff = dataSet - X  # broadcasting: X (28x28) is subtracted from every sample (Nx28x28)
        sqr_diff = diff**2
        # sqrDistance = sqr_diff.sum(axis=(1, 2))  # sum of squared differences over all pixels of each sample
        # distance = sqrDistance**0.5  # Euclidean distance between X and every sample
        distance = sqr_diff.sum(axis=(1, 2))  # skipping the square root does not change the ordering
        sortedDistIndicies = distance.argsort()  # indices that sort the distances in ascending order

        classCount = {}  # vote count per class
        for i in range(k):
            voteLabel = labels[sortedDistIndicies[i]]  # label of one of the k nearest samples
            voteLabel = int(voteLabel)  # convert the NumPy integer to a plain Python int for use as a dict key
            classCount[voteLabel] = classCount.get(voteLabel, 0) + 1  # increment if present, otherwise start at 0 + 1
        # turn the dict into a list of (label, count) pairs, sorted by count in descending order
        import operator
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        # print(sortedClassCount)
        return sortedClassCount[0][0]
   
            
knn = KNN()
# training set, 60000 samples
trainSet, trainLabels = knn.get_dataSet(imgFolder=r"E:\Python36\MNIST picture\train", labelFile=r"E:\Python36\my tensorflow\MNIST_data\train-labels.idx1-ubyte")
# test set, 10000 samples
testSet, testLabels = knn.get_dataSet(imgFolder=r"E:\Python36\MNIST picture\test", labelFile=r"E:\Python36\my tensorflow\MNIST_data\t10k-labels.idx1-ubyte")

# A major drawback of kNN: the distances to all training samples must be recomputed for every new sample

# Evaluate on the test set (10000 samples):
m = 100  # because of time constraints, only the first m samples are tested
errors = 0  # number of misclassifications
for i in range(m):
    X_path = os.path.join(r"E:\Python36\MNIST picture\test", "%d.jpg" % i)
    X = matplotlib.image.imread(X_path, "jpg")
    # binarize; convert() returns an object array, so cast to int8 to keep the subtraction vectorized
    X = knn.convert()(X).astype(np.int8)
    Y_predict = knn.classify(X, trainSet, trainLabels, k=5)  # note: the result is a Python int
    Y = int(testLabels[i])
    print("Y_predict is :", Y_predict, ",   Y_true is :", Y)
    if Y_predict != Y:
        errors += 1

accuracy = (1 - errors / m)
print("accuracy: %.2f %%" % (100*accuracy))
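
A side note on the distance used in `classify`: for 0/1 pixels every squared difference is itself 0 or 1, so the sum of squared differences equals the number of pixels that differ (the Hamming distance). The sketch below checks this on random binary arrays; the arrays are made-up stand-ins for real MNIST images.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.integers(0, 2, size=(5, 28, 28), dtype=np.uint8)  # 5 random binary "images"
x = rng.integers(0, 2, size=(28, 28), dtype=np.uint8)         # one query "image"

# Squared Euclidean distance. Cast to a signed type first so that 0 - 1
# yields -1 instead of wrapping around to 255 in uint8 arithmetic.
sq_dist = ((train.astype(np.int16) - x.astype(np.int16)) ** 2).sum(axis=(1, 2))

# Hamming distance: count the pixels that differ
hamming = (train != x).sum(axis=(1, 2))

print(np.array_equal(sq_dist, hamming))  # True
```

Counting mismatches needs no subtraction or squaring, which may reduce the per-query cost on the full 60000-sample training set.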

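Because of limited time, I only tested the first 100 samples of the test set; on those, the accuracy was 98%. The result could be improved further by tuning k or the threshold used for binarization.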
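
One way to tune k is to compare accuracy for several candidate values on held-out data. The sketch below is only an illustration under stated assumptions: `knn_predict`, `accuracy`, and `make_toy_set` are hypothetical helpers, and the data is a synthetic two-class stand-in (mostly black versus mostly white images), not MNIST. With real data, the candidates would be scored on a validation split carved out of the training set, so that the test set stays untouched for the final measurement.

```python
from collections import Counter

import numpy as np


def knn_predict(x, train, labels, k):
    """Majority vote among the k training images closest to x (Hamming distance)."""
    dist = (train != x).sum(axis=(1, 2))
    nearest = np.argsort(dist, kind="stable")[:k]
    votes = Counter(int(labels[i]) for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy(samples, sample_labels, train, train_labels, k):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(knn_predict(x, train, train_labels, k) == int(y)
               for x, y in zip(samples, sample_labels))
    return hits / len(samples)


def make_toy_set(n_per_class, rng, noise=0.1):
    """Synthetic data: class 0 is mostly black, class 1 mostly white, with random pixel flips."""
    labels = np.repeat([0, 1], n_per_class)
    base = labels[:, None, None] * np.ones((1, 28, 28), dtype=np.uint8)
    flips = rng.random(base.shape) < noise
    return np.where(flips, 1 - base, base).astype(np.uint8), labels


rng = np.random.default_rng(0)
train, train_labels = make_toy_set(20, rng)
valid, valid_labels = make_toy_set(10, rng)

# Score each candidate k on the validation data and keep the best one
scores = {k: accuracy(valid, valid_labels, train, train_labels, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The same loop can be wrapped in an outer loop over binarization thresholds to tune both settings together.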


Originally published: 2019-05-11.


This article is shared from the WeChat official account Python可视化编程机器学习OpenCV.
